Focus On Oracle

Installing, Backup & Recovery, Performance Tuning,
Troubleshooting, Upgrading, Patching

Oracle Engineered System

当前位置: 首页 » 技术文章 » 开源之美

Hadoop Ecosystem


Distributed Filesystem
Apache HDFS The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS was derived from Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With Zookeeper the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. 1. 
2. Google FileSystem - GFS Paper 
3. Cloudera Why HDFS 
4. Hortonworks Why HDFS
Red Hat GlusterFS GlusterFS is a scale-out network-attached storage file system. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was announced as a commercially-supported integration of GlusterFS with Red Hat Enterprise Linux. Gluster File System, known now as Red Hat Storage Server. 1. 
2. Red Hat Hadoop Plugin
Quantcast File System QFS QFS is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop’s HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters. It is written in C++ and has fixed-footprint memory management. QFS uses Reed-Solomon error correction as method for assuring reliable access to data.
Reed–Solomon coding is very widely used in mass storage systems to correct the burst errors associated with media defects. Rather than storing three full versions of each file like HDFS, resulting in the need for three times more storage, QFS only needs 1.5x the raw capacity because it stripes data across nine different disk drives.
1. QFS site 
2. GitHub QFS 
3. HADOOP-8885
Ceph Filesystem Ceph is a free software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph's main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely-available. The data is replicated, making it fault tolerant. 1. Ceph Filesystem site 
2. Ceph and Hadoop 
3. HADOOP-6253
Lustre file system The Lustre filesystem is a high-performance distributed filesystem intended for larger network and high-availability environments. Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.
With Hadoop HDFS the software needs a dedicated cluster of computers on which to run. But folks who run high performance computing clusters for other purposes often don't run HDFS, which leaves them with a bunch of computing power, tasks that could almost certainly benefit from a bit of map reduce and no way to put that power to work running Hadoop. Intel's noticed this and, in version 2.5 of its Hadoop distribution that it quietly released last week, has added support for Lustre: the Intel® HPC Distribution for Apache Hadoop* Software, a new product that combines Intel Distribution for Apache Hadoop software with Intel® Enterprise Edition for Lustre software. This is the only distribution of Apache Hadoop that is integrated with Lustre, the parallel file system used by many of the world's fastest supercomputers
2. Hadoop with Lustre 
3. Intel HPC Hadoop
Alluxio Alluxio, the world’s first memory-centric virtual distributed storage system, unifies data access and bridges computation frameworks and underlying storage systems. Applications only need to connect with Alluxio to access data stored in any underlying storage systems. Additionally, Alluxio’s memory-centric architecture enables data access orders of magnitude faster than existing solutions. 
In big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS. Alluxio brings significant performance improvement to the stack; for example, Baidu uses Alluxio to improve their data analytics performance by 30 times. Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems. Users can run Alluxio using its standalone cluster mode, for example on Amazon EC2, or launch Alluxio with Apache Mesos or Apache Yarn. 
Alluxio is Hadoop compatible. This means that existing Spark and MapReduce programs can run on top of Alluxio without any code changes. The project is open source (Apache License 2.0) and is deployed at multiple companies. It is one of the fastest growing open source projects. With less than three years open source history, Alluxio has attracted more than 160 contributors from over 50 institutions, including Alibaba, Alluxio, Baidu, CMU, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.
1. Alluxio site
GridGain GridGain is open source project licensed under Apache 2.0. One of the main pieces of this platform is the In-Memory Apache Hadoop Accelerator which aims to accelerate HDFS and Map/Reduce by bringing both, data and computations into memory. This work is done with the GGFS - Hadoop compliant in-memory file system. For I/O intensive jobs GridGain GGFS offers performance close to 100x faster than standard HDFS. Paraphrasing Dmitriy Setrakyan from GridGain Systems talking about GGFS regarding Tachyon:
  • GGFS allows read-through and write-through to/from underlying HDFS or any other Hadoop compliant file system with zero code change. Essentially GGFS entirely removes ETL step from integration.
  • GGFS has ability to pick and choose what folders stay in memory, what folders stay on disc, and what folders get synchronized with underlying (HD)FS either synchronously or asynchronously.
  • GridGain is working on adding native MapReduce component which will provide native complete Hadoop integration without changes in API, like Spark currently forces you to do. Essentially GridGain MR+GGFS will allow to bring Hadoop completely or partially in-memory in Plug-n-Play fashion without any API changes.
1. GridGain site
XtreemFS XtreemFS is a general purpose storage system and covers most storage needs in a single deployment. It is open-source, requires no special hardware or kernel modules, and can be mounted on Linux, Windows and OS X. XtreemFS runs distributed and offers resilience through replication. XtreemFS Volumes can be accessed through a FUSE component, that offers normal file interaction with POSIX like semantics. Furthermore an implementation of Hadoops FileSystem interface is included which makes XtreemFS available for use with Hadoop, Flink and Spark out of the box. XtreemFS is licensed under the New BSD license. The XtreemFS project is developed by Zuse Institute Berlin. The development of the project is funded by the European Commission since 2006 under Grant Agreements No. FP6-033576, FP7-ICT-257438, and FP7-318521, as well as the German projects MoSGrid, "First We Take Berlin", FFMK, GeoMultiSens, and BBDC. 1. XtreemFS site 2. Flink on XtreemFS . Spark XtreemFS
Distributed Programming
Apache Ignite Apache Ignite In-Memory Data Fabric is a distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It includes a distributed key-value in-memory store, SQL capabilities, map-reduce and other computations, distributed data structures, continuous queries, messaging and events subsystems, Hadoop and Spark integration. Ignite is built in Java and provides .NET and C++ APIs. 1. Apache Ignite 
2. Apache Ignite documentation
Apache MapReduce MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google MapReduce: Simplified Data Processing on Large Clusters paper. The current Apache MapReduce version is built over Apache YARN Framework. YARN stands for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN’s execution model is more generic than the earlier MapReduce implementation. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing. 1. Apache MapReduce 
2. Google MapReduce paper 
3. Writing YARN applications
Apache Pig Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce.
Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.
2.Pig examples by Alan Gates
JAQL JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within JAQL" capability lets programmers work with structured SQL data while employing a JSON data model that's less restrictive than its Structured Query Language counterparts.
Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive. Jaql’s query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. 
JAQL was created by workers at IBM Research Labs in 2008 and released to open source. While it continues to be hosted as a project on Google Code, where a downloadable version is available under an Apache 2.0 license, the major development activity around JAQL has remained centered at IBM. The company offers the query language as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a workflow orchestrator, JAQL is used in BigInsights to exchange data between storage, processing and analytics jobs. It also provides links to external data and services, including relational databases and machine learning data.
1. JAQL in Google Code 
2. What is Jaql? by IBM
Apache Spark Data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier to use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous generation systems like Hadoop MapReduce for certain applications.
Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets.
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
1. Apache Spark 
2. Mirror of Spark on Github 
3. RDDs - Paper 
4. Spark: Cluster Computing... - Paper 
Spark Research
Apache Storm Storm is a complex event processor (CEP) and distributed computation framework written predominantly in the Clojure programming language. Is a distributed real-time computation system for processing fast, large streams of data. Storm is an architecture based on master-workers paradigma. So a Storm cluster mainly consists of a master and worker nodes, with coordination done by Zookeeper. 
Storm makes use of zeromq (0mq, zeromq), an advanced, embeddable networking library. It provides a message queue, but unlike message-oriented middleware (MOM), a 0MQ system can run without a dedicated message broker. The library is designed to have a familiar socket-style API.
Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. Storm was initially developed and deployed at BackType in 2011. After 7 months of development BackType was acquired by Twitter in July 2011. Storm was open sourced in September 2011. 
Hortonworks is developing a Storm-on-YARN version and plans finish the base-level integration in 2013 Q4. This is the plan from Hortonworks. Yahoo/Hortonworks also plans to move Storm-on-YARN code from to be a subproject of Apache Storm project in the near future.
Twitter has recently released a Hadoop-Storm Hybrid called “Summingbird.” Summingbird fuses the two frameworks into one, allowing for developers to use Storm for short-term processing and Hadoop for deep data dives,. a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system.
1. Storm Project/ 
2. Storm-on-YARN
Apache Flink Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
Flink is a data processing system and an alternative to Hadoop's MapReduce component. It comes with its own runtime, rather than building on top of MapReduce. As such, it can work completely independently of the Hadoop ecosystem. However, Flink can also access Hadoop's distributed file system (HDFS) to read and write data, and Hadoop's next-generation resource manager (YARN) to provision cluster resources. Since most Flink users are using Hadoop HDFS to store their data, it ships already the required libraries to access HDFS.
1. Apache Flink incubator page 
2. Stratosphere site
Apache Apex Apache Apex is an enterprise grade Apache YARN based big data-in-motion platform that unifies stream processing as well as batch processing. It processes big data in-motion in a highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and an easily operable way. It provides a simple API that enables users to write or re-use generic Java code, thereby lowering the expertise needed to write big data applications.

The Apache Apex platform is supplemented by Apache Apex-Malhar, which is a library of operators that implement common business logic functions needed by customers who want to quickly develop applications. These operators provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySql, Cassandra, MongoDB, Redis, HBase, CouchDB and other databases along with JDBC connectors. The library also includes a host of other common business logic patterns that help users to significantly reduce the time it takes to go into production. Ease of integration with all other big data technologies is one of the primary missions of Apache Apex-Malhar.

Apex, available on GitHub, is the core technology upon which DataTorrent's commercial offering, DataTorrent RTS 3, along with other technology such as a data ingestion tool called dtIngest, are based.

1. Apache Apex from DataTorrent
2. Apache Apex main page 
3. Apache Apex Proposal
Netflix PigPen PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is dialect of the Lisp programming language created by Rich Hickey, so is a functional general-purpose language, and runs on the Java Virtual Machine, Common Language Runtime, and JavaScript engines. In PigPen there are no special user defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program. This tool is open sourced by Netflix, Inc. the American provider of on-demand Internet streaming media. 1. PigPen on GitHub
AMPLab SIMR Apache Spark was developed thinking in Apache YARN. However, up to now, it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scala installed on any of the nodes. 1. SIMR on GitHub
Facebook Corona “The next version of Map-Reduce" from Facebook, based in own fork of Hadoop. The current Hadoop implementation of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets. The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook engineers looked at but discounted because of the highly-customised nature of the company's deployment of Hadoop and HDFS. Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case). 1. Corona on Github
Apache REEF Apache REEF™ (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop™ YARN or Apache Mesos™. Apache REEF drastically simplifies development of those resource managers through the following features:
  • Centralized Control Flow: Apache REEF turns the chaos of a distributed application into events in a single machine, the Job Driver. Events include container allocation, Task launch, completion and failure. For failures, Apache REEF makes every effort of making the actual `Exception` thrown by the Task available to the Driver.
  • Task runtime: Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in every container of a REEF application. Evaluators can keep data in memory in between Tasks, which enables efficient pipelines on REEF.
  • Support for multiple resource managers: Apache REEF applications are portable to any supported resource manager with minimal effort. Further, new resource managers are easy to support in REEF.
  • .NET and Java API: Apache REEF is the only API to write YARN or Mesos applications in .NET. Further, a single REEF application is free to mix and match Tasks written for .NET or Java.
  • Plugins: Apache REEF allows for plugins (called "Services") to augment its feature set without adding bloat to the core. REEF includes many Services, such as a name-based communications between Tasks MPI-inspired group communications (Broadcast, Reduce, Gather, ...) and data ingress.
1. Apache REEF Website
Apache Twill Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic. Twill uses a simple thread-based model that Java programmers will find familiar. YARN can be viewed as a compute fabric of a cluster, which means YARN applications like Twill will run on any Hadoop 2 cluster.
YARN is an open source application that allows the Hadoop cluster to turn into a collection of virtual machines. Weave, developed by Continuuity and initially housed on Github, is a complementary open source application that uses a programming model similar to Java threads, making it easy to write distributed applications. In order to remove a conflict with a similarly named project on Apache, called "Weaver," Weave's name changed to Twill when it moved to Apache incubation.
Twill functions as a scaled-out proxy. Twill is a middleware layer in between YARN and any application on YARN. When you develop a Twill app, Twill handles APIs in YARN that resemble a multi-threaded application familiar to Java. It is very easy to build multi-processed distributed applications in Twill.
1. Apache Twill Incubator
Damballa Parkour Library for develop MapReduce programs using the LISP like language Clojure. Parkour aims to provide deep Clojure integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete access to absolutely everything possible in raw Java Hadoop MapReduce. 1. Parkour GitHub Project
Apache Hama Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce. Many data analysis techniques such as machine learning and graph algorithms require iterative computations, this is where Bulk Synchronous Parallel model can be more effective than "plain" MapReduce. 1. Hama site
Datasalt Pangool A new MapReduce paradigm. A new API for MR jobs, in higher level than Java. 1.Pangool 
2.GitHub Pangool
Apache Tez Tez is a proposal to develop a generic application which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN. Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end-users – in fact it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as Machine Learning, which do not fit will into the MapReduce paradigm. Tez helps Hadoop address these use cases. Tez framework constitutes part of Stinger initiative (a low latency based SQL type query interface for Hadoop based on Hive). 1. Apache Tez Incubator 
2. Hortonworks Apache Tez page
Apache DataFu DataFu provides a collection of Hadoop MapReduce jobs and functions in higher level languages based on it to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles, sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop jobs for incremental data processing in MapReduce. DataFu is a collection of Pig UDFs (including PageRank, sessionization, set operations, sampling, and much more) that were originally developed at LinkedIn. 1. DataFu Apache Incubator
Pydoop Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++ Pipes and the C libhdfs APIs, that allows to write full-fledged MapReduce applications with HDFS access. Pydoop has several advantages over Hadoop’s built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it allows you to access all standard library and third party modules, some of which may not be available. 1. SF Pydoop site 
2. Pydoop GitHub Project
Kangaroo Open-source project from Conductor for writing MapReduce jobs consuming data from Kafka. The introductory post explains Conductor’s use case—loading data from Kafka to HBase by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions which are limited to a single InputSplit per Kafka partition, Kangaroo can launch multiple consumers at different offsets in the stream of a single partition for increased throughput and parallelism. 1. Kangaroo Introduction 
2. Kangaroo GitHub Project
TinkerPop Graph computing framework written in Java. Provides a core API that graph system vendors can implement. There are various types of graph systems including in-memory graph libraries, OLTP graph databases, and OLAP graph processors. Once the core interfaces are implemented, the underlying graph system can be queried using the graph traversal language Gremlin and processed with TinkerPop-enabled algorithms. For many, TinkerPop is seen as the JDBC of the graph computing community. 1. Apache Tinkerpop Proposal 
2. TinkerPop site
Pachyderm MapReduce Pachyderm is a completely new MapReduce engine built on top Docker and CoreOS. In Pachyderm MapReduce (PMR) a job is an HTTP server inside a Docker container (a microservice). You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. Data is POSTed to the container over HTTP and the results are stored back in the file system. You can implement the web server in any language you want and pull in any library. Pachyderm also creates a DAG for all the jobs in the system and their dependencies and it automatically schedules the pipeline such that each job isn’t run until it’s dependencies have completed. Everything in Pachyderm “speaks in diffs” so it knows exactly which data has changed and which subsets of the pipeline need to be rerun. CoreOS is an open source lightweight operating system based on Chrome OS, actually CoreOS is a fork of Chrome OS. CoreOS provides only the minimal functionality required for deploying applications inside software containers, together with built-in mechanisms for service discovery and configuration sharing 1. Pachyderm site 
2. Pachyderm introduction article
Apache Beam Apache Beam is an open source, unified model for defining and executing data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and Millwheel. This model was originally known as the “Dataflow Model” and first implemented as Google Cloud Dataflow, including a Java SDK on GitHub for writing pipelines and fully managed service for executing them on Google Cloud Platform.

In January 2016, Google and a number of partners submitted the Dataflow Programming Model and SDKs portion as an Apache Incubator Proposal, under the name Apache Beam (unified Batch + strEAM processing).

1. Apache Beam Proposal 
2. DataFlow Beam and Spark Comparasion
NoSQL Databases
Column Data Model
Apache HBase Google BigTable Inspired. Non-relational distributed database. Ramdom, real-time r/w operations in column-oriented very large tables (BDDB: Big Data Data Base). It’s the backing system for MR jobs outputs. It’s the Hadoop database. It’s for backing Hadoop MapReduce jobs with Apache HBase tables 1. Apache HBase Home 
2. Mirror of HBase on Github
Apache Cassandra Distributed Non-SQL DBMS, it’s a BDDB. MR can retrieve data from Cassandra. This BDDB can run without HDFS, or on-top of HDFS (DataStax fork of Cassandra). HBase and its required supporting systems are derived from what is known of the original Google BigTable and Google File System designs (as known from the Google File System paper Google published in 2003, and the BigTable paper published in 2006). Cassandra on the other hand is a recent open source fork of a standalone database system initially coded by Facebook, which while implementing the BigTable data model, uses a system inspired by Amazon’s Dynamo for storing data (in fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon). 1. Apache HBase Home 
2. Cassandra on GitHub 
3. Training Resources 
4. Cassandra - Paper
Hypertable Database system inspired by publications on the design of Google's BigTable. The project is based on experience of engineers who were solving large-scale data-intensive tasks for many years. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. Sposored by Baidu the Chinese search engine. TODO
Apache Accumulo Distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Accumulo is software created by the NSA with security features. 1. Apache Accumulo Home
Apache Kudu Distributed, columnar, relational data store optimized for analytical use cases requiring very fast reads with competitive write speeds.
  • Relational data model (tables) with strongly-typed columns and a fast, online alter table operation.
  • Scale-out and sharded with support for partitioning based on key ranges and/or hashing.
  • Fault-tolerant and consistent due to its implementation of Raft consensus.
  • Supported by Apache Impala and Apache Drill, enabling fast SQL reads and writes through those systems.
  • Integrates with MapReduce and Spark.
  • Additionally provides "NoSQL" APIs in Java, Python, and C++.
1. Apache Kudu Home
2. Kudu on Github
3. Kudu technical whitepaper (pdf)
Apache Parquet Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. 1. Apache Parquet Home
2. Apache Parquet on Github
Document Data Model
MongoDB Document-oriented database system. It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a "classical" relational database, MongoDB stores structured data as JSON-like documents 1. Mongodb site
RethinkDB RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to setup and learn. 1. RethinkDB site
ArangoDB An open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript extensions. 1. ArangoDB site
Stream Data Model
EventStore An open-source, functional database with support for Complex Event Processing. It provides a persistence engine for applications using event-sourcing, or for storing time-series data. Event Store is written in C#, C++ for the server which runs on Mono or the .NET CLR, on Linux or Windows. Applications using Event Store can be written in JavaScript. Event sourcing (ES) is a way of persisting your application's state by storing the history that determines the current state of your application. 1. EventStore site
Key-Value Data Model
Redis DataBase Redis is an open-source, networked, in-memory, data structures store with optional durability. It is written in ANSI C. In its outer layer, the Redis data model is a dictionary which maps keys to values. One of the main differences between Redis and other structured storage systems is that Redis supports not only strings, but also abstract data types. Sponsored by Redis Labs. It’s BSD licensed. 1. Redis site 
2. Redis Labs site
Linkedin Voldemort Distributed data store that is designed as a key-value store used by LinkedIn for high-scalability storage. 1. Voldemort site
RocksDB RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads. 1. RocksDB site
OpenTSDB OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable. 1. OpenTSDB site
Graph Data Model
ArangoDB An open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript extensions. 1. ArangoDB site
Neo4j An open-source graph database writting entirely in Java. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. 1. Neo4j site
TitanDB TitanDB is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users. 1. Titan site
NewSQL Databases
TokuDB TokuDB is a storage engine for MySQL and MariaDB that is specifically designed for high performance on write-intensive workloads. It achieves this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC compliant storage engine. TokuDB is one of the technologies that enable Big Data in MySQL. 1. Percona TokuDB site
HandlerSocket HandlerSocket is a NoSQL plugin for MySQL/MariaDB (the storage engine of MySQL). It works as a daemon inside the mysqld process, accepting TCP connections, and executing requests from clients. HandlerSocket does not support SQL queries. Instead, it supports simple CRUD operations on tables. HandlerSocket can be much faster than mysqld/libmysql in some cases because it has lower CPU, disk, and network overhead. TODO
Akiban Server Akiban Server is an open source database that brings document stores and relational databases together. Developers get powerful document access alongside surprisingly powerful SQL. TODO
Drizzle Drizzle is a re-designed version of the MySQL v6.0 codebase and is designed around a central concept of having a microkernel architecture. Features such as the query cache and authentication system are now plugins to the database, which follow the general theme of "pluggable storage engines" that were introduced in MySQL 5.1. It supports PAM, LDAP, and HTTP AUTH for authentication via plugins it ships. Via its plugin system it currently supports logging to files, syslog, and remote services such as RabbitMQ and Gearman. Drizzle is an ACID-compliant relation
关键词:bigdata Hadoop open 


cmake(Write once, run everywhere)
Hadoop Ecosystem
AI Open platform h2o