Hadoop is an Apache project combining a MapReduce engine with the Hadoop Distributed File System (HDFS).
HDFS lets the local disks of every node in a Hadoop cluster operate as a single pool of storage.
Files are replicated across nodes (by default the original plus 2 copies, i.e. 3 replicas in total).
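A minimal sketch of this replication setting through the HDFS Java API, assuming the cluster's core-site.xml/hdfs-site.xml are on the classpath; the file path /data/example.txt is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        // dfs.replication controls how many copies of each block HDFS keeps (default 3).
        System.out.println("Configured replication: " + conf.get("dfs.replication", "3"));

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");      // hypothetical file on the cluster
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Replication of " + file + ": " + current);

        fs.setReplication(file, (short) 2);             // override the replica count for this one file
        fs.close();
    }
}
```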
Hadoop stack (from bottom to top)
- MapReduce + HDFS
- Database: HBase (NoSQL database). HBase tables are stored as HDFS files. Optionally, an HBase table can be the input to a MapReduce job, or a MapReduce job's output can create a new HBase table (see the HBase sketch after this list).
- Query: HiveQL (a SQL abstraction layer over MapReduce) + Pig Latin (each of its commands corresponds to a different SQL operation; used for querying and for stepwise data transformation as an ETL tool). A HiveQL example also follows this list.
- RDBMS Import/Export: Sqoop (a companion component to Hive and Pig that moves data between Hadoop and any RDBMS)
- Machine Learning / Data Mining: Mahout
- Log file integration: Flume
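To illustrate the HBase point above (an HBase table serving as a MapReduce input), here is a minimal sketch that wires a table scan into a mapper via TableMapReduceUtil. The table name "pages" and the output path are hypothetical assumptions, and the reduce side is omitted for brevity.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseRowCount {

    // Mapper that reads HBase rows (row key + Result) and emits one count per row.
    static class RowMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("rows"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "hbase-row-count");
        job.setJarByClass(HBaseRowCount.class);

        // Wire the HBase table "pages" (hypothetical name) in as the job's input.
        TableMapReduceUtil.initTableMapperJob(
                "pages", new Scan(), RowMapper.class,
                Text.class, IntWritable.class, job);

        // No reducers: mapper output is written straight to HDFS.
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path("/out/hbase-count"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```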
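And for the query layer, a minimal sketch of submitting HiveQL from Java over JDBC (HiveServer2). The host, user, and the access_log table are hypothetical; the point is that the statement is ordinary SQL, which Hive compiles into batch MapReduce jobs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Ordinary SQL, but Hive executes it as a batch MapReduce job.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM access_log GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```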
EMR (Elastic MapReduce) is a Hadoop distro from AWS
It shares with the stack above: MapReduce and HDFS, the HBase database, and Hive and Pig.
It adds an MPP / column-store engine: Impala (from Cloudera).
Hive and Pig are abstraction layers over MapReduce; Hive is a batch system.
Impala is an abstraction layer directly over HDFS; it is an interactive query engine.
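As a contrast with the Hive sketch above, the same JDBC pattern can be pointed at Impala's HiveServer2-compatible port (21050) using the Hive JDBC driver; the ";auth=noSasl" suffix assumes an unsecured cluster, and the host and table names are hypothetical. The query then runs in Impala's own MPP engine and returns interactively rather than launching a MapReduce job.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // same driver as the Hive example

        // Port 21050 is Impala's HiveServer2-compatible endpoint; ";auth=noSasl"
        // assumes no Kerberos/SASL on the cluster (an assumption, not a given).
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/default;auth=noSasl");
             Statement stmt = conn.createStatement();
             // Same SQL as before, but executed interactively by Impala, not as MapReduce.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM access_log GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```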