Wednesday, April 6, 2016

HADOOP

Hadoop = MapReduce + HDFS 

It is an Apache project combining:
MapReduce engine with Hadoop Distributed File System (HDFS)

HDFS lets the local disks of all the nodes in a Hadoop cluster operate as a single pool of storage.
Files are split into blocks that are replicated across nodes (by default each block gets 2 extra copies, 3 in total).
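Replication amounts to placing each block on several distinct nodes. A toy sketch of the idea (the node names and round-robin policy here are illustrative; real HDFS placement is rack-aware):

```python
import itertools

def place_blocks(blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes, round-robin style.
    # This only demonstrates the replication idea, not HDFS's actual
    # rack-aware placement algorithm.
    placement = {}
    ring = itertools.cycle(nodes)
    for block in blocks:
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

layout = place_blocks(["blk_1", "blk_2"], ["node-a", "node-b", "node-c", "node-d"])
# Each block lands on 3 different nodes, so losing any one node loses no data.
```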

Hadoop stack (from bottom to top)

  1. MapReduce + HDFS
  2. Database: HBase (NoSQL database). HBase tables are HDFS files. Optionally, HBase tables can serve as input to MapReduce jobs, or a MapReduce job's output can create a new HBase table.
  3. Query: HiveQL (SQL abstraction layer over MapReduce) + Pig Latin (each of its commands roughly corresponds to a SQL operation; used for querying and stepwise data transformation, as an ETL tool)
  4. RDBMS Import/Export: Sqoop (a companion to Hive and Pig that moves data between Hadoop and any RDBMS)
  5. Machine Learning / Data Mining: Mahout
  6. Log file integration: Flume
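Pig Latin's stepwise style (LOAD, FILTER, GROUP, FOREACH ... GENERATE) can be mirrored in plain Python. A hedged sketch with made-up sample data, just to show what "stepwise data transformation" means:

```python
# Sample records: (user, action) pairs — stand-ins for a file loaded from HDFS
records = [("ann", "click"), ("bob", "view"), ("ann", "view"), ("ann", "click")]

# FILTER: keep only click events
clicks = [r for r in records if r[1] == "click"]

# GROUP BY user
grouped = {}
for user, action in clicks:
    grouped.setdefault(user, []).append(action)

# FOREACH ... GENERATE: user and click count
click_counts = {user: len(actions) for user, actions in grouped.items()}
# → {'ann': 2}
```

In Pig each of these steps is one statement, and the engine compiles the whole pipeline down to a chain of MapReduce jobs.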




EMR (Elastic MapReduce) is a Hadoop distribution from AWS.
It shares the core stack: MapReduce and HDFS + the HBase database + Hive and Pig.
It adds an MPP / column-store engine: Impala (originally from Cloudera).

Hive and Pig are abstraction layers over MapReduce; Hive is a batch system.
Impala is an abstraction layer directly over HDFS; Impala is an interactive query engine.





