Hadoop is an Apache project combining a MapReduce engine with the Hadoop Distributed File System (HDFS).
HDFS lets the local disks of every node in a Hadoop cluster operate as a single pool of storage.
Files are replicated across nodes (by default the original plus 2 copies, i.e. 3 replicas in total).
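A minimal sketch of this replication setting through the HDFS Java API, assuming the cluster's core-site.xml/hdfs-site.xml are on the classpath; the file path /data/example.txt is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        // dfs.replication controls how many copies of each block HDFS keeps (default 3).
        System.out.println("Configured replication: " + conf.get("dfs.replication", "3"));

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");      // hypothetical file on the cluster
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Replication of " + file + ": " + current);

        fs.setReplication(file, (short) 2);             // override the replica count for this one file
        fs.close();
    }
}
```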
Hadoop stack (from bottom to top)
- MapReduce + HDFS
- Database: HBase (NoSQL database). HBase tables are stored as HDFS files. Optionally, an HBase table can be the input to a MapReduce job, or a MapReduce job's output can create a new HBase table (see the HBase sketch after this list).
- Query: HiveQL (a SQL abstraction layer over MapReduce) + Pig Latin (each of its commands corresponds to a different SQL operation; used for querying and for stepwise data transformation as an ETL tool). A HiveQL example also follows this list.
- RDBMS Import/Export: Sqoop (a companion component to Hive and Pig that moves data between Hadoop and any RDBMS)
- Machine Learning / Data Mining: Mahout
- Log file integration: Flume
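To illustrate the HBase point above (an HBase table serving as a MapReduce input), here is a minimal sketch that wires a table scan into a mapper via TableMapReduceUtil. The table name "pages" and the output path are hypothetical assumptions, and the reduce side is omitted for brevity.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseRowCount {

    // Mapper that reads HBase rows (row key + Result) and emits one count per row.
    static class RowMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("rows"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "hbase-row-count");
        job.setJarByClass(HBaseRowCount.class);

        // Wire the HBase table "pages" (hypothetical name) in as the job's input.
        TableMapReduceUtil.initTableMapperJob(
                "pages", new Scan(), RowMapper.class,
                Text.class, IntWritable.class, job);

        // No reducers: mapper output is written straight to HDFS.
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path("/out/hbase-count"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```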
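And for the query layer, a minimal sketch of submitting HiveQL from Java over JDBC (HiveServer2). The host, user, and the access_log table are hypothetical; the point is that the statement is ordinary SQL, which Hive compiles into batch MapReduce jobs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Ordinary SQL, but Hive executes it as a batch MapReduce job.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM access_log GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```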
EMR (Elastic MapReduce) is a Hadoop distro from AWS
It shares with the stack above: MapReduce and HDFS, the HBase database, and Hive and Pig.
It adds an MPP / column-store engine: Impala (from Cloudera).
Hive and Pig are abstraction layers over MapReduce; Hive is a batch system.
Impala is an abstraction layer directly over HDFS; it is an interactive query engine.
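As a contrast with the Hive sketch above, the same JDBC pattern can be pointed at Impala's HiveServer2-compatible port (21050) using the Hive JDBC driver; the ";auth=noSasl" suffix assumes an unsecured cluster, and the host and table names are hypothetical. The query then runs in Impala's own MPP engine and returns interactively rather than launching a MapReduce job.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // same driver as the Hive example

        // Port 21050 is Impala's HiveServer2-compatible endpoint; ";auth=noSasl"
        // assumes no Kerberos/SASL on the cluster (an assumption, not a given).
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/default;auth=noSasl");
             Statement stmt = conn.createStatement();
             // Same SQL as before, but executed interactively by Impala, not as MapReduce.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM access_log GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```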