Wednesday, April 6, 2016

IMPOSTOR SYNDROME - when an experienced programmer feels like a perpetual beginner

I've been programming for 6 years and I feel like I'm a perpetual beginner. Do many other programmers feel this way?


Answers from different programmers:

Yes, it's called Impostor Syndrome. If you want to see an army of beginners, walk into a high tech company full of experienced programmers.

Well, that's pretty much the same thing for 99% of devs.

When you start questioning yourself, that's very good - it means that from now on, the real programming starts.

We grow in cycles, and you need to invest the most patience when you doubt yourself the most! Just go to that online course, book, or REPL and write a tiny little program.

MapReduce. What is it?

MapReduce is composed of two main steps, with an intermediate shuffle between them:

Map step (input files)
Map step is the process of going through the unformatted data and generating a series of key-value pairs.

Shuffle intermediate step (processing of the data of input files)
All of the values for a given key are collated into separate piles (all occurrences of a common key go into one pile).

Reduce step (output files)
The Reducer node tallies a count for each key (from all the data values in that key's pile).
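The three phases above can be sketched as a toy word count in plain Python. This is only an illustration of the data flow - a real Hadoop job would distribute the map and reduce work across many nodes - and the function names here are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter

def map_step(lines):
    # Map: go through the raw input and emit a (word, 1) key-value pair per word.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_step(pairs):
    # Shuffle: collate all values for a given key into one pile.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, [value for _, value in group])

def reduce_step(piles):
    # Reduce: tally the values in each key's pile.
    for key, values in piles:
        yield (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_step(shuffle_step(map_step(lines))))
print(counts)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Note that the shuffle is the only step that needs a global view of the data (it must sort/group by key); the map and reduce steps each work on independent chunks, which is what makes the model easy to parallelize.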


HADOOP

Hadoop = MapReduce + HDFS 

It is an Apache project combining:
MapReduce engine with Hadoop Distributed File System (HDFS)

HDFS lets the local disks of all nodes in a Hadoop cluster operate as a single pool of storage.
Files are replicated across nodes (by default the original plus 2 copies, 3 in total).
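The replication idea can be simulated in a few lines of Python. This is a toy sketch, not HDFS's actual block-placement policy (which is rack-aware); the node names, block size, and hash-based placement are all made up for illustration.

```python
import hashlib

REPLICATION = 3   # HDFS default: original + 2 copies
BLOCK_SIZE = 8    # bytes here; real HDFS defaults to 128 MB blocks

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]

def place_blocks(data: bytes):
    """Split data into blocks and assign each block to REPLICATION distinct nodes."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # Pick a starting node from the block's hash, then take the next
        # REPLICATION - 1 nodes around the "ring" for the copies.
        start = int(hashlib.md5(block).hexdigest(), 16) % len(nodes)
        placement[block_id] = [nodes[(start + k) % len(nodes)]
                               for k in range(REPLICATION)]
    return placement

for block_id, replicas in place_blocks(b"hello hadoop distributed fs").items():
    print(block_id, replicas)  # each block lands on 3 distinct nodes
```

The point of the 3 copies is fault tolerance: any single node (or disk) can die and every block of the file is still readable from the surviving replicas.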

Hadoop stack (from bottom to top)

  1. MapReduce + HDFS
  2. Database: HBase (NoSQL database). HBase tables are HDFS files. Optionally, HBase tables can be used as input for MapReduce jobs, or a MapReduce job's output can create a new HBase table.
  3. Query: Hive (HiveQL, an SQL abstraction layer over MapReduce) + Pig (its Pig Latin commands correspond to SQL commands; used for querying and stepwise data transformation as an ETL tool)
  4. RDBMS Import/Export: Sqoop (Additional component to Hive and Pig, moves data between Hadoop and any RDBMS)
  5. Machine Learning / Data Mining: Mahout
  6. Log file integration: Flume




EMR (Elastic MapReduce) is a Hadoop distro from AWS.
It shares the common stack: MapReduce and HDFS, the database layer, and Hive and Pig.
It adds an MPP / column-store engine: Impala (from Cloudera).

Hive and Pig are abstraction layers over MapReduce; Hive is a batch system.
Impala is an abstraction layer over HDFS; it is an interactive query engine.