AWS Big Data stack components overview
- Elastic MapReduce (EMR)
- Redshift
- DynamoDB
- Data Pipeline (ETL tool)
- Simple Storage Service (S3)
- Jaspersoft AWS
- Kinesis (streaming data)
Elastic MapReduce (MapReduce is the processing algorithm)
Amazon implementation of Hadoop
Hadoop-on-Demand
Integrated with S3 (Simple Storage Service)
Runs either the Amazon Hadoop distribution or MapR
MapR - Unlike other Hadoop distributions that require separate clusters for multiple applications, the MapR Platform is built to process distributed files, database tables, and event streams in one unified layer. This lets organizations run both operational apps (e.g., HBase) and analytic apps (e.g., Apache Drill, Hive, or Impala) on one cluster, which MapR says reduces costs as a big data deployment grows. Source: https://www.mapr.com/why-hadoop/why-mapr
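MapReduce sketch: the processing model named above can be illustrated with a word count in plain Python. This is only a single-process toy, not how EMR or Hadoop is invoked; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this note.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on AWS", "big clusters on demand"]))
```

On a real cluster the map and reduce steps run in parallel on many nodes, and the shuffle moves data between them over the network.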
Redshift
Cloud-based, Massively Parallel Processing (MPP), column store data warehouse.
Uses common relational, SQL technology.
Integrated with S3 and DynamoDB
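Column store sketch: a rough illustration of what "column store" means, using plain Python lists. It does not reflect Redshift's actual storage format; the sample rows and the helper names (to_columns, total_amount) are invented for this note.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount(columns):
    """An aggregate over one column touches only that column's values."""
    return sum(columns["amount"])


columns = to_columns(rows)

if __name__ == "__main__":
    print(columns["region"])
    print(total_amount(columns))
```

Analytic queries usually aggregate a few columns over many rows, so storing each column contiguously lets the engine skip the columns a query does not use.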
DynamoDB
Based on Dynamo, Amazon's seminal internal key-value store
Accommodates unstructured data - no schema needs to be declared beyond the table's key
Successor to Amazon SimpleDB
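Key-value sketch: the schemaless key-value idea can be shown with a toy in-memory table. This is not the DynamoDB API; the class KeyValueTable and its put/get methods are hypothetical and only show that items under the same key attribute may carry different fields.

```python
class KeyValueTable:
    """Toy in-memory table: one declared key attribute, everything else free-form."""

    def __init__(self, key_name):
        self.key_name = key_name
        self._items = {}

    def put(self, item):
        """Store an item under its key; no other attribute is checked."""
        if self.key_name not in item:
            raise ValueError(f"item is missing key attribute {self.key_name!r}")
        self._items[item[self.key_name]] = dict(item)

    def get(self, key):
        """Return the stored item for this key, or None if absent."""
        return self._items.get(key)


users = KeyValueTable("user_id")
# The two items share only the key attribute; their other fields differ.
users.put({"user_id": "u1", "name": "Ana"})
users.put({"user_id": "u2", "email": "bo@example.com", "tags": ["beta"]})

if __name__ == "__main__":
    print(users.get("u1"))
    print(users.get("u2"))
```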
Data Pipeline (ETL tool - Extract Transform Load)
A workflow system for shaping data and moving it from table to table or from database to database
Serves as an integration tool for AWS Big Data stack components (moves data between components)
Build pipelines graphically (WEB) or programmatically (scripts)
Works on a scheduled, batch basis
Integrates with RDS/MySQL (Relational Database Service from Amazon - managed SQL database solution)
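ETL sketch: the extract, transform, load flow from table to table can be tried locally with Python's built-in sqlite3 module. This is not a Data Pipeline definition; the table names (raw_orders, orders), the sample rows, and the functions make_demo_db and run_pipeline are assumptions made for this note.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a source table and an empty target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (name TEXT, amount_cents INTEGER)")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [("  alice ", 1250), ("BOB", 300)],
    )
    conn.commit()
    return conn


def run_pipeline(conn):
    """Run one batch: extract raw rows, clean them, load them into the target table."""
    # Extract: read everything from the source table.
    rows = conn.execute("SELECT name, amount_cents FROM raw_orders").fetchall()
    # Transform: normalize names and convert cents to currency units.
    cleaned = [(name.strip().title(), cents / 100.0) for name, cents in rows]
    # Load: write the cleaned rows into the target table.
    conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)", cleaned)
    conn.commit()
    return len(cleaned)


if __name__ == "__main__":
    demo = make_demo_db()
    print(run_pipeline(demo))
    print(demo.execute("SELECT customer, amount FROM orders").fetchall())
```

In a scheduled setup, a function like run_pipeline would be triggered once per batch window rather than called by hand.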
Important Acronyms
AWS Amazon Web Services
EC2 Elastic Compute Cloud
AMI Amazon Machine Image
S3 Simple Storage Service
EMR Elastic MapReduce
VPC Virtual Private Cloud
IAM Identity and Access Management
SSH Secure Shell
Getting Set Up with AWS
Create an account
Create a Key pair
Create an S3 bucket
Install SSH client
Install S3 client
Install SQL Workbench and database drivers