Wednesday, May 18, 2016

Spark: my notes on the Pluralsight course Apache Spark Fundamentals


Resources about the use of Spark:
https://amplab.cs.berkeley.edu/for-big-data-moores-law-means-better-decisions/
https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
https://github.com/rozim/ChessData (PGN source)
https://en.wikipedia.org/wiki/Portable_Game_Notation (PGN wiki)
https://spark.apache.org/examples.html

Spark installation issues on Windows 10:
Error message: "The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-" (on Windows)
How to fix it:
Make sure you have the right version of winutils.exe (its size should be > 100 KB). The download link is on this page: http://letstalkspark.blogspot.com/2016/02/getting-started-with-spark-on-window-64.html
On that page, look for this line: "Download winutils.exe ( Put in C:\BigData\Hadoop\bin ) -- This is for 64-bit"
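The size check above can be sketched in a few lines of Scala, runnable in any Scala REPL. The helper name looksLikeValidWinutils is hypothetical, and reading "> 100 KB" as 100 * 1024 bytes is an assumption; this only checks the size rule of thumb from these notes, not whether the binary actually works.

```scala
import java.io.File

// Hypothetical helper: true if the file exists and is larger than ~100 KB.
// The 100 * 1024 threshold is an assumed reading of the "> 100 KB" rule above.
def looksLikeValidWinutils(path: String): Boolean = {
  val f = new File(path)
  f.isFile && f.length > 100L * 1024L
}

// Example call; the path is the one quoted from the download page.
println(looksLikeValidWinutils("C:\\BigData\\Hadoop\\bin\\winutils.exe"))
```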

Then execute this command:
C:\pathToWinutils\winutils.exe chmod -R 777 \tmp\hive

First task in Spark
(count the words in a file and sort them by count, e.g. the README.md that ships with the Spark package):

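// Read the README as an RDD of lines
val textFile = sc.textFile("file:///spark/README.md")
// Split each line into words
val tokenizedFileData = textFile.flatMap(line => line.split(" "))
// Pair each word with a count of 1
val countPrep = tokenizedFileData.map(word => (word, 1))
// Add up the counts per word
val counts = countPrep.reduceByKey((accumValue, newValue) => accumValue + newValue)
// Sort by count, highest first
val sortedCounts = counts.sortBy(kvPair => kvPair._2, ascending = false)
// Write the result as text files, one per partition
sortedCounts.saveAsTextFile("file:///PluralsightData/ReadMeWordCount")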

Then check the output files, e.g. part-00000 and part-00001, in the ReadMeWordCount folder on your file system.
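To see what each step of the word count produces without a cluster, here is a minimal sketch of the same pipeline on plain Scala collections. It is not Spark code: the two sample lines are made up, and groupBy plus sum is only a local stand-in for reduceByKey.

```scala
// Plain Scala collections, no Spark needed; the sample lines are made up.
val lines = Seq("to be or not to be", "to do is to be")

// flatMap: every line becomes several words, flattened into one sequence
val tokens = lines.flatMap(line => line.split(" "))

// map: pair each word with a count of 1
val pairs = tokens.map(word => (word, 1))

// Local stand-in for reduceByKey: group the pairs by word and add up the 1s
val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

// Sort by count, highest first, like sortBy(kvPair => kvPair._2, false)
val sortedCounts = counts.toSeq.sortBy(kv => -kv._2)

sortedCounts.foreach(println)
```

Paste it into a plain Scala REPL (or spark-shell) to inspect tokens, pairs and counts one by one. Words with equal counts may come out in any order.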

Try another function on the data set:
tokenizedFileData.countByValue
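On an RDD, countByValue returns a local Map from each distinct value to its count (as a Long), so it gives the word counts in one call but brings the whole result back to the driver. A rough local equivalent on a plain Scala collection, with made-up sample words, could look like this:

```scala
// Plain Scala sketch of what countByValue computes; sample data is made up.
val words = Seq("spark", "hadoop", "spark", "scala", "spark")

// Group equal values together and count the size of each group
val valueCounts: Map[String, Long] =
  words.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size.toLong) }

valueCounts.foreach(println)
```

Because the result is an ordinary Map rather than an RDD, this approach suits data sets with a modest number of distinct values.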