Functional Programming in Spark
Spark is built around functional programming.
A developer defines the business logic as a function, and Spark delivers that function to the nodes of the cluster for execution. A node can receive the function in one of two ways: it can be defined in advance as a named function, or passed as an anonymous function.
Below is how anonymous functions are written.
- Python: using lambda notation (lambda x: …)
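As a minimal sketch of both styles (the sample data and app name are invented for illustration):

```python
from pyspark import SparkContext

sc = SparkContext(appName="AnonymousFunctionDemo")
numbers = sc.parallelize([1, 2, 3, 4])

# Way 1: define the function in advance, then pass it by name.
def double(x):
    return x * 2

print(numbers.map(double).collect())           # [2, 4, 6, 8]

# Way 2: pass an anonymous function using lambda notation.
print(numbers.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]
```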
The Data Types of a Spark RDD
An RDD can store data of various types.
- Primitive types: integers, characters, Booleans, etc.
- Sequence types: strings, lists, arrays, tuples, dictionaries, etc.
- Scala and Java objects
In particular, there is a useful kind of RDD called a Pair RDD, which stores data as key-value pairs. With a Pair RDD you can perform mathematical operations on the data (such as sums or averages) and grouping operations (such as sorting, joining, and counting) based on a particular key, using operations like reduceByKey; a short sketch follows.
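As a minimal illustration (the sample sales data is invented, and sc is an existing SparkContext):

```python
# A Pair RDD of (key, value) records; reduceByKey merges the values
# of each key with the given function -- here, a per-key sum.
sales = sc.parallelize([("apple", 3), ("pear", 2), ("apple", 5)])
totals = sales.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # e.g. [('apple', 8), ('pear', 2)]
```

Now let's test a word-count program in the MapReduce style, implemented with a Pair RDD.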
Word Count with Spark Programming
Here are the steps (a PySpark sketch implementing them follows the list):
- Read the “a_thousand_winds.txt” file stored on HDFS into an RDD
- Remove special characters from the RDD (but keep spaces)
- Convert every line that is read to uppercase
- Split each line into words, using whitespace as the delimiter
- Make each word a key and build a Pair RDD with 1 as the value
- Apply reduceByKey to the resulting Pair RDD to count each word's usage
- Change (word, count) to (count, word)
- Sort in descending order of count
- Restore (count, word) back to (word, count)
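A sketch of these steps in PySpark (the HDFS path and the regular expression are assumptions; adjust them to your environment):

```python
import re
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/data/a_thousand_winds.txt")    # read from HDFS into an RDD
          .map(lambda line: re.sub(r"[^A-Za-z0-9 ]", "", line))    # remove special characters, keep spaces
          .map(lambda line: line.upper())                          # convert to uppercase
          .flatMap(lambda line: line.split())                      # split into words on whitespace
          .map(lambda word: (word, 1))                             # Pair RDD of (word, 1)
          .reduceByKey(lambda a, b: a + b)                         # count each word's usage
          .map(lambda pair: (pair[1], pair[0]))                    # (word, count) -> (count, word)
          .sortByKey(ascending=False)                              # sort by count, descending
          .map(lambda pair: (pair[1], pair[0])))                   # restore (word, count)

print(counts.take(10))
```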
As we mentioned in Spark Part 2, a Spark RDD performs no processing at the transformation stage; everything is executed at the action stage. This follows the rule of lazy execution: until an action runs, only the lineage of operations is recorded on the RDD.
When an action on an RDD is re-executed, the processor would normally have to read all the data and recompute the RDD from scratch. Spark, however, can keep an RDD in memory so that the same data is reused, which makes repeating the same action much more efficient, as the sketch below shows.
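A small sketch of this behavior (assuming an existing SparkContext sc; the HDFS path is the one assumed in the word-count example):

```python
lines = sc.textFile("hdfs:///user/data/a_thousand_winds.txt")
upper = lines.map(lambda line: line.upper())  # transformation: nothing runs yet

print(upper.count())  # action: the file is read and mapped now
print(upper.count())  # without caching, the whole lineage is recomputed

upper.cache()         # ask Spark to keep the result in memory
print(upper.count())  # recomputed once more, then cached
print(upper.count())  # served from memory, no recomputation
```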
Spark provides two operations for this, persist() and cache(), which store an RDD in Spark's memory for frequently used data. Below are the storage levels that can be used.
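- MEMORY_ONLY (the default): deserialized objects kept in memory; partitions that do not fit are recomputed when needed
- MEMORY_AND_DISK: partitions that do not fit in memory are spilled to disk
- MEMORY_ONLY_SER / MEMORY_AND_DISK_SER: the same levels, but the data is stored in serialized form
- DISK_ONLY: partitions are stored only on disk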
By default, the RDD is stored in memory. There is also an option that uses the disk as well: the partitions that do not fit in memory are spilled to disk.
To save memory, an RDD can also be stored in serialized form rather than as raw objects. In that case memory usage decreases, but CPU usage increases because of the overhead of serializing and deserializing the data.
Stored RDDs are, by default, evicted by an LRU algorithm. An RDD can also be removed from memory or disk explicitly using the unpersist() operation.
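A sketch of explicit persistence in PySpark (sc is an existing SparkContext; the sample data is invented):

```python
from pyspark import StorageLevel

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.persist(StorageLevel.MEMORY_AND_DISK)  # choose a storage level explicitly
# pairs.cache() is shorthand for persisting at the default in-memory level

print(pairs.count())  # the action materializes and stores the RDD
pairs.unpersist()     # explicitly remove it from memory and disk
```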
The next blog post will cover the fault-tolerance mechanism of Spark RDDs, which builds on the persistence property described here.
Resources: Matei Zaharia et al. (2012), “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” USENIX NSDI ’12.