Apache Spark Resilient Distributed Dataset (RDD) Programming Transformation Operations

Introduction

This article continues from the preceding article in this series: Apache Spark Resilient Distributed Dataset (RDD) Programming Part 1.

Providing Functions in Spark

Many of Spark's transformations, and some of its actions, depend on passing a function as a parameter to compute the data. In Python there are three options for passing functions:

  1. Lambda functions.
  2. Top level functions.
  3. Locally defined functions.

Below is an example using a lambda function and a top-level function:

apache = readme.filter(lambda line: "Apache" in line)

def containsApache(line):
    return "Apache" in line

apache = readme.filter(containsApache)

Both of the approaches above return the same RDD into apache.
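
The third option, a locally defined (nested) function, works in the same way. Below is a minimal sketch, assuming the same readme RDD; find_apache_lines is a hypothetical helper name used only for illustration:

def find_apache_lines(rdd):
    # A function defined inside another function can also be passed
    # to a transformation.
    def containsApache(line):
        return "Apache" in line
    return rdd.filter(containsApache)

apache = find_apache_lines(readme)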

Be careful when passing functions that cause Spark to serialize the object containing them. This usually happens when the function is a method of an object or references fields of an object (look for keywords such as self or field names). A better approach is to copy those fields into a local variable before passing the function.
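
As a rough illustration of this pitfall, consider the hypothetical WordFilter class below. The first method references self inside the lambda, which forces Spark to serialize the whole object; the second copies the field into a local variable first, so only that value is shipped with the function:

class WordFilter(object):
    def __init__(self, keyword):
        self.keyword = keyword

    def matching_lines_bad(self, rdd):
        # Referencing self inside the lambda means the entire
        # WordFilter object must be serialized and sent to the workers.
        return rdd.filter(lambda line: self.keyword in line)

    def matching_lines_good(self, rdd):
        # Copy the field into a local variable; only the string is
        # serialized along with the function.
        keyword = self.keyword
        return rdd.filter(lambda line: keyword in line)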

Most Used Transformations

In addition to the transformations covered below, Spark provides further operations for RDD objects containing certain types of data.

Basic RDDs

There are a number of transformations and actions that can be performed on all RDDs regardless of the objects contained inside them.

Common Transformations:

Element-wise transformations: The most common element-wise transformations are the map() and filter() functions. The map() function takes a function as a parameter and applies it to each element in the RDD. The filter() function takes a function and returns an RDD containing only the elements for which that function returns True. Note that the return type of map() does not have to be the same as the input type. Below is an example of a map() transformation.

vals = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
cubed = vals.map(lambda x: x**3).collect()
for number in cubed:
    print("{}".format(number))
# 1
# 8
# 27
# 64
# 125
# 216
# 343
# 512
# 729
# 1000
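
For completeness, here is a small filter() sketch on the same vals RDD, keeping only the even numbers (the modulus predicate is just one possible choice):

evens = vals.filter(lambda x: x % 2 == 0).collect()
print(evens)
# [2, 4, 6, 8, 10]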

The map() transformation is a one-to-one mapping. If we want to produce multiple output elements for each input element, i.e. one-to-many, we can use the flatMap() transformation. Below is an example of flatMap() that splits an input string into words (one string into many words):

strings = sc.parallelize(["Welcome to Programmathics", "hope you like it"])
strings.count()
# 2
words = strings.flatMap(lambda x: x.split(" "))
words.count()
# 7
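
To see the difference from map(), the sketch below applies the same split with map(); it keeps one list per input string rather than flattening the result into individual words:

nested = strings.map(lambda x: x.split(" ")).collect()
print(nested)
# [['Welcome', 'to', 'Programmathics'], ['hope', 'you', 'like', 'it']]
print(words.collect())
# ['Welcome', 'to', 'Programmathics', 'hope', 'you', 'like', 'it']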

Pseudo set operations: You can transform data in RDDs using many of the operations available on mathematical sets, such as unions and intersections. It is important to remember that these set operations require the RDDs to be of the same type. Below is an example:

first = sc.parallelize(["programming","mathematics","machine learning", "data science"])
second = sc.parallelize(["mathematics", "big data", "spark", "data science"])

for word in first.distinct().collect():
    print(word)
# mathematics
# data science
# programming
# machine learning

for word in first.union(second).collect():
    print(word)
# programming
# mathematics
# machine learning
# data science
# mathematics
# big data
# spark
# data science

for word in first.intersection(second).collect():
    print(word)
# mathematics
# data science

for word in first.subtract(second).collect():
    print(word)
# programming
# machine learning

Note that intersection() and distinct() return only unique values from the RDD. This makes them expensive operations, as the data has to be shuffled across the entire cluster.

Another set operation we can perform on RDD objects is the Cartesian product.

for pair in first.cartesian(second).collect():
    print(pair)
# ('programming', 'mathematics')
# ('programming', 'big data')
# ('programming', 'spark')
# ('programming', 'data science')
# ('mathematics', 'mathematics')
# ('mathematics', 'big data')
# ('mathematics', 'spark')
# ('mathematics', 'data science')
# ('machine learning', 'mathematics')
# ('machine learning', 'big data')
# ('machine learning', 'spark')
# ('machine learning', 'data science')
# ('data science', 'mathematics')
# ('data science', 'big data')
# ('data science', 'spark')
# ('data science', 'data science')

Be wary that the Cartesian product is a very expensive operation when working with very large datasets, since crossing an RDD of n elements with an RDD of m elements produces n * m pairs.

Conclusion

This covers the most common transformation operations on Spark RDD objects. It is worth investigating the other transformation operations in more detail to suit your needs. Please read my next article, Apache Spark Resilient Distributed Dataset (RDD) Programming Action Operations, to learn about the most common action operations in Spark.
