Apache Spark Resilient Distributed Dataset (RDD) Programming Part 1

RDD Basics

An RDD is simply an immutable, distributed collection of objects. RDDs are split into multiple partitions so they can be computed on different nodes of the cluster. RDDs can contain Python, Java or Scala objects.

RDDs can be created in two ways. One way is to load an external dataset and the other is to distribute a collection of objects (such as a list or a set) from the driver program using the SparkContext. An example using text files can be found in this article: Apache Installation and Building Stand Alone Applications.

RDDs provide two types of operations: transformations and actions. Transformations create a new RDD from a previous one, e.g. filter(). Actions compute a result based on an RDD and either return it to the driver program or save it to an external storage system, e.g. first().

RDDs are computed in a lazy fashion, which means that Spark waits to see the entire chain of transformations before computing anything, which saves resources. If you wish to persist an RDD for repeated actions you can do so via the persist() method. An example is below:

sparkLines.persist()
sparkLines.count()
# 20
sparkLines.first()
# u'# Apache Spark'

Creating RDD

The easiest way to create an RDD in Spark is to parallelize a collection object in your driver program. This is a good way to learn Spark but is not advised for production, as it requires that you have your entire dataset in memory on one machine, which is unlikely when working with Big Data. Below is an example of using the parallelize() method:

rows = sc.parallelize(["Hello", "World"])

The most common way is to load data from external storage. An example of this has been covered using text files in this article: Apache Installation and Building Stand Alone Applications.
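As a quick preview of that pattern (assuming README.md is in the working directory and sc is an existing SparkContext), loading a text file gives an RDD with one element per line:

readme = sc.textFile("README.md")  # each line of the file becomes one element of the RDD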

RDD Operations: Transformations and Actions

An RDD operation is either a transformation or an action. A transformation creates a new RDD, whereas an action returns a result to the driver program or writes the data to storage. You can tell that a function is a transformation because it returns an object of type RDD; action functions return some other type.
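As a rough illustration of that distinction (assuming sc is an existing SparkContext), you can check the return types yourself:

from pyspark.rdd import RDD

nums = sc.parallelize([1, 2, 3, 4])
evens = nums.filter(lambda n: n % 2 == 0)

print(isinstance(evens, RDD))          # True: filter() is a transformation and returns an RDD
print(isinstance(evens.count(), int))  # True: count() is an action and returns a plain Python value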

Transformations

Transformations are operations on RDDs that return a new RDD object. Most transformations work on an element-wise basis, but there are some exceptions to this general rule. Below is an example using the README.md file that ships with Apache Spark.

readme = sc.textFile("README.md")
sparkLines = readme.filter(lambda line: "Spark" in line)
clusterLines = readme.filter(lambda line: "cluster" in line)
spark_cluster = sparkLines.union(clusterLines)

What we have done is create four RDD objects. readme is an RDD of the entire README.md file. sparkLines is an RDD containing only the lines that contain "Spark". clusterLines is an RDD containing the lines that contain "cluster". The final RDD (spark_cluster) was created using the union() transformation, which combines two RDDs into one. Spark keeps track of the dependencies between RDDs in what is called a lineage graph. This is used to recompute lost data if part of a persisted RDD is lost.
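If you want to see the lineage graph Spark has recorded, the toDebugString() method returns a textual summary of the RDDs an RDD was built from (a minimal sketch; the exact formatting of the output varies between Spark versions):

lineage = spark_cluster.toDebugString()  # textual description of the lineage
print(lineage)  # lists the union, the two filters, and the original text file it depends on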

Actions

Actions are the second type of operation on RDD objects and are the operations that actually do something with the dataset. Actions return a final value to the driver program or write data to an external storage system.

Examples of actions are count() and take(). Using the same example as in the Transformations section above, we can run the following commands:

print("Input had {} lines containing Spark or cluster".format(spark_cluster.count()))
# Input had 23 lines containing Spark or cluster

print("Here are 15 examples: ")
for line in spark_cluster.take(15):
    print(line)

RDDs also have a collect() function to retrieve the entire RDD. When using collect(), make sure the RDD has been reduced down to a size that fits entirely in memory on the driver. Most of the time collect() is not used to return data to the driver because the dataset is too large; a more common approach is to write the data to an external storage system.
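A minimal sketch of both options (the output path below is just an example): collect() suits small results, while saveAsTextFile() writes the RDD out without pulling it through the driver.

small_result = spark_cluster.collect()  # returns a Python list; the whole RDD must fit in driver memory
for line in small_result:
    print(line)

spark_cluster.saveAsTextFile("spark_cluster_output")  # writes the RDD as text files to external storage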

One thing to keep in mind with action functions is that each time one is called, the entire RDD is recomputed from scratch by default. If the transformations are computationally intensive, it may be a better idea to persist the data.
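For example, if spark_cluster is going to be used by several actions, persisting it avoids recomputing the filters and the union each time (a minimal sketch; MEMORY_ONLY is the default storage level for persist()):

from pyspark import StorageLevel

spark_cluster.persist(StorageLevel.MEMORY_ONLY)  # mark the RDD to be kept in memory once computed
spark_cluster.count()   # the first action computes the RDD and caches it
spark_cluster.take(15)  # later actions reuse the cached partitions instead of recomputing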

Lazy Evaluation in Spark RDD

Transformations on RDD objects are lazily evaluated: Spark will not begin to execute transformations until it sees an action. The same is true for loading data into an RDD. For example, the sc.textFile() function does not load data until it is required, which means that, if the RDD is not persisted, the data may be loaded multiple times. To force a transformation to execute in Spark you need to call an action such as count().
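A small illustration of this behaviour (assuming README.md is in the working directory): the file is not read and the filter is not applied until the count() call.

readme = sc.textFile("README.md")                         # nothing is read from disk yet
sparkLines = readme.filter(lambda line: "Spark" in line)  # still nothing executed
sparkLines.count()                                        # the action triggers the read and the filter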

Conclusion

This article provides a simple introduction to Apache Spark RDD operations. You can read more about RDD operations and passing functions into Spark in the next article: Apache Spark Resilient Distributed Dataset (RDD) Programming Transformation Operations.
