Apache Spark Resilient Distributed Dataset (RDD) Programming Action Operations

Requirements

This article continues from the preceding article, Apache Spark Resilient Distributed Dataset (RDD) Programming Transformation Operations. However, you will be able to follow the examples here without having read it.

Actions

The action you will most likely use on RDDs in Spark is the reduce() function. It takes a function that operates on two elements of the type in your RDD and returns a new element of the same type.

Below is an example of using the reduce() function:

nums = sc.parallelize([1,2,3,4,5])
sum = nums.reduce(lambda a, b: a+b)
sum
# 15

Another common action is the fold() function. Similar to reduce(), fold() requires a function with the same signature as the one passed to reduce(). In addition, fold() takes a zero value that is used for the initial call on each partition of the RDD.
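Below is a minimal sketch of fold(), reusing the nums RDD from the reduce() example; with addition as the function and 0 as the zero value it produces the same result:

total = nums.fold(0, lambda a, b: a + b)
total
# 15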

The return type of fold() and reduce() must be the same as the element type of the input RDD. Sometimes we need to return a different type. One way to work around this limitation of reduce() and fold() is to use the map() transformation to convert every element into the required type before applying the fold() or reduce() action, as shown below.
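As an illustrative sketch, the words RDD below is made up for this example; mapping each string to its length lets reduce() return an integer even though the RDD holds strings:

words = sc.parallelize(["spark", "rdd", "action"])
totalLength = words.map(lambda w: len(w)).reduce(lambda a, b: a + b)
totalLength
# 14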

Another commonly used action that is not restricted in its return type is the aggregate() function. Like fold(), aggregate() requires an initial zero value of the type we want to return. We also supply a function to combine each element of the RDD with an accumulator, and finally a function to merge two accumulators. Below is an example of using aggregate() to compute the average of an RDD and avoid having to use a map() before a fold().

sumVals = nums.aggregate(
    (0, 0),
    (lambda acc, val: (acc[0] + val, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
sumVals[0] / float(sumVals[1])
# 3.0

Returning Data to Driver Program

The most common way to return data to the driver program is via the collect() method. Its restriction is that the entire dataset must fit into the memory of the driver, so it is only suitable for small RDDs.
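For example, collecting the small nums RDD from above:

nums.collect()
# [1, 2, 3, 4, 5]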

There is also a take(n) function that returns n elements from the RDD to the driver program. Note that this is not a way to sample the data, as it simply reads from as few partitions as possible and can therefore be biased. The returned elements are not in any particular order.
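For example, on the nums RDD (the exact elements returned depend on how the RDD is partitioned):

nums.take(3)
# [1, 2, 3]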

If you wish to get an ordered dataset back to the driver program you can use the top() function. By default it uses the natural ordering of the elements, but you can also supply your own key function to change the ordering.
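Below is a short sketch using the nums RDD; the second call passes a key function so that the smallest elements are returned instead:

nums.top(3)
# [5, 4, 3]
nums.top(2, key=lambda x: -x)
# [1, 2]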

If you wish to sample your dataset in an unbiased manner, there is a takeSample(withReplacement, num, seed) function that can be used.
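Below is a sketch of takeSample() on the nums RDD; the values shown are only illustrative since the actual sample depends on the seed:

nums.takeSample(False, 3, seed=42)
# e.g. [4, 1, 5] -- three random elements drawn without replacement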

If you wish to apply a function to each element without returning any data to the driver program, you can use the foreach() action.
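Below is a minimal sketch; note that the function runs on the executors, so on a cluster any printed output appears in the worker logs rather than the driver console:

def show(x):
    print(x)  # side effect happens on the executors, nothing is returned to the driver

nums.foreach(show)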

There are plenty more action functions available in pyspark. I would strongly suggest that you read through their documentation and play around with them using your own data.

Conclusion

This article has provided an overview of the most frequently used actions in pyspark. I strongly suggest that you read through the pyspark documentation, as there are plenty more actions that this article did not cover. Leave a comment below if you find any interesting or cool actions that you would like to share.
