Apache Spark PySpark Actions on Pair RDDs

Introduction

All the regular actions available on base RDD objects are also available on Pair RDD objects. If you are unfamiliar with these actions, please read my article on them: Apache Spark Resilient Distributed Dataset (RDD) Programming Action Operations.

In addition to these actions, Pair RDD objects have some extra actions that take advantage of the key/value nature of the data.

Actions Available on Pair RDDs

The most common actions used on Pair RDDs are countByKey(), collectAsMap() and lookup(key). Below is an example:

pairRDD = sc.parallelize([(1,1),(2,2),(1,4),(3,5)])
pairRDD.countByKey()
# defaultdict(int, {1: 2, 2: 1, 3: 1})

pairRDD.collectAsMap()
# {1: 4, 2: 2, 3: 5}

pairRDD.lookup(1)
# [1, 4]

Note that collectAsMap() returns only one value for each duplicate key; in the example above, the later value for key 1 (4) overwrote the earlier one (1).
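The behaviour of these three actions can be understood with a plain-Python sketch of their semantics, with no Spark required. This is an illustration of what the actions compute on the example data, not how Spark implements them internally:

```python
from collections import Counter

pairs = [(1, 1), (2, 2), (1, 4), (3, 5)]

# countByKey() amounts to counting occurrences of each key
count_by_key = Counter(k for k, _ in pairs)

# collectAsMap() builds an ordinary dict, so for duplicate keys
# a later value overwrites an earlier one
collect_as_map = dict(pairs)

# lookup(key) returns every value stored under that key, as a list
lookup_1 = [v for k, v in pairs if k == 1]
```

Running this yields count_by_key of {1: 2, 2: 1, 3: 1}, collect_as_map of {1: 4, 2: 2, 3: 5}, and lookup_1 of [1, 4], matching the Pair RDD output above.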

Conclusion

This concludes a brief summary of the most common actions used on Pair RDDs. There are several more, which will be discussed in more detail in a future article.
