Apache Spark Installation and Building Standalone Applications

Objective

This article will teach you how to install Apache Spark on your local machine and how to run the Spark shell. I will assume that the reader uses a Linux/OS X environment; Windows users may need to use different commands.

Instructions

Firstly, you will need to download Apache Spark onto your machine. You can download Apache Spark from this link: http://spark.apache.org/downloads.html

Choose the default selections and use direct download. This will download a .tgz archive. Next, extract the archive via: tar -xf <spark_file.tgz>

cd into the extracted directory.

For this article I downloaded Apache Spark version 2.1.0, so your file names and commands may differ slightly depending on which version of Apache Spark you have installed.

You can also use Spark from Scala; however, this article will focus on Python.

Running the PySpark Shell with IPython

IPython is a very useful Python interpreter that is favored amongst Python programmers. You will be glad to know that you can use IPython as the PySpark shell. To do so, run the following command: PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark

Once in the shell we can create RDD objects that we can then process. Try the following commands:

lines = sc.textFile("README.md")

lines is an RDD (Resilient Distributed Dataset), the fundamental data structure in Apache Spark. There are a number of useful functions that can be called on this object, some of which are shown below.

lines.count()  # count the number of elements in the RDD
# 104
distinct = lines.distinct()  # create an RDD containing only the distinct lines
distinct.count()
# 66
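
RDDs support many operations beyond count() and distinct(). As a quick illustrative sketch using the same lines RDD, the snippet below splits each line into words with flatMap(); the exact counts will depend on the README.md that ships with your Spark version.

words = lines.flatMap(lambda line: line.split())  # one element per word
words.count()             # total number of words in the file
words.distinct().count()  # number of distinct words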

Spark Core Concepts

Every Spark application has a driver program, which contains the application's main function. The driver program is responsible for distributing work across the various nodes in the cluster so that operations can be applied to the data in parallel. In the PySpark shell, the shell itself acts as the driver program, and a SparkContext object is created for you as the variable sc. With the SparkContext you can create RDDs, as we did with sc.textFile(). The driver program also manages a number of worker processes called executors.
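
To see the driver and executors working together, here is a minimal sketch using sc.parallelize(), which turns a local Python collection into an RDD. The map() transformation is evaluated on the executors, and collect() brings the results back to the driver.

nums = sc.parallelize(range(1, 6))   # distribute a local collection as an RDD
squares = nums.map(lambda x: x * x)  # transformation, evaluated on the executors
squares.collect()                    # action, results are returned to the driver
# [1, 4, 9, 16, 25]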

Most of Spark's API relies on passing functions to its operators. Below is an example using a lambda function:

lines = sc.textFile("README.md")
sparkLines = lines.filter(lambda line: "Spark" in line)
sparkLines.count()
# 20
sparkLines.first()
# u'# Apache Spark'
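
Lambda functions are convenient for short predicates, but you can just as easily pass a named function, as long as it is defined at the top level so Spark can serialize it. A sketch equivalent to the filter above:

def hasSpark(line):
    return "Spark" in line

sparkLines = lines.filter(hasSpark)
sparkLines.count()
# 20, same result as the lambda version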

Initializing a SparkContext

Initializing a SparkContext object in a Python script is quite simple; after importing SparkContext from pyspark, it can be done with the following line:

sc = SparkContext("local", "My App")

Two parameters need to be passed to the SparkContext() constructor. The first is a cluster URL that tells Spark how to connect to a cluster; in this example we pass local to run on your host machine only. The second is an application name, which lets you identify your application in the Spark UI.
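
An equivalent, slightly more flexible approach is to build a SparkConf object first and pass it to the SparkContext; a sketch of that pattern looks like this:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)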

Building Standalone Applications

You can build standalone applications in Python and run them with Apache Spark.

For this example you will need to cd into your Spark home directory, where the README.md file is located.

In this directory, create a new Python script like the one below:

# my_script.py
from pyspark import SparkContext

logFile = "YOUR SPARK HOME/README.md"
sc = SparkContext("local", "My App")
logData = sc.textFile(logFile).cache()

numEs = logData.filter(lambda s: 'e' in s).count()
numIs = logData.filter(lambda s: 'i' in s).count()

print("Lines with e: %i, lines with i: %i" % (numEs, numIs))

sc.stop()

Note that you will need to replace YOUR SPARK HOME with the correct file path.

You can then run the application with spark-submit; here --master local[4] runs Spark locally with four worker threads:

bin/spark-submit --master local[4] my_script.py

# Lines with e: 60, lines with i: 59
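
If you would rather not hard-code the path, one optional variation on the script above is to read it from the command line; spark-submit passes any arguments that follow the script name straight through to your program, so you could run bin/spark-submit --master local[4] my_script.py /path/to/spark/README.md with a script like this:

# my_script.py (variation: path passed as a command-line argument)
import sys
from pyspark import SparkContext

logFile = sys.argv[1]  # path to README.md supplied on the command line
sc = SparkContext("local", "My App")
logData = sc.textFile(logFile).cache()

numEs = logData.filter(lambda s: 'e' in s).count()
numIs = logData.filter(lambda s: 'i' in s).count()

print("Lines with e: %i, lines with i: %i" % (numEs, numIs))
sc.stop()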

Conclusion

This concludes a quick overview of how to install Spark, run PySpark, get familiar with the Spark shell, and build standalone applications.
