Apache Spark Standalone Tutorial with Docker

Requirements

To complete this tutorial you will need an understanding of Docker and Docker Compose. If you are unfamiliar with these tools, check out my tutorial on Docker Compose here. This tutorial also assumes that the reader has an understanding of Python and some basic maths background.

Objective

In this tutorial we will go through an example of using Spark to run a simple algorithm that finds a numeric approximation to the number Pi.

Setup

We will be using a template available on GitHub to set up the Docker containers that run Apache Spark in standalone mode. Clone the following repository onto your machine and start the Docker containers by following the instructions in the repo: https://github.com/gettyimages/docker-spark. Make sure that you forward the required ports in VirtualBox so that the links in the Spark UI work.

Once you have built your Docker machine and Docker containers you should have two containers running: one named dockerspark_worker_1 and the other dockerspark_master_1. As Spark developers we will mostly be working on the master node of the cluster.
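
The exact commands depend on your Docker setup, but with Docker Compose installed the whole setup looks roughly like this (treat it as a sketch and follow the repo's README if anything differs):

git clone https://github.com/gettyimages/docker-spark.git
cd docker-spark
docker-compose up -d    # start the master and worker containers in the background
docker ps               # confirm dockerspark_master_1 and dockerspark_worker_1 are running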

Steps

First we will run pyspark in the master container. Exec into the master container via:

docker exec -ti dockerspark_master_1 bash

Next, start the pyspark shell: pyspark
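
The pyspark shell creates a SparkSession for you and exposes it as spark (with the underlying SparkContext as sc). A quick check to confirm the shell is connected; the exact master URL you see depends on how the repo configures Spark:

spark                       # the SparkSession created by the pyspark shell
spark.sparkContext.master   # the master URL; the value depends on the repo's Spark configuration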

Import the following packages:

import sys
from random import random
from math import sqrt
from operator import add

We will also need to define some parameters to run the algorithm across the cluster:

partitions = 10
n = 100000 * partitions

Next we will write the function that is the building block of the numeric approximation of Pi. The function is as follows:

def g(_):
    x = random()
    y = random()
    return 1 if sqrt(x**2 + y**2) <= 1 else 0

This function draws a random point (x, y) in the unit square and determines whether it lies inside the first quadrant of the unit circle. If it does, it returns 1; otherwise it returns 0.
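
For intuition, the same estimate can be computed locally with a plain Python loop, without Spark. This is only a quick sanity check of g() inside the pyspark shell, not part of the distributed job:

samples = 100000
inside = sum(g(i) for i in range(samples))   # number of random points that landed inside the quarter circle
print("Local estimate of Pi: %0.4f" % (4.0 * inside / samples))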

Next we are going to run this function across our nodes a large number of times and aggregate the results to get a final count.

In pyspark run the following command:

count = spark.sparkContext.parallelize(range(1, n+1), partitions).map(g).reduce(add)

count is the number of points, out of the 1,000,000 sampled, that fell inside the first quadrant of the unit circle.
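
The same one-liner can also be split into intermediate steps, which makes the pipeline a little easier to read and to follow in the Spark UI; this is just an equivalent rewrite of the command above:

rdd = spark.sparkContext.parallelize(range(1, n + 1), partitions)   # distribute n dummy elements across the cluster
hits = rdd.map(g)          # 1 if the random point fell inside the quarter circle, else 0
count = hits.reduce(add)   # sum the hits across all partitions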

NOTE: the area of the unit circle divided by the area of a square of side length 2 is Pi/4, and the same ratio holds between the first quadrant of the unit circle and the unit square our random points are drawn from. The fraction count/n therefore approximates Pi/4, so to get the numeric approximation of Pi we multiply count by 4 and divide by the total number of points (n).

print("Pi is roughly %0.16f" % (4.0*count/n))
# Pi is roughly 3.1414800000000001
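
To make the geometry behind this concrete: the first quadrant of the unit circle has area Pi/4, while the unit square the random points are drawn from has area 1, so the expected value of count/n is Pi/4 (roughly 0.785). A quick check, separate from the Spark job:

from math import pi
quarter_circle_area = pi / 4.0   # area of the first quadrant of the unit circle
unit_square_area = 1.0           # area of [0, 1] x [0, 1], where the random points are drawn
print(quarter_circle_area / unit_square_area)   # ~0.7853981633974483, the expected value of count/n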

You can check the Spark UI to see how long the algorithm took to produce this approximation. Note that this Monte Carlo approach is not very accurate, so do not expect much accuracy beyond 2 decimal places at this sample size.
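
One way to see this slow convergence for yourself is to rerun the estimate with increasing sample sizes. The loop below is just a sketch, and the larger runs will take noticeably longer on a single worker:

for scale in (1, 10, 100):
    n_big = 100000 * partitions * scale
    c = spark.sparkContext.parallelize(range(1, n_big + 1), partitions).map(g).reduce(add)
    print("n = %d -> Pi is roughly %0.6f" % (n_big, 4.0 * c / n_big))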

The calculation of the numeric approximation of Pi could have been simplified further with the following operation:

Pi = 4*spark.sparkContext.parallelize(range(1, n+1), partitions).map(g).sum()/n
print("Pi is roughly %0.16f" % (Pi))
# Pi is roughly 3.1414800000000001
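
As another equivalent variation, you could filter for the hits and count them instead of summing them; the result is the same, and it introduces two more common RDD operations, filter() and count():

inside = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(g).filter(lambda v: v == 1).count()
print("Pi is roughly %0.16f" % (4.0 * inside / n))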

Functions 

parallelize(): creates a parallelized collection. The elements of the collection are copied to form a distributed dataset (an RDD) that can be operated on in parallel. In this tutorial we passed a second parameter into the parallelize() function: the number of partitions to cut the dataset into.
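
A small example, run in the same pyspark shell, of what the partition argument does (the exact grouping of elements may vary):

rdd = spark.sparkContext.parallelize(range(10), 3)
print(rdd.getNumPartitions())   # 3
print(rdd.glom().collect())     # elements grouped by partition, e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]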

map(): applies the function passed in as a parameter to each element of the RDD and returns a new RDD of the results.
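
For example, squaring each element of a small RDD:

squares = spark.sparkContext.parallelize([1, 2, 3, 4]).map(lambda v: v * v)
print(squares.collect())   # [1, 4, 9, 16]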

reduce(): reduces the elements of the RDD using the commutative and associative binary operator passed in.
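
For example, summing a small RDD with the add operator imported earlier:

total = spark.sparkContext.parallelize([1, 2, 3, 4]).reduce(add)
print(total)   # 10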

Conclusion

You have now implemented a simple algorithm in Spark that was distributed across the cluster. Play around with the various parameters in the commands and see how they affect the results. Leave a comment with any questions or findings.
