Apache Spark

What is Apache Spark?

Apache Spark is a lightning-fast cluster computing framework designed for fast computation. In the era of big data, workloads are often constrained by the processing power of the hardware they run on. Apache Spark can help overcome some of these limitations by increasing processing throughput by orders of magnitude.

Apache Spark builds on the Hadoop MapReduce model and extends it to more types of computation, including interactive queries and stream processing.

Pros of Apache Spark

  • In-memory cluster computing. Because Spark keeps intermediate data in memory, its batch processing is faster than MapReduce, potentially 10-100 times faster.
  • Support for sophisticated analytics. Apache Spark ships with a rich set of libraries, including Spark SQL, GraphX, MLlib and Spark Streaming. Its machine learning support for predictive modelling makes it a very useful tool for any data science team (see the sketch after this list).
  • Support for many languages, including Java and Scala, as well as the data science staples Python and R.
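
To make the analytics point concrete, here is a minimal Spark SQL sketch in Scala. The file name sales.csv and the region/amount columns are hypothetical; this is an illustration of the API, not a definitive recipe:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    // Start a local Spark session (the app name is arbitrary).
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]")
      .getOrCreate()

    // Load a hypothetical CSV of sales records into a DataFrame.
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("sales.csv")

    // Register the DataFrame as a view and run an ordinary SQL query.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
      .show()

    spark.stop()
  }
}
```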

Cons of Apache Spark

The main concern with Apache Spark is its memory consumption. It needs a lot of resources, and its memory usage is not easy to tune.

Spark and Hadoop

Many organisations are beginning to use Hadoop to analyse their vast quantities of data.

Spark is not a modified version of Hadoop, nor does it depend on Hadoop; rather, Hadoop is one way to deploy Spark. Spark can use Hadoop in two different ways: for storage and for processing. Since Spark has its own cluster management, it typically needs Hadoop only for storage.
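
As a sketch of that storage-only arrangement, the job below reads its input from HDFS while Spark itself does all of the processing. The HDFS URI hdfs://namenode:9000/data/input.txt is a placeholder, not a real cluster address:

```scala
import org.apache.spark.sql.SparkSession

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-word-count")
      .getOrCreate()

    // Hadoop is used purely for storage here: the file lives in HDFS,
    // but all of the computation is done by Spark.
    val lines = spark.sparkContext.textFile("hdfs://namenode:9000/data/input.txt")

    // A classic word count expressed as RDD transformations.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```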

Resilient Distributed Datasets (RDD)

The Resilient Distributed Dataset (RDD) is the fundamental data structure in Spark. The data held in an RDD is partitioned so that it can be processed on different nodes of the cluster. RDDs are fault-tolerant collections of elements that can be operated on in parallel.

There are two ways to create an RDD. The first is to parallelise an existing collection in your driver program. The second is to reference a dataset in an external storage system that offers a Hadoop InputFormat.
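
Both methods look roughly like this in Scala; the collection contents and the file path data/events.log are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Method 1: parallelise an existing collection from the driver program.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    println(numbers.sum())  // prints 15.0

    // Method 2: reference a dataset in external storage that offers a
    // Hadoop InputFormat; textFile uses TextInputFormat under the hood.
    // The path here is hypothetical.
    val lines = sc.textFile("data/events.log")
    println(lines.count())

    sc.stop()
  }
}
```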
