Apache Spark is a lightning-fast cluster-computing framework designed for fast computation. In the era of big data, analysis is often constrained by the processing power of a single machine; by distributing work across a cluster, Apache Spark can increase processing power by orders of magnitude.
Apache Spark builds on the Hadoop MapReduce model and extends it to support more types of computation, including interactive queries and stream processing.
The main concern with Apache Spark is its memory consumption: in-memory processing demands substantial resources, and tuning how Spark uses memory is not especially user-friendly.
Many organisations are beginning to use Hadoop in order to analyze their vast quantities of data.
Spark is not a modified version of Hadoop, nor is it dependent on Hadoop; rather, Hadoop is one way to deploy Spark. Spark can use Hadoop in two different ways: for storage and for processing. Since Spark has its own cluster-management computation, it needs Hadoop only for storage.
Resilient Distributed Datasets (RDDs) are the fundamental data structure in Spark. The data stored in an RDD is partitioned so that the partitions can be processed on different nodes of the cluster. RDDs are fault-tolerant collections of elements that can be operated on in parallel.
There are two ways to create RDDs. The first is to parallelise an existing collection in your driver program. The second is to reference a dataset in an external storage system that offers a Hadoop InputFormat.