Big data is exactly what the name suggests: big data. Typically, it is a collection of very large datasets that cannot be processed using traditional computing techniques, or that can be processed only in an impractical amount of time (potentially beyond a typical human lifetime).
Big data involves data produced by devices and applications. Examples of these include the following:
From these examples, big data can be seen as data of huge volume, high velocity and extensible variety.
Big data typically falls into one of three different categories.
Being able to analyse and interpret the patterns and behaviours contained within big data brings significant benefits to its owners. For example, a better understanding of customer behaviour can assist with the development of new products.
By accurately analysing this data, a data scientist can provide advice on key decisions that significantly improve operational efficiency and reduce cost and risk.
To get the most out of big data, appropriate technological infrastructure must be in place to process vast volumes of data in real time whilst maintaining the privacy and security of that data. Technology for handling big data is already available from private companies; however, I will be focusing on open source big data technology.
Big data technologies fall broadly into two categories: operational and analytical.
Operational big data technology includes NoSQL databases such as MongoDB. It provides operational capabilities for real-time, interactive workloads where data is captured and stored.
Analytical big data technology includes Massively Parallel Processing (MPP) database systems and MapReduce. It provides analytical capabilities for analysis that may touch most or all of the data. MapReduce, a programming model introduced by Google, is a method of analysing data that complements, and in some ways resembles, the way SQL analyses data. The difference is that MapReduce can be scaled out across many machines to process vastly larger quantities of data than a traditional SQL database on a single server.
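To make the MapReduce idea more concrete, here is a minimal single-machine sketch in Python that counts words across a handful of records. It is only an illustration of the map, shuffle and reduce steps, not how Hadoop or Google's system is implemented; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample records are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this across machines between the two phases.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence on one machine (hypothetical helper)."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Made-up sample records, standing in for lines of a much larger dataset.
    sample = ["big data big value", "data at scale"]
    print(map_reduce(sample))
```

The result is roughly what a SQL `GROUP BY` with `COUNT(*)` would produce over a table of words. The point of the split into phases is that each call to `map_phase` and each call to `reduce_phase` is independent of the others, so a framework is free to run them in parallel on different machines.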
These two classes of technology are complementary and are typically deployed together.