Big Data

What is Big Data?

Big data is exactly what the name suggests: big data. Typically, big data is a collection of very large datasets that cannot be processed using traditional computing techniques, or that could be processed only in an impractical amount of time (potentially longer than a typical human lifetime).

Examples of Big Data

Big data involves data produced by devices and applications. Examples of these include the following:

  • Social Media Data: Think about the number of Facebook posts, Twitter tweets and Instagram pictures that are created every day.
  • Transport Data: Data that tracks the location and status of vehicles. Think about how Uber needs to keep track of where its vehicles are so that it can allocate an available vehicle to a new passenger.
  • Search Engine Data: Search engines such as Google need to store the pages that are available on the world wide web, as well as additional information about these pages. They also need a way to retrieve the relevant pages in a timely manner in response to a user's search request.

From these examples, Big Data can be seen as data of huge volume, high velocity and extensible variety.

Types of Big Data

Big data typically falls into one of three categories:

  1. Structured Data: Relational databases.
  2. Semi-Structured Data: XML data.
  3. Unstructured Data: Word documents, PDF documents, free-form text, media logs etc.
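
The three categories above can be illustrated with a small sketch in Python. It is only an illustration under assumptions: the orders table, the XML layout and the function names are all hypothetical and are not taken from any real system. The point is that structured data can be queried against a fixed schema, semi-structured data carries its own tags but no fixed schema, and unstructured data needs some processing before anything can be queried at all.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter


def structured_totals(orders):
    """Structured: rows follow a fixed schema and can be queried with SQL.

    `orders` is a list of (customer, amount) tuples (a made-up example schema).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
    conn.close()
    return dict(rows)


def semi_structured_customers(xml_text):
    """Semi-structured: tags describe the data, but records may vary in shape."""
    root = ET.fromstring(xml_text)
    return [order.get("customer") for order in root.findall("order")]


def unstructured_word_counts(text):
    """Unstructured: no schema at all, so we impose one (here, word counts)."""
    return Counter(text.lower().split())
```

For example, `structured_totals([("ann", 10.0), ("ann", 2.5)])` sums the amounts per customer, while `unstructured_word_counts` has to split raw text into words before any counting is possible.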

Being able to analyse and interpret the patterns and behaviours contained within this big data can bring significant benefits to its owners. One example is a better understanding of customer behaviour, which can assist with the development of new products.

By accurately analysing this data, a data scientist can provide advice on key decisions that can significantly improve operational efficiency, reduce costs and reduce risk.

Big Data Technology

To get the most out of Big Data, appropriate technological infrastructure must be put in place to process the vast volumes of data in real time whilst maintaining the privacy and security of that data. Technology is already available from private companies to handle Big Data; however, I will be focusing on open source Big Data technology.

Big Data technologies can be broadly divided into two categories:

  1. Operational Big Data
  2. Analytical Big Data

Operational Big Data

This technology includes systems such as NoSQL databases, for example MongoDB. It provides operational capabilities for real-time, interactive workloads where data is captured and stored.

Analytical Big Data

This technology includes systems such as Massively Parallel Processing (MPP) database systems and MapReduce. It provides analytical capabilities for analysis that may touch most or all of the data. MapReduce, a programming model introduced by Google, is a method of analysing data that complements the way SQL analyses data. The difference is that MapReduce can be scaled out across many machines to process vastly larger quantities of data than SQL on a traditional database.
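
To make the MapReduce idea more concrete, here is a minimal single-machine sketch in Python of the classic word count example. It is not how Hadoop or Google's own implementation works internally, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are my own, chosen for illustration. In a real cluster the map and reduce steps would run in parallel on many machines; here they run one after another in a single process.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = []
    for document in documents:
        # Each document could be mapped independently on a different machine.
        mapped.extend(map_phase(document))
    grouped = shuffle_phase(mapped)
    # Each key could likewise be reduced independently of the others.
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "big big"])` counts each word across both documents. Because every document is mapped on its own and every word is reduced on its own, the work can be spread over as many machines as there are documents or keys, which is where the scalability comes from.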

These two classes of technology are complementary and are typically deployed together.
