Taming the Big Data Beast with Hadoop and Alternatives

Taming the Big Data with Hadoop

Taming the Big Data with Hadoop

Turn the clock back to 1970 — the year in which Edgar Codd published the paper “A Relational Model of Data for Large Shared Data Banks”. In it he described a language called “Structured Query Language” (SQL), which went on to spearhead the RDBMS era. It became the language of choice for accessing RDBMSs. Forty-two years later, we are again at the brink of a data revolution — the Big Data revolution.

The volume, velocity and variety of data (the so-called three Vs) are often cited as the characteristics of big data. Most of the producers of big data handle volumes exceeding 10 terabytes of data, some even exceeding 1 petabyte.

In terms of velocity, or the rate at which the data is produced, 30 per cent of big data producers are bringing out more than 100 gigabytes of data every day. In terms of variety, while 55 per cent of producers are bringing out structured data, the rest are producing unstructured data. While SQL is ideal for processing structured data, it cannot handle unstructured data well. New data-processing solutions are needed to handle unstructured data.

Hadoop addresses this need. It is a scalable solution for handling large volumes of unstructured data.

What is Hadoop?

Hadoop has its origins in Google’s MapReduce — a computational paradigm introduced by Google to carry out distributed computations on large sets of data on a cluster of commodity hardware. The large data is itself stored in a distributed file system, so that it can be accessed as a contiguous volume in a transparent manner. The distributed file system also needs to provide redundancy, to prevent any data loss.

MapReduce consists of two steps — a Map phase and a Reduce phase. In the Map phase, a master node partitions the input into sub-problems and distributes these tasks to worker nodes. The worker nodes process the assigned inputs and pass results back to the master. In the Reduce phase, the master node takes the answers collected from worker nodes for sub-problem tasks, and “reduces” these intermediate results to a solution for the original problem it is trying to solve.

The standard textbook example for MapReduce is to count the number of occurrences of each word in the input. The map phase starts with parsing a section of input and emitting a set of key-value pairs such as <word, 1> for each occurrence of the word it sees. The key-value pairs can then be sorted and passed to a set of reducers, which combine/reduce the same word occurrences into the total count of occurrences of each word.

The key thing to note is that both the Map and Reduce phase are parallel operations — a number of mappers can run in parallel, and so can a number of reducers. MapReduce also has the ability to restart computation in case any of the worker nodes fail, as well as supporting data reliability by replication.

Hadoop MapReduce is nothing but the open source implementation of the MapReduce paradigm, using a special-purpose distributed file system known as HDFS. Hadoop is an open source project from Apache that provides a software framework for reliable, scalable distributed computing.

Hadoop is meant to be scalable, designed to run on infrastructure ranging from a large supercomputer to a cluster of standard PCs, each offering local computing capabilities and local storage. Instead of depending on custom hardware to achieve reliability in large-scale distributed computations, reliability is built into the software library, with the assumption that hardware failures are likely to occur, and it is the software’s responsibility to handle such failures.

The Hadoop open source project consists of three basic components: Hadoop Utilities, Hadoop MapReduce Framework and HDFS. Hadoop also has a number of sub-projects built on it, such as HBase, a scalable distributed database for structured data; Hive, a data warehouse built atop Hadoop; and ZooKeeper, a coordinated service for distributed computations atop Hadoop.

This article will cover just Hadoop’s MapReduce framework. Let us look at the various offerings based on Hadoop’s MR framework and its alternatives.

Hadoop offerings in the market

There are a wide variety of Hadoop offerings in the market. There are the pure Hadoop distributions such as Cloudera, HortonWorks, EMC’s Hadoop version called GreenPlum HD, and MapR’s Hadoop distribution. These are packaged software distributions for setting up and running Hadoop on clusters based on the open source ASF Hadoop project. These packaged distributions of Hadoop facilitate its easy adoption, instead of the user having to download the different packages needed to set it up from hadoop.apache.org.

These Hadoop distributions are often integrated with other Data Warehousing solutions or MPP databases, so that data movement can occur seamlessly between Hadoop clusters and the other data management solutions. For instance, as part of their AWS, Amazon offers the Elastic MapReduce Service, which delivers an enterprise-grade Hadoop platform on the cloud on top of a standard ASF Hadoop distribution.

A number of enterprise vendors give Hadoop offerings on top of their data management solutions. For example, EMC’s GreenPlum Unified Analytics Platform combines GreenPlum’s Massively Parallel Database meant to handle the structured data; a Hadoop distribution based on ASF, known as Greenplum HD; and Chorus, a productivity and group-ware layer for data science teams.

By allowing data movement and interoperability between the Greenplum database and Hadoop, it is possible for a query to access both sources of data, and combine the information retrieved from these two sources in an intelligent manner.

IBM offers its own Hadoop solution, which is known as InfoSphere BigInsights. This is under IBM’s InfoSphere big data information management brand of offerings. IBM’s BigInSights supports the Jaql query language. This supports interoperability of the Hadoop offering with other IBM products such as their enterprise data-warehousing solution IBM Netezza, and their RDBMS DB2. BigInSights also includes various Hadoop enhancements, including cluster administration and management tools, as well as textual analysis tools.

Microsoft also offers a Hadoop distribution integrated with its Windows Server product, as well as with its cloud platform Azure. Microsoft has developed a Javascript API for its Hadoop distribution, which allows the developer to write MapReduce tasks in Javascript. The company also provides the integration of Hadoop with Visual Studio and .NET to enable developers to rapidly develop Hadoop-based applications on the Windows platform.

Oracle, too, has jumped on the Hadoop bandwagon by offering a Hadoop appliance. This appliance includes Hadoop, R for analytics, a new Oracle NoSQL database, and connectors to Oracle’s database and Exadata data warehousing product line.

Its Hadoop distribution is from Cloudera. Many of the existing MPP (massively parallel processing) databases that are specialised for processing structured big data also offer Hadoop connectors. For instance, Aster Data MPP (which was bought over recently by Teradata), ParAccel MPP and Vertica MPP (which was acquired by HP) offer Hadoop connectors. There are also a number of Hadoop management software solutions, to manage and troubleshoot Hadoop implementations.

Many of the Hadoop distributions themselves provide management and troubleshooting tools. There are also third-party offerings for Hadoop management from PlatformComputing, which was acquired by IBM, and Zettaset.

There are even a number of open source alternatives to Hadoop. Some of the popular ones are Disco, Spark, BashReduce, GraphLab, Storm (tech.backtype.com/), HPCC systems, etc.

Disco is an offering from Nokia, which allows developers to write MapReduce jobs in Python, while its backend is built using Erlang, the functional language, which has built-in support for consistency, fault tolerance and job scheduling. Disco does not use HDFS, but uses a fault-tolerant distributed file system DDFS. Spark was developed by UC Berkeley and it allows in-memory MR processing distributed across multiple machines. Spark is implemented using Scala.

While I cannot cover the details of each open source Hadoop alternative, I would like to make a mention of HPCC systems from LexisNexis. HPCC has its own ecosystem equivalent of Hadoop, with its own distributed file system. It supports the development of massively parallel processing jobs by allowing them to be specified in a high-level data-declarative language known as Enterprise Control Language (ECL).

There are also a number of application software products based on Hadoop. These typically simplify writing MapReduce jobs for Hadoop, or enable the analysis of data stored in Hadoop clusters through alternate means rather than the traditional MR paradigm. For instance, Karmasphere provides a Hadoop abstraction layer known as Karmasphere Analyst for Hadoop. Hadapt combines Hadoop and relational DBMS technology into a single system, making it easy to analyse multi-structured data.

Just as SQL and RDBMS marked the beginning of a new era in data management, this is the beginning of a new era in handling big data, with Hadoop heralding new techniques in processing unstructured data.


  1. who can help me how to install, config and test gatekeeper on linux???I’m doing a lab in linux but I can’t install gnu gatekeeper in it!help me!

  2. I agree. RDBMSs are inappropriate tools to handle and analyze big unstructured data we – Internet users – are producing daily. Big Data requires a new point of view, new data management techniques and tools.


Please enter your comment!
Please enter your name here