Apache Flume is a service for streaming logs into Hadoop. It is a distributed, reliable service for efficiently collecting, aggregating and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). In this article, the authors use Flume to gather Twitter data for analysis.
Hadoop is an open source, Java-based Apache framework, mainly used to store and process very large data sets across clusters of computers. A Hadoop application can scale up to thousands of machines, each offering its own local computation and storage. Two parts of the framework are considered its core: HDFS, the Hadoop Distributed File System, which handles storage; and MapReduce, which takes care of data processing. Listed below are the modules that are part of the Hadoop framework:
- Hadoop Common
- HDFS – Hadoop Distributed File System
- Hadoop YARN
- Hadoop MapReduce
Challenges with data load in the Hadoop framework
Once the data gets into the Hadoop framework, you have won half the battle; the real struggle begins when new, fast-moving data streams start flowing in. System and application logs, geolocation and monitoring data, and social media trends and updates all constitute fast-moving streams demanding storage in the Hadoop Distributed File System (HDFS). Hence the need arose for a standard way to direct all these data streams into HDFS efficiently. Many solutions have evolved to fill this gap, and the most widely used one is Apache Flume.
What is Apache Flume?
Flume is an extremely reliable service, used to efficiently collect, aggregate and move large amounts of data. Multiple Flume agents can be configured to collect high volumes of data. It also provides a simple, extensible data model that can be used for any analytic application. Some of the features of Apache Flume are given in Table 1.
| Feature | Description |
|---|---|
| Streaming data aggregation | Flume takes in streams from multiple sources and pushes them onto HDFS for storage and analysis. |
| Contextual routing and buffering | Flume acts as a mediator when the incoming data rate spikes, smoothing it into a steady flow. It also provides contextual routing. |
| Reliable and efficient data delivery | Flume uses channel-based transactions to guarantee reliable message delivery, with exactly one sender and one receiver for every message on a hop. |
| Scales horizontally | Flume can take on new data streams and additional storage volume on the fly. |
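As a concrete illustration of this source-channel-sink model, here is a minimal, hypothetical flume.conf for a single agent (all names here are our own, not from the Twitter setup below): a netcat source feeds a memory channel, which drains into a logger sink.

```properties
# a1: one agent wired as source -> channel -> sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: turns each line received on localhost:44444 into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink: writes events to the agent's log/console
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Started with `flume-ng agent -n a1 -f flume.conf`, anything typed into `nc localhost 44444` flows through the channel and appears in the agent's log.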
This article illustrates how quickly you can set up Flume agents to collect fast-moving data streams and to push the data into Hadoop’s file system from Twitter. By the time we’re finished, you should be able to configure and launch a Flume agent and understand how various data flows are easily constructed from multiple agents.
Setting up Flume to analyse Twitter data
The first step is to install Flume. Here, we are using CentOS 6.7 in a virtual machine, where we are going to deploy it.
Download the Flume 1.4.0 binary tarball (apache-flume-1.4.0-bin.tar.gz) from an Apache mirror.
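A small sketch of how the download URL is put together, assuming the archive.apache.org mirror layout (adjust the mirror if needed):

```shell
#!/bin/sh
# Build the download URL for a given Flume binary release.
# The archive.apache.org path used here is an assumption.
flume_url() {
    echo "https://archive.apache.org/dist/flume/$1/apache-flume-$1-bin.tar.gz"
}

flume_url 1.4.0
```

The tarball can then be fetched with, for example, `wget "$(flume_url 1.4.0)"`.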
Extract the file from the Flume tar file, as follows:
tar -xvf apache-flume-1.4.0-bin.tar.gz
Put the apache-flume-1.4.0-bin directory inside the /usr/lib/ directory:
sudo mv apache-flume-1.4.0-bin /usr/lib/
Download the flume-sources-1.0-SNAPSHOT.jar and add it to the Flume class path (the JAR contains the Java classes to pull the tweets and save them into HDFS):
sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /usr/lib/apache-flume-1.4.0-bin/lib/
Check that the snapshot JAR is now present in the lib folder of Apache Flume.
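A quick sketch of such a check, assuming the install path used above (override FLUME_LIB if your layout differs):

```shell
#!/bin/sh
# Report whether a JAR landed in the Flume lib directory.
# The default path below is an assumption matching this article's setup.
FLUME_LIB=${FLUME_LIB:-/usr/lib/apache-flume-1.4.0-bin/lib}

check_jar() {
    if [ -f "$FLUME_LIB/$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_jar flume-sources-1.0-SNAPSHOT.jar
```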
Copy flume-env.sh.template content to flume-env.sh:
cd /usr/lib/apache-flume-1.4.0-bin/
sudo cp conf/flume-env.sh.template conf/flume-env.sh
sudo gedit conf/flume-env.sh

Set JAVA_HOME and FLUME_CLASSPATH in flume-env.sh:

JAVA_HOME=<Path to Java JDK Folder>
FLUME_CLASSPATH="/usr/lib/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"

conf/flume.conf should define the agent's source (Twitter), channel (memory) and sink (HDFS), as shown below.
To configure the Flume agent, type:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = opensource, OSFY, opensourceforyou

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

You can add multiple tracking keywords to the keywords property of the source, separated by commas.
The consumerKey, consumerSecret, accessToken and accessTokenSecret have to be replaced with those obtained from https://dev.twitter.com/apps.
TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the location in HDFS where the tweets will be written.
The TwitterAgent.sources.Twitter.keywords value can be modified to get the tweets for some other topic like football, movies, etc.
To start Flume, type:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
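Once the agent is running and files begin to roll into HDFS, you can list and inspect them with `hdfs dfs -ls /user/flume/tweets/` and `hdfs dfs -cat`. Each stored line is one JSON-encoded tweet; as a quick, crude sketch (a real pipeline would use a proper JSON parser, and the sample input here is made up), a sed filter can pull out the text field:

```shell
#!/bin/sh
# Extract the "text" field from one-JSON-object-per-line tweet data.
# Crude sed approach; it breaks on escaped quotes inside the tweet text.
tweet_text() {
    sed -n 's/.*"text":"\([^"]*\)".*/\1/p'
}

echo '{"id":1,"text":"hello flume","lang":"en"}' | tweet_text
# prints: hello flume
```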
Apache Flume is extensively used these days because of the reliable delivery it provides (exactly one sender and one receiver per hop) and its recovery features, which help us recover rapidly in the event of a crash. Also, since it is part of the Apache Hadoop ecosystem, good support is available, and there are always great options for utilising all the features this tool provides.