Developers

Analyse and gather Twitter data using Apache Flume

August 29, 2016

10675

Apache Flume is a service for streaming logs into Hadoop. It is a distributed and reliable service for efficiently collecting, aggregating and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). In this article, the authors adapt Flume for analysing and gathering data from Twitter.

Hadoop is an open source Apache framework based on Java, and is mainly used to store and process very large data sets of computer clusters. A Hadoop framework application can be scaled up to thousands of machines, each offering its own local computation and storage. There are two parts to the Hadoop framework, which is considered to be the core. One is HDFS, which stands for Hadoop Distributed File System — this is the storage part, The other is MapReduce, which takes care of data processing. Listed below are the modules that are part of the Hadoop framework,

Hadoop Common
HDFS – Hadoop Distributed File System
Hadoop Yarn
Hadoop MapReduce

Challenges with data load in the Hadoop framework
Once the data gets into the Hadoop framework, you have won half the battle but the real struggle begins when new, fast-moving data streams starts flowing in. For example, system or application logs, geolocation data and monitoring, as well as social media trends and updates — all constitute fast-moving streams requesting for storage in the Hadoop Distributed File System (HDFS). Hence, the need arose for creating a standard on how all the data streams could be directed into the HDFS more efficiently. That’s where the evolution began and, currently, there are many solutions in place and the most widely used one being Apache Flume.

What is Apache Flume?
Flume is an extremely reliable service, which is used to efficiently collect, aggregate and store large amounts of data. Multiple Flume agents can be configured to collect high volumes of data. It also provides a simple extensible data model that can be used for any analytic application. Some of the features of Apache Flume are given in Table 1.

Table 1

Feature	Description
Streaming data and aggregator	This feature allows Flume to take in streams from multiple sources and then push them onto HDFS for storage and analysis
Contextual routing and it acts as a buffer	This feature helps it act as a mediator when the incoming data rate increases, and provides a steady flow of data. Also, it provides contextual routing.
Reliable and efficient data delivery	Flume uses channel-based transactions to guarantee reliable message delivery. It uses channel-based delivery, where there is always one sender and one receiver contained for every message.
Scales horizontally	Has the ability to take in new data streams and additional storage volume, on the fly.

This article illustrates how quickly you can set up Flume agents to collect fast-moving data streams and to push the data into Hadoop’s file system from Twitter. By the time we’re finished, you should be able to configure and launch a Flume agent and understand how various data flows are easily constructed from multiple agents.

Setting up Flume to analyse Twitter data
The first step is to install Flume. Here, we are using CENT OS 6.7 in a virtual machine where we are going to deploy it.
Download Flume using the following command:

wget http://archive.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz

Extract the file from the Flume tar file, as follows:

tar -xvf apache-flume-1.4.0-bin.tar.gz

Put the apache-flume-1.4.0-bin directory inside the /usr/lib/ directory:

sudo mv apache-flume-1.4.0-bin /usr/lib/

Download the flume-sources-1.0-SNAPSHOT.jar and add it to the Flume class path (the JAR contains the Java classes to pull the tweets and save them into HDFS):

sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /usr/lib/apache-flume-1.4.0-bin/lib/

Check whether the Flume snapshot has moved to the lib folder of Apache Flume:

ls /usr/lib/apache-flume-1.4.0-bin/lib/flume*

Copy flume-env.sh.template content to flume-env.sh:

cd /usr/lib/apache-flume-1.4.0-bin/
sudo cp conf/flume-env.sh.template conf/flume-env.sh

Edit flume-env.sh:

sudo gedit conf/flume-env.sh

Set JAVA_HOME and FLUME_CLASSPATH in the flume-env.sh

JAVA_HOME = <Path to Java JDK Folder>

FLUME_CLASSPATH = “/usr/lib/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar”

conf/flume.conf should have all the agents (Flume, memory and HDFS) defined as below.

To configure the Flume agent, type:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
 
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>

TwitterAgent.sources.Twitter.keywords = opensource, OSFY, opensourceforyou -> Here you could add multiple keywords into the source engine

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
 
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

The consumerKey, consumerSecret, accessToken and accessTokenSecret have to be replaced with those obtained from https://dev.twitter.com/apps.
TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the location in HDFS where the tweets will go to.
The TwitterAgent.sources.Twitter.keywords value can be modified to get the tweets for some other topic like football, movies, etc.
To start Flume, type:

bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Apache Flume is extensively used these days due to the reliable transmission (one sender and one receiver) it provides and also due to its recovery features, which rapidly help us in the event of a crash. Also, since this is a part of the Apache Hadoop umbrella, we can get good support and always have great options to utilise all the features this tool provides.