Exploring Big Data on a Desktop: Archiving and Retrieving Emails using HBase and Phoenix


In this column, the author discusses how large volumes of mail can be archived and retrieved using Big Data tools, HBase and Phoenix.

One pressing challenge for any organisation is the storage and retrieval of emails. There are some obvious fields, like the subject, the sender and the content, which need to be searched. However, the header lines may be very important as well. For example, it is easy to fake the sender, so the server that sent the mail may be significant.
An email’s core components are the header, the content and the attachments. Each line of the header is a key-value pair, with each key representing a header field. There are some fields that are visible to the user, like the subject, the sender, the recipients, the date and time. Other fields may be examined in case of problems, and a user usually needs to make an additional effort to look at them on most graphical mail clients.
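The key-value structure of the header is easy to see with Python's standard email module. The following is a minimal sketch; the sample message and the X-Mailer value are made up for illustration.

```python
#! /usr/bin/python
# Each header line is a key-value pair and can be looked up by its key.
# The raw message below is a made-up example.
import email

raw = """From: alice@example.com
To: bob@example.com
Subject: Test message
X-Mailer: hypothetical-client 1.0

Hello, Bob.
"""

msg = email.message_from_string(raw)
# Fields visible to the user
print(msg['Subject'])    # Test message
print(msg['From'])       # alice@example.com
# Fields a user rarely looks at are available the same way
print(msg['X-Mailer'])   # hypothetical-client 1.0
```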

Planning for storage
Typically, each email is stored as a record in a data store. The list of possible header fields is very large. Mapping each one to a column would just not be convenient.
This is where the concept of the column family in HBase can provide an answer. For example, you may create an HBase table with the following column families:

  • Envelope (header fields like sender, recipients, subject, date)
  • Header (header fields not included in the envelope)
  • Content
  • Attachment

Create the HBase table from the HBase shell as follows:

[fedora@h-mstr hbase-0.96.2-hadoop2]$ bin/hbase shell
hbase(main):001:0> create 'emails','envelope','header','content','attachment'
0 row(s) in 5.0390 seconds
=> Hbase::Table - emails
hbase(main):002:0> exit

Loading emails
Python has a very versatile mailbox module for working with various mail formats on the disk. As an illustration, consider messages stored in the mbox format. A message may have multiple parts and each part may be another message or an attachment. In order to keep the code as simple as possible, convert and store each part as a string. Ideally, the unique key for each email would not be a UUID as has been used in this example. It would start with a value that is useful for limiting the search of emails. For example, in a corporate set-up, it may be prefixed with the department and date. As in the previous article in this series, use happybase to connect to the thrift server of HBase. The following code in ‘load_mbox_file.py’ illustrates the idea.

#! /usr/bin/python
import mailbox
import happybase
import uuid
import sys

# Header fields that belong in the 'envelope' column family
ENVELOPE = ['Subject', 'To', 'From', 'Cc', 'Date']

# A simple approach to multipart messages:
# store each part as a string, ignoring the type of content
def multipart_payloads(columns, msg):
    num = 1
    for part in msg.get_payload():
        columns['content:' + str(num)] = part.as_string()
        num += 1

def store(table, msg):
    # Create a unique id for the row
    row_id = str(uuid.uuid1())
    # Process the headers
    columns = {}
    # Some header keywords may appear multiple times
    for key in set(msg.keys()):
        if key in ENVELOPE:
            cf = 'envelope:' + key
        else:
            cf = 'header:' + key
        # Get all the entries for a key as a list and store the
        # list as a string. Use eval to get the list back.
        columns[cf] = str(msg.get_all(key))
    # Multipart messages are split into parts; others are stored whole
    if msg.is_multipart():
        multipart_payloads(columns, msg)
    else:
        columns['content:text'] = msg.get_payload()
    table.put(row_id, columns)

connection = happybase.Connection('h-mstr')
table = connection.table('emails')
mbox = mailbox.mbox(sys.argv[1])
for message in mbox:
    store(table, message)

Run ‘load_mbox_file.py’ with the mbox filename as a parameter. The data should be stored in the HBase table ‘emails’.
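The row key suggestion made earlier, prefixing the key with values useful for limiting a search, can be sketched as follows. The department name and date format here are purely illustrative.

```python
#! /usr/bin/python
# A sketch of a row key that starts with values useful for prefix scans.
# 'sales' and the date format are hypothetical examples.
import uuid

def make_row_id(department, date):
    # Department and date come first, so all rows for one department
    # and date range sort together and can be fetched with a prefix scan
    return '%s-%s-%s' % (department, date, uuid.uuid1())

row_id = make_row_id('sales', '2014-11-30')
print(row_id)
```

Rows for, say, the sales department in November could then be retrieved with a scan whose row prefix is 'sales-2014-11'.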

In the above example, attachments have been treated like any other content. Hence, the column family ‘attachment’ has been ignored.
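If attachments were to go into the ‘attachment’ column family after all, the parts that carry a filename could be separated from the inline content. The following is only a sketch; the sample message and the column qualifiers are illustrative, not part of the load script above.

```python
#! /usr/bin/python
# A sketch of separating attachment parts from inline content, so that
# attachments could be stored in the 'attachment' column family.
# The constructed message below is a made-up example.
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

msg = MIMEMultipart()
msg['Subject'] = 'Report'
msg.attach(MIMEText('Please find the report attached.'))
attachment = MIMEApplication(b'fake-bytes', Name='report.pdf')
attachment['Content-Disposition'] = 'attachment; filename="report.pdf"'
msg.attach(attachment)

columns = {}
num = 1
for part in msg.walk():
    if part.is_multipart():
        continue                # skip the multipart container itself
    if part.get_filename():     # a filename marks an attachment part
        columns['attachment:' + part.get_filename()] = part.as_string()
    else:                       # inline content
        columns['content:' + str(num)] = part.as_string()
        num += 1

print(sorted(columns.keys()))   # ['attachment:report.pdf', 'content:1']
```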

Searching the emails
There are quite a few options for running SQL queries on HBase tables, including Apache Phoenix, Impala, HBase with Hive, Apache Drill, etc. Apache Drill is interesting as it seems to be very versatile and flexible. However, at present, it only works with HBase 0.94, while the current version is 0.98. So, let us experiment with Apache Phoenix instead (http://phoenix.apache.org/), an SQL layer written specifically for HBase.
Installation involves copying the phoenix-*.jar files into the HBase lib directory and restarting HBase. Phoenix comes with a Python utility, sqlline.py. You may create and populate a table in this utility using standard SQL syntax, and it will create and populate the corresponding HBase table.
You might want to use the existing HBase table, ‘emails’, instead, by creating a view over it. For example:

[fedora@h-mstr phoenix-4.2.2-bin]$ bin/sqlline.py localhost
0: jdbc:phoenix:localhost> CREATE VIEW eview (pk VARCHAR PRIMARY KEY, "envelope"."Subject" VARCHAR, "envelope"."To" VARCHAR, "envelope"."From" VARCHAR, "envelope"."Date" VARCHAR, "header"."Content-Type" VARCHAR) AS SELECT * FROM "emails";

The double quotes are needed because sqlline.py converts unquoted table and column names to upper case. The column names of the view become the column qualifiers within the respective column families.
Now, you can run some queries on it:

0: jdbc:phoenix:localhost> select * from eview where "Subject" is not null limit 5;
0: jdbc:phoenix:localhost> select * from eview where "Content-Type" ilike '%multipart%' limit 5;
0: jdbc:phoenix:localhost> select "From", "Subject" from eview where "Date" ilike '%jan%2011%' limit 5;

In case you need to examine some other column in a column family, create another view. You can find out more about the options available at http://phoenix.apache.org/language/; in particular, you can learn to create secondary indices on a view.
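For instance, a secondary index on the subject might be created from the sqlline.py prompt roughly as follows; the index name here is hypothetical, and the exact syntax should be checked against the Phoenix grammar page mentioned above:

```
0: jdbc:phoenix:localhost> CREATE INDEX subject_idx ON eview ("envelope"."Subject");
```

Queries that filter on “Subject”, like the first of the queries above, could then avoid a full scan of the table.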
It would be easy to include the storage of text messages, other than emails, as well in the above system. You could probably still index the contents and attachments using elasticsearch or a similar tool. Combining the text search with the SQL query layer on HBase makes it possible to offer remarkable solutions to your users, helping them to analyse and search massive volumes of data with minimal effort.

