Mahout is an open source machine learning library from the Apache Software Foundation. It implements many data mining algorithms, such as recommender engines, clustering and classification, and scales to very large data sets (terabytes and even petabytes) in the Big Data realm.
This article explains clustering techniques in Mahout. Clustering arranges a group of abstract objects into classes of similar objects. It works on a collection of unlabelled data, which is analysed to identify patterns so that similar items can be grouped together; in other words, clustering discovers groups of objects with similar features. Clustering techniques are used in many fields: in market research to discover different groups of customers, in image processing for image segmentation and pattern recognition, and in text or document classification for information discovery. They are also used in outlier detection (also called anomaly detection), i.e., identifying events or observations that do not conform to an expected pattern, which helps in identifying fraud in online transactions.
The clustering algorithms implemented in Apache Mahout are K-Means, Fuzzy K-Means, Streaming K-Means and Spectral Clustering. Clustering a group of objects involves three things:
- An algorithm, which is the method used to group things together.
- A notion of similarity and dissimilarity, which decides whether an item belongs to an existing stack or should start a new one.
- A stopping condition, which might be the point beyond which objects can't be stacked any more, or when the stacks are already quite dissimilar.
I will explain the clustering of news articles using K-means clustering, using Mahout on top of Hadoop. In this example, similar types of news articles will be grouped together. Hence, all news articles related to politics will be grouped together, those related to sports will be grouped together, and so on. The K-means clustering algorithm partitions n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
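The partitioning idea behind K-means can be sketched in a few lines of plain Python. This is a toy single-machine illustration of the algorithm, not Mahout's distributed implementation:

```python
import random

def kmeans(points, k, max_iter=10):
    """Toy K-means: partition points into k clusters by nearest mean."""
    centroids = random.sample(points, k)  # pick k random points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, clusters
```

Run on two well-separated groups of 2-D points with k=2, the algorithm quickly converges so that each group forms its own cluster; Mahout performs the same assignment/update iterations, but as MapReduce jobs over vectors stored in HDFS.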
There are three steps involved in inputting data for the Mahout clustering algorithms. These are as follows.
- Identify various news articles, paste them in different text files in a single folder, and copy this folder to a Hadoop file system using the following command:
hadoop fs -put /home/hduser/news_article /news_article

Here, /home/hduser/news_article is the source (local) directory and /news_article is the destination path in HDFS.
- Convert the files to sequence files using the seqdirectory command, which generates an intermediate document representation in SequenceFile format from text documents under a directory structure:
./bin/mahout seqdirectory -i /news_article -o /news_article_seqdir_output -ow

Here, -i is the path to the job input directory, -o is the directory pathname for output, and -ow, if present, overwrites the output directory before running the job.
Now convert the text documents in the SequenceFile format to vectors using the following command:
./bin/mahout seq2sparse -i /news_article_seqdir_output -o /news_article_sparse_output
If you issue the command hadoop fs -ls /news_article_sparse_output, you can see several files and folders, such as df-count, dictionary.file-0, frequency.file-0, tf-vectors, tfidf-vectors, tokenized-documents and wordcount.
The dictionary file contains the mapping between each term and its integer ID; the other folders are intermediate folders generated during the vectorization process. TF-IDF stands for term frequency-inverse document frequency. The term frequency (tf) measures how often a term occurs in a document; it is the ratio of the number of times the word occurs in the document to the total number of words in the document. The inverse document frequency (idf) is the log of the ratio of the total number of documents to the number of documents containing the term. The tf-idf weight is simply the product tf * idf.
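The formulas above can be made concrete with a short Python sketch. Note that this is a minimal illustration of the definitions; Mahout's seq2sparse applies additional normalization and weighting options on top of the basic tf-idf scheme:

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of term / total words in the document.
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # Inverse document frequency: log(total docs / docs containing the term).
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    # tf-idf is simply the product of the two.
    return tf(term, doc) * idf(term, docs)
```

A word that appears in every document (such as "the") gets idf = log(1) = 0, so its tf-idf weight is zero; distinctive words that occur in few documents get boosted, which is why tf-idf vectors work well as input to clustering.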
- The next step is to run the K-means clustering algorithm:
./bin/mahout kmeans -i /news_article_sparse_output/tfidf-vectors -c /news_kmeans_cluster -o /news_kmeans_cluster_output -x 2 -ow -k 50
This command takes the TF-IDF vectors as input; -c is the folder for the input centroid vectors, -o is the output directory, -x is the maximum number of iterations, and -k specifies the number of clusters (when -k is given, k randomly selected points are written to the input clusters folder as the initial centroids). An optional -dm flag sets the distance measure class (SquaredEuclideanDistanceMeasure is used by default).
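For reference, the default squared Euclidean distance measure is just the sum of squared per-dimension differences between two vectors. A one-line Python equivalent (an illustration, not Mahout code):

```python
def squared_euclidean(a, b):
    # Sum of squared per-dimension differences, as in Mahout's
    # default SquaredEuclideanDistanceMeasure.
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Because it omits the final square root, it preserves the ordering of distances while being cheaper to compute, which is why it is a common default for K-means.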
Mahout has a cluster dumper utility that can be used to retrieve and evaluate clustering data. Next, we need to run cluster dump:
./bin/mahout clusterdump -i /news_kmeans_cluster_output/clusters-1-final -o /home/hduser/news_kmeans.txt -p /news_kmeans_cluster_output/clusteredPoints -n 20
Here, -i is the path to the job input directory, -o is the output file pathname (it should point to your local file system, not HDFS), -p is the directory containing the points sequence files that map input vectors to their cluster (when given, the program outputs the points associated with each cluster), and -n is the number of top terms to print.
In the generated output, each cluster is printed with an identifier and three fields: n, the number of elements in the cluster; c=[z, …], the centroid of the cluster, with the zs being the weights of the different dimensions; and r=[z, …], the radius of the cluster. This output can again be parsed programmatically for further interpretation of the different clusters.
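Programmatic interpretation could start with a small parser like the following. This is a sketch that assumes a simplified record containing the n=, c=[…] and r=[…] fields described above; the exact layout of clusterdump output varies between Mahout versions, so the regular expressions may need adjusting:

```python
import re

def parse_cluster_record(line):
    """Extract n, centroid and radius from one (simplified) clusterdump record.

    Assumes the record contains n=<count>, c=[...] and r=[...] fields,
    as described in the text; real output may differ by Mahout version.
    """
    n = int(re.search(r"n=(\d+)", line).group(1))
    c = [float(x) for x in re.search(r"c=\[([^\]]*)\]", line).group(1).split(",")]
    r = [float(x) for x in re.search(r"r=\[([^\]]*)\]", line).group(1).split(",")]
    return n, c, r
```

From here, one could, for example, sort clusters by n to find the largest topics, or compare radii to spot loose, incoherent clusters.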
 Mahout in Action by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman ©2012 by Manning Publications Co. All rights reserved.