We live in a world of information and data overload. Processing data at this scale efficiently calls for a distributed framework such as Mahout which, in conjunction with Hadoop, can be used for data mining. Its use by large companies like Facebook, Foursquare, Twitter, LinkedIn and Yahoo! is testimony to its effectiveness.
Apache Mahout is an open source project that provides scalable libraries of machine learning algorithms. It started as a subproject of Apache Lucene and, in 2010, became a top-level Apache project. There are many open source machine learning libraries available, but Apache Mahout stands out for various reasons:
- Many other open source libraries are only research oriented.
- Documentation and examples are scarce for many open source ML libraries, whereas ample resources are available for Mahout.
- Its community is more active than those of most comparable libraries.
- Apache Mahout scales better than most alternatives.
All these reasons make Apache Mahout a strong choice for creating ML algorithms that scale.
Mahout rests on three pillars:
- Recommender engines (collaborative filtering): Recommenders can be classified as user based (recommending what similar users liked) or item based (recommending items similar to those a user already liked). Amazon and Facebook use this technique to attract users and suggest products by mining user behaviour. An example of how it is used is shown in Figure 1.
- Clustering: Clustering groups objects that share similarities, without the groups being defined in advance. There are many algorithms available for clustering, like K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet, etc. An app called Summly, for example, shows the news from different news sites in brief (Figure 2).
- Classification: Classification techniques decide whether an item belongs to a given type. Here, the types are predetermined. A classifier learns from the features of items already assigned to each type, and can then predict the type of any new object based on its features. Facebook’s face detection and spam checker use this technique.
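To make the user-based recommender idea concrete, here is a minimal sketch in plain Python. It does not use Mahout's API; the ratings, the function names and the choice of cosine similarity are all assumptions made for illustration. The idea is to score items a user has not seen by the ratings of other users, weighted by how similar those users are.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating}), invented for illustration.
ratings = {
    "alice": {"book": 5.0, "film": 3.0, "game": 4.0},
    "bob":   {"book": 4.0, "film": 3.0, "game": 5.0, "album": 4.0},
    "carol": {"book": 1.0, "film": 5.0, "album": 2.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only score items the user has not rated yet
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scored = [(item, totals[item] / weights[item]) for item in totals]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

print(recommend("alice", ratings))
```

An item-based recommender would follow the same pattern, but compare items by the users who rated them rather than comparing users by the items they rated.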
Features of the Mahout framework
- The Mahout framework is tightly coupled with Hadoop, which makes it well suited to distributed environments: Mahout uses the Apache Hadoop library to scale in the cloud.
- Developers can use Mahout for mining large volumes of data as it is a ready-to-use framework.
- Mahout lets applications analyse large data sets faster and more effectively.
- Mahout supports MapReduce-enabled clustering implementations, for example, K-Means, Fuzzy K-Means, Canopy, Dirichlet and Mean-Shift.
- It also supports distributed Naive Bayes and Complementary Naive Bayes classification implementations.
- Distributed fitness function capabilities for evolutionary programming are built into Apache Mahout.
- It includes vector and matrix libraries.
- It also has examples of all the above-mentioned algorithms.
- Mahout has strong community support, which few other open source ML libraries can match.
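As a rough illustration of what "MapReduce-enabled clustering" means, the sketch below expresses K-Means as a map step and a reduce step in plain Python, running in a single process. It is not Mahout code and makes no claim about Mahout's internals: the function names and sample points are assumptions, and it only follows the usual MapReduce formulation of K-Means, in which the map step assigns each point to its nearest centroid and the reduce step averages each group into a new centroid.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))

def map_step(points, centroids):
    """Map: emit a (cluster index, point) pair for every point."""
    return [(nearest(point, centroids), point) for point in points]

def reduce_step(pairs, centroids):
    """Reduce: average the points of each cluster into a new centroid."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = list(centroids)  # a centroid with no points stays where it is
    for index, members in groups.items():
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated

def k_means(points, centroids, max_iterations=10):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iterations):
        updated = reduce_step(map_step(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

In a distributed setting, the map step can run in parallel over partitions of the data because each point is assigned independently; only the per-cluster averages need to be brought together in the reduce step.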
Applications of the Mahout framework
- IT giants like Facebook, LinkedIn, Adobe, Twitter and Yahoo! use Mahout.
- Foursquare uses the Mahout recommender engine to suggest places, food and entertainment options to your liking in a specific area.
- Twitter uses Mahout to model user interests.
- Yahoo! uses the pattern mining of the Mahout framework.