Carry Out Data Mining and Machine Learning with WEKA

WEKA is a collection of visualisation tools and algorithms for data analysis and predictive modelling. Its popularity among users and developers is reflected in its large number of downloads.

In the current era of Big Data, analysing huge data sets has become a complex and challenging task. Assorted software products for data interpretation and behaviour analysis are available in the global market, yet data mining and machine learning tools continue to be actively researched and enhanced.

Data mining, or to put it simply, the knowledge discovery process, is one of the thrust areas for scientists, practitioners and research scholars. It searches for patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics and database systems. Data mining extracts information from a large data set and transforms it into a structure that can be used for analysis and future predictions. Data mining tasks involve data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualisation and online updating.

A number of free and open source software tools are available for data mining and machine learning tasks. They can handle data sets from any discipline, be it agriculture, defence, forensics, gaming, business, science or medicine.

WEKA (Waikato Environment for Knowledge Analysis) is open source software that is widely used in the research community. Thousands of practitioners and research scholars run data mining algorithms with it. WEKA is machine learning software developed in Java by the University of Waikato, New Zealand, and released under the GNU General Public License. It bundles a collection of machine learning algorithms for data mining tasks, and these algorithms can be applied to datasets from any domain. Besides being applied directly from the tool, the algorithms can be called from your own Java code. The software has a collection of tools for the basic data mining tasks, including data pre-processing, classification, regression, clustering, association rules and visualisation. It is also well suited to developing new algorithms for data mining and machine learning.

The official website for WEKA is http://www.cs.waikato.ac.nz/ml/weka. The tool can be downloaded for multiple operating systems from http://www.cs.waikato.ac.nz/ml/weka/downloading.html. Rich documentation is available for users as well as developers. At the time of writing, the current stable release of WEKA is 3.6.10. It can also be downloaded from the Web-based source code repository SourceForge.net. The popularity of WEKA can be gauged from its download statistics: 16,000 downloads per week from SourceForge.net (as noted from the SourceForge.net download statistics).

The WEKA user interface
Explorer is the main user interface of WEKA, but the same functionality can be accessed through the component-based Knowledge Flow interface as well as from the command line. There is also a module called Experimenter, which allows the systematic comparison of the predictive performance of machine learning algorithms on a collection of datasets. Explorer has several panels providing access to the main components of the workbench. The Preprocess panel has facilities for importing data from a database, a CSV file and other sources, and for preprocessing this data using a filtering algorithm. Such filters can be used to transform the data and to delete instances and attributes that match specific criteria.

The Classify panel lets you apply classification and regression algorithms to the dataset, estimate the accuracy of the resulting predictive model, and visualise erroneous predictions, ROC curves or the model itself. The Associate panel provides access to association rule learners, which identify the interrelationships between attributes in the data. The Cluster panel provides access to the clustering techniques, including the simple k-means algorithm and many others. The Select attributes panel provides access to algorithms that identify the most predictive attributes in a dataset. The Visualize panel shows a scatter plot matrix in which individual scatter plots can be selected, enlarged and analysed using various selection operators.

Data file formats in WEKA
WEKA can load data files in multiple formats including ARFF, CSV, XML, LIBSVM, BSI, DAT and C4.5. A spreadsheet can be easily converted to CSV (Comma Separated Values). Once WEKA has imported the CSV file, any of its machine learning algorithms can be applied to the data.

Of these file formats, ARFF (Attribute-Relation File Format) is the one classically used for the analysis of datasets.
ARFF files have two distinct sections. The first is the header information, which is followed by the data information. Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive. The ARFF Header section of the file contains the relation declaration and attributes declarations. Each attribute in the data set has its own @attribute statement, which uniquely defines the name of that attribute and its data type.
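The two-part layout described above can be illustrated with a small reader. What follows is only a minimal Python sketch under stated assumptions, not WEKA's own parser: the sample relation and its values are invented for illustration, the function name parse_arff is hypothetical, and the code assumes dense rows with no quoted values that contain commas.

```python
# Minimal ARFF reader: an illustrative sketch, NOT WEKA's parser.
# Assumptions: dense @data rows, no quoted values containing commas.

# Hypothetical sample file, invented for this example.
SAMPLE_ARFF = """\
% Lines that begin with a percent sign are comments
@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature NUMERIC
@ATTRIBUTE play {yes, no}

@DATA
sunny,85,no
overcast,83,yes
rainy,70,yes
"""


def parse_arff(text):
    """Return (relation, attributes, rows) parsed from ARFF text."""
    relation = None
    attributes = []  # (name, kind); kind is "numeric", a list of labels, or a type name
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":          # "?" marks a missing value
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value
            rows.append(row)
            continue
        # Header declarations are case insensitive, so compare in lower case.
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip()
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            spec = spec.strip()
            if spec.startswith("{"):
                kind = [label.strip() for label in spec.strip("{}").split(",")]
            elif spec.lower() in ("numeric", "real", "integer"):
                kind = "numeric"
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(rows))  # prints: weather 3 3
```

Each parsed row is a dictionary keyed by attribute name, which mirrors the tabular view of the same file.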

An ARFF file can be easily viewed in tabular form using the ARFF viewer of the WEKA tool.

One exceptional feature of WEKA is the database connection using JDBC with any RDBMS package. Let’s suppose that you have collected data from a government organisation or agency. If the data collected is in a database format (MySQL, MS Access, DB2 or any other), there is no need to worry about converting millions of records into the WEKA-specific file formats. We can simply connect WEKA to any database management software using JDBC, and import any number of records from that database server.

Implementation of the Apriori algorithm
The Apriori algorithm is one of the most widely used approaches for frequent item set mining and association rule learning on transactional databases. It works by identifying the frequent individual items in the database and extending them to larger and larger item sets, as long as those item sets appear sufficiently often in the database. The frequent item sets found by Apriori can be used to establish association rules, which highlight general trends in the database. This has applications in domains such as market basket analysis, forensics, agriculture and many others.
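The level-wise idea described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not WEKA's Apriori implementation: min_support is taken here as an absolute transaction count, and the basket data and the function names apriori and rules are invented for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent item sets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate item set.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-item sets into k-item sets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples."""
    out = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # confidence = support(itemset) / support(antecedent)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    out.append((antecedent, itemset - antecedent, confidence))
    return out


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
frequent_sets = apriori(baskets, min_support=3)
print(len(frequent_sets))  # 6: three single items and three pairs
for antecedent, consequent, confidence in rules(frequent_sets, 0.75):
    print(sorted(antecedent), "=>", sorted(consequent), confidence)
```

The prune step is what keeps the search manageable: an item set is only counted if all of its smaller subsets were already found to be frequent.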

Implementation of the OneR/1R/1-Rule classification
The 1R learning algorithm is a rule-based classification algorithm for discrete attributes. It generates a one-level decision tree, expressed as a set of rules that all test one particular attribute. It is a simple, cheap method that often comes up with quite good rules for characterising the structure in the data. In practice, such simple rules frequently achieve surprisingly high accuracy, because a single attribute is often enough to determine the class of an instance quite accurately.
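The idea of building one rule set per attribute and keeping the best one can be sketched as follows. This is a minimal Python sketch, not WEKA's OneR classifier: it handles only discrete (nominal) attributes, and the data set and the function name one_r are invented for illustration.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Return (attribute, rule, errors) for the attribute with fewest errors.

    rule maps each value of the chosen attribute to its majority class.
    """
    best = None
    for attr in attributes:
        # Count class frequencies for every value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        # Each value predicts its most frequent class.
        rule = {
            value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()
        }
        # Errors are the instances that do not belong to the majority class.
        errors = sum(
            sum(counts.values()) - counts[rule[value]]
            for value, counts in by_value.items()
        )
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical data set, invented for illustration.
data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
attribute, rule, errors = one_r(data, ["outlook", "windy"], "play")
print(attribute, errors)  # prints: outlook 1
```

Here the rules on outlook misclassify one of the six instances, while the rules on windy misclassify two, so outlook is chosen.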

Saving graphs and images in WEKA
Generally, we take screenshots of graphs to include the data interpretation and results in a report. WEKA offers a shortcut for this: Alt+Shift+Left-Click. With it, the graph or chart can be saved in BMP, EPS, JPEG or PNG format without taking a manual screenshot.

Though data mining algorithms are incorporated in many other free and open source software products, WEKA is popular because it offers a user friendly, standalone platform for data mining tasks including preprocessing, clustering, regression, classification and visualisation. Although the default input is a flat file, support for other data sources is available through Java Database Connectivity (JDBC). Models in WEKA can be built using the graphical user interface as well as command line input.

WEKA is used by many researchers and practitioners as it offers great flexibility in terms of platform, portability and embedded algorithms. The graphical user interface of WEKA is also very user friendly. Moreover, Java-based IDEs such as Eclipse or NetBeans can be used to customise WEKA. Research scholars often develop their own algorithms and want to test them on WEKA. For this, there are WEKA APIs that can be integrated in the Eclipse IDE for research and development.

Other FOSS for data mining and machine learning

ELKI – Environment for DeveLoping KDD-Applications Supported by Index-Structures
Orange
RapidMiner
Rattle
KNIME (Konstanz Information Miner)
JasperSoft
Scriptella ETL

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.