As the title suggests, this article is for newbies who want to venture into the field of machine learning.
Machine learning (ML) is an application of artificial intelligence that provides a system the ability to learn and improve from experience without being explicitly programmed. It has become very popular because of the high volume of data produced by applications, the increase in computational power in the past few years, and the development of better algorithms.
Machine learning is used in various domains—from automating tedious tasks to offering intelligent insights, industries in every sector try to benefit from it. You may already be using a device that has ML in it—for example, a wearable fitness tracker like Fitbit, or an intelligent home assistant like Google Home.
Machine learning terms can be confusing. Here are definitions of key terms to help you.
Data exploration is the process of gathering information about a large and often unstructured data set in order to find characteristics for focused analysis.
Data mining refers to automated data exploration.
Descriptive analytics is the process of analysing a data set in order to summarise what happened. The vast majority of business analytics—such as sales reports, Web metrics, and social networks analysis—are descriptive.
Predictive analytics is the process of building models from historical or current data in order to forecast future outcomes.
Supervised and unsupervised learning: We will explore these later in the article
Model training and evaluation: A machine learning model is an abstraction of the question you are trying to answer or the outcome you want to predict. Models are trained and evaluated from existing data.
When you train a model from data, you use a known data set and make adjustments to the model based on the data characteristics to get the most accurate answer. In Azure Machine Learning, a model is built from an algorithm module that processes training data and functional modules, such as a scoring module.
In supervised learning, if you’re training a fraud detection model, you use a set of transactions that are labelled as either fraudulent or valid. You split your data set randomly, and use a part to train the model and a part to test or evaluate the model.
Once you have a trained model, evaluate the model using the remaining test data. You use data you already know the outcomes for, so that you can tell whether your model predicts accurately.
Other common machine learning terms
- Algorithm: A self-contained set of rules used to solve problems through data processing, math, or automated reasoning.
- Anomaly detection: A model that flags unusual events or values and helps you discover problems—for example, credit card fraud detection looks for unusual purchases.
- Categorical data: This is data that is organised by categories and can be divided into groups. For example, a categorical data set for autos could specify the year of manufacture, the make, model, and price.
- Classification: This is a model for organising data points into categories based on a data set, for which category groupings are already known.
- Feature engineering: This is the process of extracting or selecting features related to a data set in order to enhance it and improve outcomes. For instance, airfare data could be enhanced by days of the week and holidays.
- Module: This is a functional part in a Machine Learning Studio model, such as the Enter Data module that enables entering and editing small data sets. An algorithm is also a type of module in Machine Learning Studio.
- Model: A supervised learning model is the product of a machine learning experiment comprising training data, an algorithm module, and functional modules, such as a Score Model module.
- Numerical data: This is data that has meaning as measurements (continuous data) or counts (discrete data). It is also referred to as quantitative data.
- Partition: This is the method by which you divide data into samples (see Partition and Sample for more information).
- Prediction: A prediction is a forecast of a value or values from a machine learning model. You might also see the term ‘predicted score’. However, predicted scores are not the final output of a model. An evaluation of the model follows the score.
- Regression: This is a model for predicting a value based on independent variables, such as predicting the price of a car based on its year and make.
- Score: This is a predicted value generated from a trained classification or regression model, using the Score Model module in Machine Learning Studio. Classification models also return a score for the probability of the predicted value. Once you’ve generated scores from a model, you can evaluate the model’s accuracy using the Evaluate Model module.
- Sample: This is a part of a data set intended to be representative of the whole. Samples can be selected randomly or based on specific features of the data set.
According to Arthur Samuel, machine learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed.
The basic principle of machine learning is to build algorithms that can receive input data, and use statistical analysis to predict an output while updating outputs as new data becomes available.
Types of machine learning
Machine learning can be classified into three types of algorithms:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised learning, as the name suggests, involves the presence of a supervisor as trainer. Basically, it is a learning process, with which we train or teach the machine using data that is well structured, i.e., data is tagged with correct answers. After that, the machine is provided with new sets of examples (data) so that the supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from the structured data.
As an example, consider we have a basket full of different vegetables. The first step is to train the machine about all the different vegetables, one after the other:
- If the shape is cylindrical and the colour is brown, then it will be labelled as a potato.
- If the shape is like a cylinder and the colour is red or orange-green, then it will be labelled as a tomato.
The other vegetables are also categorised likewise.
Now, let’s suppose that after training the data, you give a separate vegetable, say, a tomato, and ask that it be identified. Since the machine has already learnt some things from previous data, this time it will have to use what it has learnt, sensibly. It will first categorise the vegetable on the basis of its shape and colour, and will validate the vegetable’s name as ‘Tomato’, putting it in the ‘Tomato’ category. Thus the machine learns certain things from the training data (the basket containing vegetables) and then applies that knowledge to the test data (a new vegetable).
This is the training of the machine using information that is neither classified nor structured, and allowing the algorithm to act on the items without supervision. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training data. As the name suggests, no training is provided to the machine in this case. Therefore, the machine has to learn by itself from unstructured data. As an example, you may want to find groupings of customer demographics with similar buying habits.
This is an area of machine learning that is related to taking suitable action to maximise rewards in a particular situation, i.e., it is used to find the best possible path for particular circumstances. Reinforcement learning differs from supervised learning in that, in the latter case, the training data has the answer key with it; so the model is trained with the correct answer itself. However, in reinforcement learning, there is no answer key, and the reinforcement agent decides what to do to perform the given task. In the absence of a training data set, the machine learns from its experience (Figure 1).
We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following example explains this. The objective of a robot is to get the reward, which is a diamond, and avoid the hurdles, which is fire. The robot learns by trying all the possible paths and then choosing the one that gives him the reward with the least hurdles. Each right step will give the robot a reward and each wrong step will take away a reward from it. The total reward will be calculated when it reaches the final reward, which is the diamond.
This article briefly touches on the fundamental concepts of machine learning. I hope it was helpful and will encourage you enough to delve into the topic.