Exploring Clustering Methods in Machine Learning


In machine learning, clustering is a technique that groups data points. Ideally, data points in a cluster should have similarities in properties or features, which differ from those not in the same cluster. This article discusses two clustering algorithms – k-means and DBSCAN.

One of the most popular techniques in data science is clustering, a machine learning (ML) technique for identifying similar groups of data in a data set. Entities within each group share comparatively more similarities with each other compared to with those from other groups. Clustering means finding clusters in an unsupervised data set. A cluster is a group of data points or objects in a data set that are similar to other objects in the group and dissimilar to data points in other clusters. Clustering is useful in several exploratory pattern-analysis, grouping, decision-making and machine-learning situations, including data mining, document retrieval, image segmentation and pattern classification. However, generally, the data is unlabelled and the process is unsupervised in clustering.

For example, we can use a clustering algorithm such as k-means to group similar customers as mentioned and assign them to a cluster, based on whether they share similar attributes, such as age, education and so on.

Some of the most popular applications of clustering include market segmentation, medical imaging, image segmentation, anomaly detection, social network analysis and recommendation engines. In banking, analysts examine clusters of normal transactions to find the patterns of fraudulent credit card usage. Also, they use clustering to identify clusters of customers – for instance, to find loyal customers versus customers who have stopped using their services.

Figure 1: k-means clustering

In medicine, clustering can be used to characterise patient behaviour, based on similar characteristics, so as to identify successful medical therapies for different illnesses. In biology, clustering is used to group genes with similar expression patterns or to cluster genetic markers to identify family ties. If you look around, you can find many other applications of clustering, but generally, it can be used for one of the following purposes: exploratory data analysis, summary generation or reducing the scale, outlier detection (especially to be used for fraud detection or noise removal), finding duplicates and data sets; or as a pre-processing step for either prediction, other data mining tasks or as part of a complex system.

Figure 2: Euclidean distance

A brief look at different clustering algorithms and their characteristics


Please enter your comment!
Please enter your name here