Everything you need to know about KMeans

4 min readJun 15, 2021

Remember my Machine Learning blog where I explained about Unsupervised Learning. I talked about how can we segregate/group people based on their hobbies or gender. One of the essential parts of Unsupervised Learning is Clustering. Let’s learn about the most straightforward clustering algorithm today, KMeans Algorithm. Before diving into KMeans directly, let’s understand a bit about Clustering.

Clustering is the task of grouping a set of objects in the same group (cluster) that are more similar to each other than those in other groups (clusters). Our goal is to segregate groups with similar traits and assign them to a group/cluster. For example, we can segregate two groups here, one related to gender and the other to hobbies.

Clustering is an Unsupervised Machine Learning problem because we deal with no target variable datasets here. We look at the data and then try to club similar observations and form different groups.

KMeans

KMeans clustering is the most straightforward and popular unsupervised Machine Learning algorithm. It searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. The procedure to perform this algorithm is effortless. The main idea is to define k centres, one for each cluster. We pick those data points in a cluster that are closest to each other. The main objective is to minimize the distances between the data points and their respective cluster centroids.

Intuition

This image shows what we want to achieve after performing KMeans Clustering to our dataset. The steps to perform this algorithm are as follows:

Initialize a k value. It is all about choosing the number of clusters initially/randomly. Let’s choose our K-value as 2 for now.
Initialize k random points from the data as centroids. If we have chosen our k-value as 2, then we’ll choose two centroids.
Find out the distance between 2 points and figure out the nearest points using the Euclidean Distance formula.
Select the group and recompute the centroids of newly formed clusters.
Till now, it was only one iteration. We then repeat the 3rd and the 4th step.

When should we know that it’s time to stop the KMeans algorithm?

We can stop the algorithm when centroids of the newly formed clusters do not change, the maximum number of iterations is reached, and points remain in the same cluster and don’t change.

How do we figure out the optimum number k value? Or How do we decide the optimal number of clusters for a dataset?

Here comes the concept of “The Elbow Method”. It is used in determining the number of clusters in a dataset. The method consists of plotting the number of clusters and picking the elbow as the number of clusters to use. Here the elbow means the point where we see an abrupt/sudden decrease. We’ll be easily able to understand the Elbow method when I show the implementation of KMeans in the upcoming blog.

Assumptions

The assumptions are:

The cluster centre is the arithmetic mean of all the points belonging to the cluster.
Each point is closer to its cluster centre than to other cluster centres.

Applications

Market segmentation, where existing and potential customers are divided into sub-groups of customers based on some shared characteristics.
Feature Learning.
Computer Vision.

Advantages

Simple to implement.
If variables are enormous, then it performs faster than hierarchical Clustering.

Disadvantages

We are choosing k manually.
It is very sensitive to outliers and noisy data.

The blog talks about the intuition behind KMeans. The following blog will implement KMeans on a dataset using Python Programming Language and the robust mathematical approach behind KMeans. This blog gave a gist about KMeans without any Mathematical intuition.