
  • Clustering is an unsupervised approach which finds a structure or pattern in a collection of unlabeled data.

  • A cluster is a collection of objects that are similar to one another and dissimilar to the objects in other clusters.

  • In other words, clustering is a data analysis technique that groups similar observations (data points) into homogeneous subgroups.

  • The goal is to identify patterns or structures within the data, making it easier to understand and analyze.

  • By organizing data into clusters, you can gain insights into the underlying structure and relationships, which can be valuable for various applications such as customer segmentation, anomaly detection, and pattern recognition.

  • Suppose we give a child a set of objects to group. How does the child form groups? The child may group them by colour, by shape, by hardness or softness, and so on.

  • In the figure below, we can easily identify 4 different clusters. The criterion here is distance: points that are near each other are placed in the same cluster, and distant points belong to different clusters.
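
To make the distance criterion concrete, here is a minimal sketch using only the standard library. The four points are hypothetical and chosen so that two lie near the origin and two lie near (5, 5). For each point it finds the nearest other point by Euclidean distance.

```python
import math

# Hypothetical 2-D points: two near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]

def nearest_neighbor(index, pts):
    """Return the index of the point closest to pts[index] (Euclidean distance)."""
    others = [j for j in range(len(pts)) if j != index]
    return min(others, key=lambda j: math.dist(pts[index], pts[j]))

# Each point's nearest neighbour lies in its own group.
neighbors = [nearest_neighbor(i, points) for i in range(len(points))]
print(neighbors)
```

Points whose nearest neighbours are each other form natural groups, which is the intuition the distance-based clustering methods below build on.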

Types of Clustering:

  1. K-Means Clustering
  2. DBSCAN
  3. Hierarchical clustering

K-Means Clustering:

  • An unsupervised machine learning technique that identifies clusters of data objects in a dataset.

  • K specifies the number of clusters.

  • Steps involved in K-Means Clustering:

    1. Choose the number of clusters, K.
    2. Start with K centroids placed at random positions (not necessarily points from your dataset). A centroid is a data point (imaginary or real) at the center of a cluster.
    3. Compute the distance from every point to each centroid.
    4. Assign each point to its closest centroid. This forms K clusters.
    5. Compute and place the new centroid of each cluster, i.e. calculate the mean of the points in each cluster and update the cluster mean: c_i = (1/N_i) Σ x_j, summed over the points x_j in cluster i.

    – c_i is the new centroid for cluster “i.”
    – N_i is the number of data points in cluster “i.”
    – x_j represents the coordinates of each data point in cluster “i.”

    6. Repeat the process, i.e. go back to step 3 and reassign each data point to its new closest centroid.
    7. Stop when the assignments no longer change, i.e. when the centroids have settled and the clusters are stable.
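
The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not scikit-learn's implementation. The function name `kmeans`, its parameters, and the toy data are assumptions, and it initializes the centroids by sampling K data points, which is one common choice among several.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of its points (c_i = (1/N_i) * sum of x_j).
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # leave the centroid in place if its cluster is empty
                centroids[i] = members.mean(axis=0)
    return centroids, labels

# Toy data (hypothetical): two well-separated groups of three points each.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = kmeans(X_toy, k=2)
print(centroids)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.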

Python Implementation for K-Means Clustering:

Problem Statement: Cluster the iris flowers into groups using the iris dataset.

# import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Load the dataset

data1 = sns.load_dataset('iris')
data1.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Basic Checks:

data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
data1.describe()
      sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Skipping EDA, data preprocessing, and feature engineering.

Feature selection:

# As this is unsupervised learning, we use only the input variables (the species label is dropped)

X = data1.iloc[:,0:-1]
X.head()
   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

Model Creation

from sklearn.cluster import KMeans

model1 = KMeans(n_clusters=4,random_state=24)

label1 = model1.fit_predict(X)
label1
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0, 3, 0, 3, 0, 3, 3, 3, 3, 0, 3, 0,
       0, 3, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 3, 0, 0, 0,
       3, 3, 3, 0, 3, 3, 3, 3, 3, 0, 3, 3, 2, 0, 2, 2, 2, 2, 3, 2, 2, 2,
       0, 0, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 0], dtype=int32)
df = pd.DataFrame({'Predicted label':label1,'Actual label':data1['species']})
df.head(20)

   Predicted label Actual label
0                 1       setosa
1                 1       setosa
2                 1       setosa
3                 1       setosa
4                 1       setosa
5                 1       setosa
6                 1       setosa
7                 1       setosa
8                 1       setosa
9                 1       setosa
10                1       setosa
11                1       setosa
12                1       setosa
13                1       setosa
14                1       setosa
15                1       setosa
16                1       setosa
17                1       setosa
18                1       setosa
19                1       setosa
# Centroids of the clusters
model1.cluster_centers_
array([[6.23658537, 2.85853659, 4.80731707, 1.62195122],
       [5.006     , 3.428     , 1.462     , 0.246     ],
       [6.9125    , 3.1       , 5.846875  , 2.13125   ],
       [5.52962963, 2.62222222, 3.94074074, 1.21851852]])
# Set colours for the clusters to tell them apart (not required in the main implementation)

color_scheme = np.array(['red','blue','green','yellow','pink','cyan'])
color_scheme
array(['red', 'blue', 'green', 'yellow', 'pink', 'cyan'], dtype='<U6')
data1['species'].value_counts()
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
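# Convert the categorical species labels to integer codes (used below to index color_scheme).
# map() returns an integer column; replace() may leave an object column on newer pandas versions.

data1['species'] = data1['species'].map({"setosa":1,"versicolor":2,"virginica":0})
data1['species'].value_counts()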

1    50
2    50
0    50
Name: count, dtype: int64
# Visualize the clusters in the original data (not required in the main implementation)

plt.scatter(data1.petal_length,data1.petal_width,color=color_scheme[data1.species]) #Original
plt.show()

# Visualize the clusters formed by the model (not required in the main implementation)

plt.scatter(X.petal_length,X.petal_width,color=color_scheme[model1.labels_])  #After Training
plt.show()

Elbow Technique:

  • It is the most popular method for determining the optimal value of K.

How it works

  • Start with some K

  • Calculate the WCSS (Within-Cluster Sum of Squares), i.e. the sum of squared errors within each cluster.

    Mathematically, the formula for WCSS is as follows:

    WCSS = Σ_{i=1..K} Σ_{x ∈ C_i} ||x – c_i||^2


    WCSS is the Within-Cluster Sum of Squares.

    The outer Σ sums over the K clusters, and the inner Σ sums over the data points x in cluster C_i.

    ||x – c_i||^2 is the squared Euclidean distance between a data point “x” and the centroid “c_i” of its assigned cluster.

  • That is, compute the distance of each data point from its centroid, square it, and sum the results over all clusters:

    WCSS = WCSS_1 + WCSS_2 + … + WCSS_K

  • Take a new value of K and repeat the WCSS calculation.

  • For each number of K, WCSS is calculated.

  • Find the elbow point. That is the optimal value of K.

  • As the image shows, the error (WCSS) decreases as the number of clusters K increases.
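
As a small worked example of the WCSS formula, the sketch below computes it with NumPy for four hypothetical points, first with two clusters and then with a single cluster. The function name `wcss_value` and the data are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def wcss_value(X, labels, centroids):
    """Sum over all points of the squared distance to the assigned centroid."""
    diffs = X - centroids[labels]  # each point minus its own cluster's centroid
    return float(np.sum(diffs ** 2))

# Hypothetical data: two points near x = 0 and two points near x = 10.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 2: every point is at distance 1 from its centroid, so WCSS = 1 + 1 + 1 + 1.
wcss_k2 = wcss_value(X_toy, np.array([0, 0, 1, 1]), np.array([[0.0, 1.0], [10.0, 1.0]]))

# K = 1: the single centroid is the overall mean (5, 1); each squared distance is 25 + 1.
wcss_k1 = wcss_value(X_toy, np.array([0, 0, 0, 0]), np.array([[5.0, 1.0]]))

print(wcss_k1, wcss_k2)  # 104.0 4.0
```

The drop from 104 to 4 when moving from one cluster to two is the kind of sharp decrease the elbow plot is meant to reveal.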

Determine K using elbow method:

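# Create an empty list to hold the WCSS values

wcss = []

for i in range(1,11):
  model = KMeans(n_clusters = i,random_state=24)  # separate name so the fitted model1 is not overwritten
  model.fit(X)  # the model must be fitted before inertia_ is available
  wcss.append(model.inertia_) # inertia_ is the sum of squared distances from each point to its closest centroid

# Plotting WCSS graph
plt.plot(range(1,11), wcss, marker='o')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS or Inertia')
plt.show()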

  • To determine the optimal number of clusters, select the value of K at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion.
  • We can take 4 here as the optimal value of K.

Model Evaluation

from sklearn.metrics import silhouette_score
# The silhouette score ranges from -1 to 1; higher values indicate better-separated clusters
score = silhouette_score(X,label1)
score
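
To show what the silhouette score measures, here is a minimal NumPy sketch of the standard definition: for each point, s = (b − a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The function name `silhouette` and the toy data are assumptions. In practice, `silhouette_score` from scikit-learn is the call to use.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other points in the same cluster (self-distance is 0).
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(dists[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical data: two points near x = 0 and two points near x = 10.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = silhouette(X_toy, np.array([0, 0, 1, 1]))  # groups nearby points together
bad = silhouette(X_toy, np.array([0, 1, 0, 1]))   # pairs each point with a distant one
print(good, bad)
```

A labeling that groups nearby points scores close to 1, while one that pairs distant points scores below 0, which is why a higher score indicates a better clustering.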


