Clustering:
Clustering is an unsupervised approach which finds structure or patterns in a collection of unlabeled data.
A cluster is a collection of objects which are similar amongst themselves and dissimilar to the objects belonging to other clusters.
In other words, clustering identifies homogeneous subgroups among the observations: similar observations or data points are grouped together.
The goal is to identify patterns or structures within the data, making it easier to understand and analyze.
By organizing data into clusters, you can gain insights into the underlying structure and relationships, which can be valuable for various applications such as customer segmentation, anomaly detection, and pattern recognition.
Let’s suppose we give a child different objects to group. How does the child form groups? The child may group by colour, by shape, by the hardness or softness of the objects, etc.
In the figure below, we can easily identify 4 different clusters. The criterion here is distance: points that are near each other are kept in the same cluster, and distant points belong to different clusters.
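As a small worked example of this distance criterion, the sketch below (an illustration added here, not part of the original figure) computes pairwise Euclidean distances between four 2-D points; the two pairs of nearby points would naturally fall into two clusters.
import numpy as np

# Four 2-D points: the first two are close together, the last two are close together
points = np.array([[1.0, 1.0],
                   [1.2, 0.9],
                   [8.0, 8.0],
                   [8.3, 7.8]])

# Pairwise Euclidean distance matrix; small entries indicate points likely to share a cluster
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(np.round(distances, 2))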
Types of Clustering:
- K-Means Clustering
- DBSCAN
- Hierarchical clustering
K-Means Clustering:
An unsupervised machine learning technique to identify clusters of data objects in the dataset.
K specifies the number of clusters.
Steps involved in K-Means Clustering:
- First, choose the number of clusters, K.
- Start with K centroids by placing them at random positions (not necessarily points from your dataset). A centroid is a data point (imaginary or real) at the center of a cluster.
- Assign each point to the closest centroid. That forms K clusters.
- Compute the distance of every point from the centroid.
- Compute and place the new centroid for each cluster, i.e., calculate the mean value of the objects in each cluster and update the cluster mean: c_i = (Σ_j x_j) / N_i
Where:
– c_i is the new centroid for cluster “i.”
– N_i is the number of data points in cluster “i.”
– x_j represents the coordinates of each data point in cluster “i.”
- Repeat the process, i.e., reassign each data point to the new closest centroid, and go to step 4.
- If there is no change, i.e., the cluster assignments have stabilized and the clusters form a clear boundary, then stop. A minimal sketch of these steps is shown below.
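The following is a minimal NumPy sketch of these steps, included for illustration only (it is a toy version, not the scikit-learn implementation used in the next section; the function name kmeans_sketch is hypothetical):
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=24):
    # X is expected to be a 2-D NumPy array of shape (n_samples, n_features)
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick K points from the data as initial centroids (for simplicity)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 3-4: compute the distance of every point to every centroid
        # and assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = distances.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of the points assigned to it
        # (assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Steps 6-7: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids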
Python Implementation for K-Means Clustering:
Problem Statement: We need to cluster iris flowers into groups using the iris dataset.
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
data1 = sns.load_dataset('iris')
data1.head()
Output:
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa
Basic Checks:
data1.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
data1.describe()
Output:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Skipping EDA, data preprocessing, and feature engineering.
Feature selection:
# As this is unsupervised machine learning, we take only the input variables (the species label is dropped)
X = data1.iloc[:,0:-1]
X.head()
Output:
   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
Model Creation
from sklearn.cluster import KMeans
model1 = KMeans(n_clusters=4,random_state=24)
label1 = model1.fit_predict(X)
label1
Output: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0, 3, 0, 3, 0, 3, 3, 3, 3, 0, 3, 0, 0, 3, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 3, 0, 3, 3, 3, 3, 3, 0, 3, 3, 2, 0, 2, 2, 2, 2, 3, 2, 2, 2, 0, 0, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 0], dtype=int32)
model1.inertia_
Output: 57.25600931571815
df= pd.DataFrame({'Predicted label':label1,'Actual label':data1['species']})
df.head(20)
Output:
    Predicted label Actual label
0                 1       setosa
1                 1       setosa
2                 1       setosa
3                 1       setosa
4                 1       setosa
5                 1       setosa
6                 1       setosa
7                 1       setosa
8                 1       setosa
9                 1       setosa
10                1       setosa
11                1       setosa
12                1       setosa
13                1       setosa
14                1       setosa
15                1       setosa
16                1       setosa
17                1       setosa
18                1       setosa
19                1       setosa
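To summarise how the predicted clusters line up with the actual species across all 150 rows, a cross-tabulation can be used (a small sketch using the df defined above):
# Rows: predicted cluster labels, columns: actual species
pd.crosstab(df['Predicted label'], df['Actual label'])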
# Centroids of the clusters
model1.cluster_centers_
Output:
array([[6.23658537, 2.85853659, 4.80731707, 1.62195122],
       [5.006     , 3.428     , 1.462     , 0.246     ],
       [6.9125    , 3.1       , 5.846875  , 2.13125   ],
       [5.52962963, 2.62222222, 3.94074074, 1.21851852]])
# Set colours for the clusters to differentiate them (not required in the main implementation)
color_scheme = np.array(['red','blue','green','yellow','pink','cyan'])
color_scheme
Output: array(['red', 'blue', 'green', 'yellow', 'pink', 'cyan'], dtype='<U6')
data1.species.value_counts()
Output:
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
# Converting the categorical species labels into numerical codes (used for colour-coding the plots below)
data1['species'] = data1['species'].replace({"setosa":1,"versicolor":2,"virginica":0})
data1.species.value_counts()
Output:
species
1    50
2    50
0    50
Name: count, dtype: int64
# Visualize the clusters in the original data (not required in the main implementation)
plt.scatter(data1.petal_length, data1.petal_width, color=color_scheme[data1.species])  # Original species
# Visualize the clusters formed by the model (not required in the main implementation)
plt.scatter(X.petal_length, X.petal_width, color=color_scheme[model1.labels_])  # After training
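Optionally, the learned centroids can be overlaid on the same petal_length vs petal_width plot. This is a small sketch (not required in the main implementation); it assumes cluster_centers_ keeps the column order of X, so indices 2 and 3 correspond to petal_length and petal_width.
# Clusters found by the model, with centroids marked as black crosses
plt.scatter(X.petal_length, X.petal_width, color=color_scheme[model1.labels_])
plt.scatter(model1.cluster_centers_[:, 2], model1.cluster_centers_[:, 3],
            color='black', marker='x', s=100)
plt.xlabel('petal_length')
plt.ylabel('petal_width')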
Elbow Technique:
- It is the most popular method used to determine the optimal value of K.
How it works:
Start with some value of K.
Calculate WCSS (Within-Cluster Sum of Squared errors), i.e., the sum of squared errors for each of the clusters.
Mathematically, the formula for WCSS is as follows:
WCSS = Σ (Σ ||x – c||^2)
Where:
WCSS is the Within-Cluster Sum of Squares.
Σ represents the summation symbol, which means summing up the values for all data points and clusters.
||x – c||^2 is the squared Euclidean distance between a data point “x” and the centroid “c” of its assigned cluster.
It calculates the distance of individual data points from their centroid, then squares it and sums it up.
WCSS = WCSS_1 + WCSS_2 + … + WCSS_K
Take a new value of K and repeat step 2.
For each value of K, WCSS is calculated.
Find the elbow point. That is the optimal value of K.
As the plot shows, as K increases, the error (WCSS) decreases. A quick sanity check of the WCSS formula against model1.inertia_ is shown below.
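The following is a small sketch that recomputes WCSS by hand from the fitted 4-cluster model above and compares it with model1.inertia_ (it assumes model1 is still the 4-cluster fit from the Model Creation step):
# Centroid assigned to each data point (shape: 150 x 4)
assigned_centroids = model1.cluster_centers_[model1.labels_]
# Sum of squared distances of every point to its assigned centroid
wcss_manual = ((X.values - assigned_centroids) ** 2).sum()
print(wcss_manual, model1.inertia_)   # both should be approximately 57.26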
Determine K using elbow method:
# create an empty list to store the WCSS value for each K
wcss = []
for i in range(1,11):
    model1 = KMeans(n_clusters=i, random_state=24)
    model1.fit(X)
    wcss.append(model1.inertia_)  # .inertia_ gives the sum of squared distances between each point and its closest centroid
wcss
Output: [681.3706, 152.3479517603579, 78.851441426146, 57.22847321428572, 46.47223015873017, 39.066035353535355, 34.46437570762571, 30.06459307359308, 27.89549464570518, 26.871427814624923]
# Plotting the WCSS values against the number of clusters
plt.plot(range(1,11), wcss, marker='o')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS or Inertia')
- To determine the optimal number of clusters, we select the value of K at the “elbow”, i.e., the point after which the distortion/inertia starts decreasing in a roughly linear fashion.
- We can take K = 4 as the optimal value here.
Model Evaluation
from sklearn.metrics import silhouette_score
score = silhouette_score(X,label1)
print(score)
Output: 0.49805050499728815
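The silhouette score lies between -1 and 1, with higher values indicating better-separated clusters. As an additional check on the choice of K, the score can be computed for a range of K values and compared with the elbow estimate (a small sketch; the score needs at least 2 clusters):
# Silhouette score for different numbers of clusters
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=24)
    labels_k = km.fit_predict(X)
    print(k, silhouette_score(X, labels_k))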