Clustering:
Clustering is an unsupervised approach that finds structure or patterns in a collection of unlabeled data.
A cluster is a collection of objects that are similar to one another and dissimilar to the objects in other clusters.
In other words, clustering groups similar observations or data points into homogeneous subgroups.
The goal is to identify patterns or structures within the data, making it easier to understand and analyze.
By organizing data into clusters, you can gain insight into its underlying structure and relationships, which is valuable for applications such as customer segmentation, anomaly detection, and pattern recognition.
Suppose we give a child different objects to group. How does the child form the groups? The child may group by colour, by shape, by hardness or softness, and so on.
In the figure below, we can easily identify 4 different clusters. The criterion here is distance: points that are near each other are kept in the same cluster, and distant points belong to different clusters.
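To make the distance criterion concrete, here is a minimal sketch that computes pairwise Euclidean distances for four made-up 2-D points. The points and variable names are illustrative assumptions, not data from this article.

```python
import numpy as np

# Hypothetical points: two near the origin, two near (10, 10).
points = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]])

# Pairwise Euclidean distances via broadcasting.
# dist[i, j] is the distance between point i and point j.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

print(np.round(dist, 2))
```

Points 0 and 1 are much closer to each other than to points 2 and 3, so a distance-based method would place them in the same cluster.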
Types of Clustering:
- K-Means Clustering
- DBSCAN
- Hierarchical clustering
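As a rough illustration, the three approaches can be tried side by side with scikit-learn. The synthetic data, the blob centers, and the `eps` / `min_samples` values below are assumptions chosen for the demo, not recommended settings.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data: three blobs around hand-picked centers (illustrative only).
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.5,
    random_state=0,
)

# K-Means and hierarchical (agglomerative) clustering need the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

# DBSCAN infers the number of clusters from density; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_demo)

for name, labels in [("KMeans", kmeans_labels),
                     ("DBSCAN", dbscan_labels),
                     ("Hierarchical", hier_labels)]:
    n_found = len(set(labels) - {-1})
    print(f"{name}: {n_found} clusters")
```

The main practical difference shown here is that K-Means and hierarchical clustering are told how many clusters to produce, while DBSCAN decides this from the density parameters.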
K-Means Clustering:
K-Means is an unsupervised machine learning technique that identifies clusters of data objects in a dataset.
K specifies the number of clusters.
Steps involved in K-Means Clustering:
- First choose the number of clusters, K.
- Start with K centroids placed at random positions (not necessarily points from your dataset). A centroid is a data point (imaginary or real) at the center of a cluster.
- Compute the distance from every point to each centroid.
- Assign each point to its closest centroid. This forms K clusters.
- Compute and place the new centroid for each cluster, i.e. calculate the mean of the points in each cluster and update the cluster mean: c_i = (1/N_i) Σ x_j, summing over the points x_j in cluster i.
Where:
– c_i is the new centroid for cluster “i.”
– N_i is the number of data points in cluster “i.”
– x_j represents the coordinates of each data point in cluster “i.”
- Repeat the process, i.e. go back to step 3: recompute the distances, reassign each data point to its new closest centroid, and update the centroids.
- If there is no change, i.e. no point switches cluster and the centroids stop moving, then stop.
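The steps above can be sketched in plain NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans_sketch` is hypothetical, and it simplifies the initialisation by picking K existing data points as the starting centroids.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids as k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its closest centroid.
        new_labels = dists.argmin(axis=1)
        # Stop when no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: c_i is the mean of the points in cluster i.
        # An empty cluster keeps its previous centroid.
        centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
    return centroids, labels

# Toy data: two made-up groups around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(centroids)
print(labels)
```

On this toy data the two groups are far apart, so the loop settles after a few iterations with one centroid near each group.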
Python Implementation for K-Means Clustering:
Problem Statement: Cluster the flowers in the iris dataset into groups.
    # Import necessary libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings

    warnings.filterwarnings('ignore')
    # Load the dataset
    data1 = sns.load_dataset('iris')
    data1.head()

Output:

       sepal_length  sepal_width  petal_length  petal_width species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa
Basic Checks:
    data1.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150 entries, 0 to 149
    Data columns (total 5 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   sepal_length  150 non-null    float64
     1   sepal_width   150 non-null    float64
     2   petal_length  150 non-null    float64
     3   petal_width   150 non-null    float64
     4   species       150 non-null    object
    dtypes: float64(4), object(1)
    memory usage: 6.0+ KB
    data1.describe()

Output:

           sepal_length  sepal_width  petal_length  petal_width
    count    150.000000   150.000000    150.000000   150.000000
    mean       5.843333     3.057333      3.758000     1.199333
    std        0.828066     0.435866      1.765298     0.762238
    min        4.300000     2.000000      1.000000     0.100000
    25%        5.100000     2.800000      1.600000     0.300000
    50%        5.800000     3.000000      4.350000     1.300000
    75%        6.400000     3.300000      5.100000     1.800000
    max        7.900000     4.400000      6.900000     2.500000
EDA, data preprocessing, and feature engineering are skipped here.
Feature selection:
    # As this is unsupervised machine learning, we take the input variables only
    # (every column except the last one, species)
    X = data1.iloc[:, :-1]
    X.head()

Output:

       sepal_length  sepal_width  petal_length  petal_width
    0           5.1          3.5           1.4          0.2
    1           4.9          3.0           1.4          0.2
    2           4.7          3.2           1.3          0.2
    3           4.6          3.1           1.5          0.2
    4           5.0          3.6           1.4          0.2
Model Creation
    from sklearn.cluster import KMeans

    # Fit K-Means with 4 clusters and get the cluster label of each row
    model1 = KMeans(n_clusters=4, random_state=24)
    label1 = model1.fit_predict(X)
    label1
Output: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0, 3, 0, 3, 0, 3, 3, 3, 3, 0, 3, 0, 0, 3, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 3, 0, 3, 3, 3, 3, 3, 0, 3, 3, 2, 0, 2, 2, 2, 2, 3, 2, 2, 2, 0, 0, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 0], dtype=int32)
    model1.inertia_
Output: 57.25600931571815
    # Compare the predicted cluster labels with the actual species
    df = pd.DataFrame({'Predicted label': label1, 'Actual label': data1['species']})
    df.head(20)

Output:

        Predicted label Actual label
    0                 1       setosa
    1                 1       setosa
    2                 1       setosa
    3                 1       setosa
    4                 1       setosa
    5                 1       setosa
    6                 1       setosa
    7                 1       setosa
    8                 1       setosa
    9                 1       setosa
    10                1       setosa
    11                1       setosa
    12                1       setosa
    13                1       setosa
    14                1       setosa
    15                1       setosa
    16                1       setosa
    17                1       setosa
    18                1       setosa
    19                1       setosa
    # Centroids of the clusters
    model1.cluster_centers_

Output:

    array([[6.23658537, 2.85853659, 4.80731707, 1.62195122],
           [5.006     , 3.428     , 1.462     , 0.246     ],
           [6.9125    , 3.1       , 5.846875  , 2.13125   ],
           [5.52962963, 2.62222222, 3.94074074, 1.21851852]])
    # Set colours for the clusters to differentiate them (not required in the main implementation)
    color_scheme = np.array(['red', 'blue', 'green', 'yellow', 'pink', 'cyan'])
    color_scheme
Output: array(['red', 'blue', 'green', 'yellow', 'pink', 'cyan'], dtype='<U6')
    data1.species.value_counts()

Output:

    species
    setosa        50
    versicolor    50
    virginica     50
    Name: count, dtype: int64
    # Convert the categorical species labels into integers so they can index color_scheme.
    # map() is used here because it returns an integer column directly.
    data1['species'] = data1['species'].map({"setosa": 1, "versicolor": 2, "virginica": 0})
    data1.species.value_counts()

Output:

    species
    1    50
    2    50
    0    50
    Name: count, dtype: int64
    # Visualize the clusters in the original data (not required in the main implementation)
    plt.scatter(data1.petal_length, data1.petal_width, color=color_scheme[data1.species])  # Original
    # Visualize the clusters formed by the model (not required in the main implementation)
    plt.scatter(X.petal_length, X.petal_width, color=color_scheme[model1.labels_])  # After training
Elbow Technique:
- It is the most popular method for determining the optimal value of K.
How it works:
Start with some K.
Calculate the WCSS (Within-Cluster Sum of Squares), i.e. the sum of squared errors within each cluster.
Mathematically, the formula for WCSS is as follows:
WCSS = Σ_k (Σ_x ||x - c_k||^2)
Where:
WCSS is the Within-Cluster Sum of Squares.
The outer Σ sums over all K clusters, and the inner Σ sums over the data points assigned to cluster k.
||x - c_k||^2 is the squared Euclidean distance between a data point “x” and the centroid “c_k” of its assigned cluster.
In other words, it calculates the distance of each data point from its centroid, squares it, and sums the results.
WCSS = WCSS_1 + WCSS_2 + … + WCSS_K
Take a new value of K and repeat step 2.
For each value of K, WCSS is calculated.
Find the elbow point. That is the optimal value of K.
As the image shows, the error decreases as K increases.
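The WCSS formula can be checked by hand against scikit-learn's `inertia_` attribute. The sketch below is an illustration under two assumptions: it uses the copy of the Iris measurements bundled with scikit-learn (`load_iris`) rather than the seaborn dataset, and it picks K = 3 and `random_state=0` arbitrarily.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data  # four numeric measurements per flower

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_iris)

# WCSS: for each cluster, sum the squared distances from its points
# to its centroid, then add the per-cluster sums together.
wcss_manual = 0.0
for i, center in enumerate(km.cluster_centers_):
    members = X_iris[km.labels_ == i]
    wcss_manual += ((members - center) ** 2).sum()

print(wcss_manual, km.inertia_)
```

The two printed numbers should agree up to floating-point rounding, which is why the elbow code can simply read `inertia_` for each K.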
Determine K using elbow method:
    # Create an empty list for the WCSS values
    wcss = []
    for i in range(1, 11):
        # Use a separate name so the fitted model1 above is not overwritten
        model = KMeans(n_clusters=i, random_state=24)
        model.fit(X)
        # .inertia_ is the sum of squared distances from each point to its closest centroid
        wcss.append(model.inertia_)
    wcss
Output: [681.3706, 152.3479517603579, 78.851441426146, 57.22847321428572, 46.47223015873017, 39.066035353535355, 34.46437570762571, 30.06459307359308, 27.89549464570518, 26.871427814624923]
    # Quick look at the curve; here the x-axis is the list index (0 to 9), not K
    plt.plot(wcss, marker='o')
    # Plot the WCSS graph against the number of clusters
    plt.plot(range(1, 11), wcss, marker='o')
    plt.title('The Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS or Inertia')
- To determine the optimal number of clusters, we select the value of K at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion.
- We take K = 4 here as the optimal value.
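One way to support the visual reading is to look at how much WCSS drops each time K increases by one. The sketch below reuses the WCSS values printed in the output above; the `drops` list is just a helper for illustration.

```python
# WCSS values copied from the elbow output above (K = 1 ... 10).
wcss = [681.3706, 152.3479517603579, 78.851441426146, 57.22847321428572,
        46.47223015873017, 39.066035353535355, 34.46437570762571,
        30.06459307359308, 27.89549464570518, 26.871427814624923]

# Reduction in WCSS when moving from K-1 to K clusters.
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]

for k, drop in zip(range(2, 11), drops):
    print(f"K={k}: WCSS drops by {drop:.2f}")
```

The drops shrink steadily, with the largest ones occurring up to K = 3, so exactly where the elbow lies remains a judgment call rather than a computed answer.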
Model Evaluation
    from sklearn.metrics import silhouette_score

    # Silhouette score of the 4-cluster labels found earlier
    score = silhouette_score(X, label1)
    print(score)
Output: 0.49805050499728815
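The silhouette score ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. As a hedged sketch, the same metric can be compared across several values of K. It assumes the Iris copy bundled with scikit-learn and an arbitrary `random_state=0`, so the numbers may differ slightly from the run above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_iris = load_iris().data

# Silhouette score for each candidate K (the score needs at least 2 clusters).
sil_scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_iris)
    sil_scores[k] = silhouette_score(X_iris, labels_k)

for k, s in sil_scores.items():
    print(f"K={k}: silhouette={s:.3f}")
```

Reading this alongside the elbow plot gives a second opinion on K, since the two criteria do not always point to the same value.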