Model Evaluation for Unsupervised Machine Learning

To evaluate the quality of clustering in unsupervised machine learning, the Silhouette Score is a commonly used metric. It measures how similar each data point is to its own cluster (cohesion) compared with how similar it is to the nearest neighboring cluster (separation).

Formula: The Silhouette Score for a single data point i is as follows:

        S(i) = (b(i) - a(i)) / max{a(i), b(i)}
Where:

- S(i) is the silhouette score for data point i.
- a(i) is the average distance from data point i to all other data points in the same cluster (cohesion).
- b(i) is the smallest of the average distances from data point i to the data points of each other cluster, i.e. the average distance to the nearest neighboring cluster (separation).
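As a rough illustration of the formula, the sketch below computes a(i), b(i) and S(i) by hand with NumPy. It assumes Euclidean distance, and the cluster labels are assigned by hand for the example. The helper name `silhouette_for_point` is hypothetical and not part of any library. For a cluster containing a single point, a(i) is undefined; the sketch returns S(i) = 0 in that case, which is the convention scikit-learn follows.

```python
import numpy as np

def silhouette_for_point(data, labels, i):
    """Return (a, b, s) for the point at index i, using Euclidean distance."""
    # Distance from point i to every point (including itself, which is 0)
    distances = np.linalg.norm(data - data[i], axis=1)
    own = labels == labels[i]

    # b(i): smallest mean distance to the points of any other cluster
    b = min(
        distances[labels == c].mean()
        for c in np.unique(labels)
        if c != labels[i]
    )

    # Singleton cluster: a(i) is undefined, so use the s = 0 convention
    n_own = own.sum()
    if n_own == 1:
        return 0.0, b, 0.0

    # a(i): mean distance to the OTHER points in the same cluster
    a = distances[own].sum() / (n_own - 1)
    s = (b - a) / max(a, b)
    return a, b, s

# Illustrative data with hand-assigned labels (an assumption for this sketch)
data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

a, b, s = silhouette_for_point(data, labels, 0)
print(f"a = {a:.4f}, b = {b:.4f}, S = {s:.4f}")
```

For the point [1, 2], the two other members of its cluster are each at distance 2, so a = 2, and S follows from plugging a and b into the formula.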

Interpretation: The Silhouette Score ranges from -1 to 1. It is interpreted as follows:

      01. A high positive value (close to 1) indicates that the data point is well-matched to its own cluster and poorly matched to neighboring clusters.
      02. A score near 0 suggests that the data point is on or very close to the decision boundary between two neighboring clusters.
      03. A negative score (toward -1) indicates that the data point may have been assigned to the wrong cluster, because it is closer on average to a neighboring cluster than to its own.
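To apply this interpretation point by point, the per-sample scores can be inspected with scikit-learn's `silhouette_samples`. The sketch below is only illustrative: the labels are assigned by hand, and the `describe` helper and its `tol` threshold of 0.1 for "near 0" are assumptions, not standard values.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Illustrative data with hand-assigned labels (an assumption for this sketch)
data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# One silhouette value per data point
scores = silhouette_samples(data, labels)

def describe(score, tol=0.1):
    """Map a silhouette value to a rough verbal reading (tol is an arbitrary choice)."""
    if score < 0:
        return "possibly misassigned"
    if score <= tol:
        return "near a cluster boundary"
    return "well matched"

for point, label, score in zip(data, labels, scores):
    print(f"point {point} in cluster {label}: {score:.3f} ({describe(score)})")
```

The overall Silhouette Score of a clustering is the mean of these per-point values.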

Python Code: 

# Import necessary libraries
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Generate some example data for clustering
data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Specify the number of clusters you want to test
n_clusters = 2

# Fit a clustering model (e.g., K-Means) to your data
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(data)

# Compute the Silhouette Score (the mean of S(i) over all data points)
silhouette_avg = silhouette_score(data, cluster_labels)

# Print the Silhouette Score
print(f"Silhouette Score: {silhouette_avg}")
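Since the number of clusters is something you test, one common way to use the score is to fit the model for several candidate values and compare the results. The following is a minimal sketch under that assumption. The candidate range and the names `scores_by_k` and `best_k` are illustrative choices; note that `silhouette_score` requires at least 2 clusters and fewer clusters than data points.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Candidate cluster counts: from 2 up to (number of points - 1)
scores_by_k = {}
for k in range(2, len(data)):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(data)
    scores_by_k[k] = silhouette_score(data, labels)
    print(f"k = {k}: Silhouette Score = {scores_by_k[k]:.3f}")

# Pick the candidate with the highest mean silhouette
best_k = max(scores_by_k, key=scores_by_k.get)
print(f"Best k by Silhouette Score: {best_k}")
```

A higher average score suggests better-separated clusters, but on a dataset this small the comparison is only a demonstration of the procedure.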