Hierarchical Clustering
The algorithm builds clusters by measuring the dissimilarities between the data points.
Hierarchical clustering groups data points and visualizes the clusters using both a dendrogram and a scatter plot.
A dendrogram is a tree-like diagram that records the sequence of merges or splits.
Approaches for Clustering:
The clustering approaches can be broadly divided into two categories: Agglomerative and Divisive.
Agglomerative:
- This approach first considers every point as an individual cluster.
- It then finds the two most similar points and merges them into a cluster.
- It keeps merging the most similar points and clusters until only one cluster is left, i.e., all points belong to one big cluster.
- This is also called the bottom-up approach.
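The bottom-up merging described above can be sketched in a few lines of plain Python. This is an illustrative toy (single linkage on 1-D points, a made-up `agglomerate` helper), not the implementation used later in the article:

```python
# Minimal agglomerative sketch: repeatedly merge the two closest
# clusters (single linkage) until only one cluster remains.
def agglomerate(points):
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest point-to-point distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return merges

merges = agglomerate([1, 2, 9, 10, 25])
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
# → [1] + [2] at distance 1
#   [9] + [10] at distance 1
#   [1, 2] + [9, 10] at distance 7
#   [1, 2, 9, 10] + [25] at distance 15
```

The merge sequence printed here is exactly what a dendrogram records: which clusters merged, and at what distance.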
Divisive:
- It is the opposite of the agglomerative approach.
- It first considers all the points to be part of one big cluster and, in subsequent steps, finds the points/clusters that are least similar to each other, splitting the bigger cluster into smaller ones.
- This continues until there are as many clusters as there are data points.
- This is also called the top-down approach.
How it works
We will use agglomerative clustering, which follows the bottom-up approach.
We begin by treating each data point as its own cluster.
Then, we join clusters together that have the shortest distance between them to create larger clusters.
This step is repeated until one large cluster is formed containing all of the data points.
Hierarchical clustering requires us to decide on both a distance and linkage method.
We will use Euclidean distance and the Ward linkage method, which at each merge minimizes the increase in total within-cluster variance.
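To see what the distance and linkage choices produce, we can inspect SciPy's linkage matrix on a small toy dataset (`X_toy` here is an assumed stand-in, not the article's `X`). Each row records one merge: the two cluster indices, the merge distance, and the size of the new cluster:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: two tight pairs of points
X_toy = np.array([[1, 1], [1, 2], [8, 8], [8, 9]])

# Ward linkage with Euclidean distance, as used below
Z = linkage(X_toy, method='ward', metric='euclidean')
print(Z)
# Each row: [cluster_i, cluster_j, merge_distance, new_cluster_size]
# The two nearby pairs merge first (distance 1.0), then the pairs merge last.
```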
Python Implementation for Hierarchical Clustering
# Import necessary libraries
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Creating the dendrogram (X is the feature matrix prepared earlier)
linkage_data = linkage(X, method='ward', metric='euclidean')
dendrogram(linkage_data)
plt.show()
# Creating the model
# (the 'affinity' parameter was renamed to 'metric' in scikit-learn 1.2)
model3 = AgglomerativeClustering(n_clusters=2,
                                 metric='euclidean',
                                 linkage='ward')
label3 = model3.fit_predict(X)
# Evaluation
from sklearn.metrics import silhouette_score
score = silhouette_score(X, label3)
print(score)
Output: 0.6867350732769781
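The dendrogram built with SciPy can also be cut directly into flat cluster labels using `fcluster`, without fitting a separate scikit-learn model. A minimal sketch on assumed toy data (`X_toy`, standing in for the article's `X`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for the feature matrix X used above
X_toy = np.array([[1, 1], [1, 2], [8, 8], [8, 9]])

Z = linkage(X_toy, method='ward', metric='euclidean')
# Cut the hierarchy so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # labels are 1-based; the two point pairs land in different clusters
```

Cutting the same tree at different values of `t` yields different numbers of clusters, which is one of the main conveniences of hierarchical clustering over k-means.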