DBSCAN:

  • It stands for Density-Based Spatial Clustering of Applications with Noise.
  • It was proposed by Martin Ester et al. in 1996.
  • DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density.
  • A key strength of DBSCAN clustering is that it is robust to outliers: points in sparse regions are labeled as noise instead of being forced into a cluster.
  • It also does not require the number of clusters to be specified beforehand, unlike K-Means, where we have to specify the number of centroids.
  • DBSCAN requires only two parameters: epsilon and minPoints.
  • Epsilon is the radius of the circle drawn around each data point to check the density.
  • minPoints is the minimum number of data points required inside that circle, counting the point itself, for that data point to be classified as a Core point.

How it works:

  • DBSCAN creates a circle of epsilon radius around every data point and classifies each point as a Core point, a Border point, or Noise.

  • A data point is a Core point if the circle around it contains at least ‘minPoints’ number of points.

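  • If the circle contains fewer than minPoints points but the data point lies within epsilon of a Core point, then it is classified as a Border point, and

  • If a data point is neither a Core point nor a Border point, for example when no other data point lies within its epsilon radius, then it is treated as Noise.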

  • The figure shows us a cluster created by DBSCAN with minPoints = 3. Here, we draw a circle of equal radius epsilon around every data point. Together, epsilon and minPoints determine which spatial clusters are formed.

    • All the data points with at least 3 points in the circle, including itself, are considered Core points. They are represented in red.
    • All the data points with fewer than 3 points in the circle, including itself, that lie inside the circle of a Core point are considered Border points. They are represented in yellow.
    • Finally, data points that are neither Core nor Border points, such as those with no point other than itself inside the circle, are considered Noise. They are represented in purple.
  • To measure how far apart data points are, DBSCAN most commonly uses Euclidean distance, although other distance metrics can be used.
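
The classification rules above can be sketched in a few lines of plain Python. This is only an illustration, not how any library implements DBSCAN: the function names, the five sample points, and the values eps = 0.8 and min_points = 3 are all made up for the example, and the circle is assumed to include the point itself.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def classify_points(points, eps, min_points):
    """Label every point as 'core', 'border', or 'noise'."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(n) >= min_points for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense enough itself, but inside the circle of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: three close points, one nearby point, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0), (5.0, 5.0)]
labels = classify_points(points, eps=0.8, min_points=3)
print(labels)
```

With these values, the first three points come out as core, the fourth as border (it is within epsilon of the second point, which is core), and the last as noise.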

Reachability and Connectivity:

  • These are the two concepts that you need to understand before moving further. Reachability describes whether one data point can be reached from another, directly or indirectly, whereas Connectivity describes whether two data points belong to the same cluster. In terms of reachability and connectivity, two points in DBSCAN can be referred to as:

    1. Directly Density-Reachable
    2. Density-Reachable
    3. Density-Connected

Let’s understand what they are.

  1. A point X is directly density-reachable from point Y w.r.t. epsilon and minPoints if X belongs to the neighborhood of Y, i.e., dist(X, Y) <= epsilon, and Y is a core point.

Here, X is directly density-reachable from Y, but the reverse is not true, because X is not a core point.

  2. A point X is density-reachable from point Y w.r.t. epsilon and minPoints if there is a chain of points p1, p2, p3, …, pn with p1 = Y and pn = X such that pi+1 is directly density-reachable from pi.

Here, X is density-reachable from Y, with X being directly density-reachable from P2, P2 from P3, and P3 from Y. The inverse, however, is not valid.

  3. A point X is density-connected to point Y w.r.t. epsilon and minPoints if there exists a point O such that both X and Y are density-reachable from O w.r.t. epsilon and minPoints.

Here, both X and Y are density-reachable from O; therefore, we can say that X is density-connected to Y.
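
The three relations can be expressed as small Python functions, which may make the asymmetry easier to see. This is a sketch under assumptions: the function names are invented for illustration, points are addressed by index, and the sample data (six points on a line, eps = 1.1, min_points = 3) is hypothetical.

```python
from collections import deque
from math import dist

def neighborhood(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def is_core(points, i, eps, min_points):
    return len(neighborhood(points, i, eps)) >= min_points

def directly_density_reachable(points, x, y, eps, min_points):
    """True if point x is directly density-reachable from point y."""
    return is_core(points, y, eps, min_points) and dist(points[x], points[y]) <= eps

def density_reachable(points, x, y, eps, min_points):
    """True if x can be reached from y through a chain of core points."""
    seen, queue = {y}, deque([y])
    while queue:
        p = queue.popleft()
        if not is_core(points, p, eps, min_points):
            continue  # only core points can extend the chain
        for q in neighborhood(points, p, eps):
            if q == x:
                return True
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return False

def density_connected(points, x, y, eps, min_points):
    """True if some point o exists from which both x and y are density-reachable."""
    return any(
        density_reachable(points, x, o, eps, min_points)
        and density_reachable(points, y, o, eps, min_points)
        for o in range(len(points))
    )

# Hypothetical data: five evenly spaced points plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
eps, min_points = 1.1, 3

direct_forward = directly_density_reachable(points, 0, 1, eps, min_points)
direct_reverse = directly_density_reachable(points, 1, 0, eps, min_points)
reach_forward = density_reachable(points, 4, 1, eps, min_points)
reach_reverse = density_reachable(points, 1, 4, eps, min_points)
connected_ends = density_connected(points, 0, 4, eps, min_points)
connected_far = density_connected(points, 0, 5, eps, min_points)
print(direct_forward, direct_reverse, reach_forward, reach_reverse,
      connected_ends, connected_far)
```

In this toy data the two end points of the line are border points: each is reachable from the core points in the middle, but nothing is reachable from them, so reachability is one-directional while density-connectivity is symmetric.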

Parameter Selection in DBSCAN Clustering:

  • The value of minPoints should be at least one greater than the number of dimensions of the dataset, i.e.,

    minPoints>=Dimensions+1

  • It does not make sense to take minPoints as 1, because every point would then be a Core point on its own and no point could ever be treated as noise. As a rule of thumb, it should be at least 3.

  • The value of epsilon can be decided from the K-distance graph, which plots each point's distance to its k-th nearest neighbor in sorted order. The point of maximum curvature (elbow) in this graph suggests a value for epsilon.

  • If the value of epsilon chosen is too small, then a higher number of clusters will be created, and more data points will be taken as noise.

  • If the value of epsilon chosen is too big, then various small clusters will merge into a big cluster, and we will lose details.
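
As a rough illustration of the K-distance idea, the sorted distances can be computed directly. The helper name k_distances, the choice k = 2, and the sample points are assumptions made for this sketch; in practice k is usually tied to minPoints, and a nearest-neighbor routine from a library would replace the brute-force loop.

```python
from math import dist

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)

# Hypothetical data: four points on a unit square plus one distant outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
print(kd)
```

Plotting these values against their rank gives the K-distance graph. Here the first four values are equal and the last one jumps sharply, so an epsilon just above the flat part would keep the square together and leave the outlier as noise.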

Advantages

  • DBSCAN doesn’t require users to specify the number of clusters.

  • DBSCAN is NOT sensitive to outliers.

  • The clusters formed by DBSCAN can be any shape, which makes it robust to different types of data.

Disadvantages

  • DBSCAN struggles when the data has a very large variation in densities across clusters, because only one pair of parameters, epsilon and minPoints, can be used on one dataset.

  • In addition, it can be very hard to choose epsilon without domain knowledge of the data.

Python Implementation for DBSCAN Clustering

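# Import libraries

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Creating data
# data1 is assumed to be a pandas DataFrame loaded earlier;
# every column except the last is used as a feature

X1 = data1.iloc[:, 0:-1]

# Creating Model

model2 = DBSCAN(eps=0.5, min_samples=5)
label2 = model2.fit_predict(X1)  # noise points receive the label -1

# Evaluation
# Score the same data the model was fitted on (X1)

score = silhouette_score(X1, label2)
print(score)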
Output:
0.4860341970345691
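
The labels returned by fit_predict can also be summarized directly. In scikit-learn, DBSCAN marks noise points with the label -1, so that label should not be counted as a cluster. The helper below is a hypothetical convenience function, and the label list is made-up example data rather than output from the dataset above.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters and noise points in a DBSCAN label array (-1 means noise)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # remove the noise label before counting clusters
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(counts),
    }

# Hypothetical labels, in the same format fit_predict returns.
example_labels = [0, 0, 1, -1, 1, 0, -1]
summary = summarize_labels(example_labels)
print(summary)
```

Calling summarize_labels(label2) on real output would show how many clusters were found and how many points were left as noise, which is useful context when reading a single silhouette score.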


I am an enthusiastic advocate for the transformative power of data in the fashion realm. Armed with a strong background in data science, I am committed to revolutionizing the industry by unlocking valuable insights, optimizing processes, and fostering a data-centric culture that propels fashion businesses into a successful and forward-thinking future. - Masud Rana, Certified Data Scientist, IABAC

© Data4Fashion 2023-2024
