**DBSCAN:**

- It stands for **Density-Based Spatial Clustering of Applications with Noise**.
- It was proposed by Martin Ester et al. in 1996.
- DBSCAN is a density-based clustering algorithm that works on the assumption that **clusters** are **dense regions** in space separated by regions of **lower density**.
- The most exciting feature of DBSCAN clustering is that it is **robust** to **outliers**.
- It also **does not** require the **number of clusters** to be told beforehand, unlike K-Means, where we have to specify the number of centroids.
- DBSCAN requires only two parameters: **epsilon** and **minPoints**. **Epsilon** is the radius of the circle to be created around each data point to check the density, and **minPoints** is the minimum number of data points required inside that circle for that data point to be classified as a **Core point**.

**How it works:**

DBSCAN **creates a circle** of epsilon radius around **every data point** and classifies each point as a **Core point**, a **Border point**, or **Noise**.

- A data point is a **Core point** if the circle around it contains at least minPoints points, counting the point itself.
- It is a **Border point** if its circle contains fewer than minPoints points but the point itself lies within epsilon of a core point.
- It is **Noise** if it is neither a core point nor a border point.

The figure shows us a cluster created by DBSCAN with **minPoints = 3**. Here, we draw a circle of equal radius epsilon around every data point. These two parameters help in creating spatial clusters.

- All the data points with at least 3 points in the circle, including itself, are considered **Core points**, represented by **red color**.
- All the data points with fewer than 3 but more than 1 point in the circle, including itself, are considered **Border points**. They are represented by **yellow color**.
- Finally, data points with no point other than itself inside the circle are considered **Noise**, represented by the **purple color**.

To measure the distance between data points, DBSCAN typically uses **Euclidean distance**, although other distance metrics can also be used.
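The classification rule can be sketched in plain Python. This is a minimal illustration, not how any library implements DBSCAN: the helper names `region_query` and `classify_points` and the sample points are made up for this example, and a border point is taken to be a non-core point that lies within epsilon of a core point.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_points):
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    # Core points have at least min_points points in their eps-neighborhood.
    core = {i for i, nbrs in enumerate(neighborhoods) if len(nbrs) >= min_points}
    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")  # not core, but within eps of a core point
        else:
            labels.append("noise")   # neither core nor border
    return labels

# Hypothetical 2-D sample: a tight group, one point on its edge, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0), (5.0, 5.0)]
labels = classify_points(points, eps=0.8, min_points=3)
print(labels)  # ['core', 'core', 'core', 'border', 'noise']
```

With these five points, the three grouped points come out as core points, the point at (1.2, 0.0) is a border point because it is within epsilon of a core point, and the isolated point is noise.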

**Reachability and Connectivity:**

These are the two concepts that you need to understand before moving further. Reachability states if a data point can be accessed from another data point directly or indirectly, whereas Connectivity states whether two data points belong to the same cluster or not. In terms of reachability and connectivity, two points in DBSCAN can be referred to as:

- Directly Density-Reachable
- Density-Reachable
- Density-Connected

Let’s understand what they are.

- X is **directly density-reachable** from point Y w.r.t. epsilon and minPoints if (1) X belongs to the neighborhood of Y, i.e., dist(X, Y) <= epsilon, and (2) Y is a core point.

Here, X is directly density-reachable from Y, but the reverse does not hold.

- A point X is **density-reachable** from point Y w.r.t. epsilon and minPoints if there is a chain of points p1, p2, p3, …, pn with p1 = Y and pn = X such that pi+1 is directly density-reachable from pi.

Here, X is density-reachable from Y, with X being directly density-reachable from P2, P2 from P3, and P3 from Y. However, the reverse does not hold.

- A point X is **density-connected** to point Y w.r.t. epsilon and minPoints if there exists a point O such that both X and Y are density-reachable from O w.r.t. epsilon and minPoints.

Here, both X and Y are density-reachable from O; therefore, we can say that X is density-connected to Y.
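The three relations can be checked with a short plain-Python sketch. The function names and the five collinear sample points below are assumptions made for illustration only. Density-reachability is computed with a breadth-first search that is allowed to continue only through core points, which is why the relation is not symmetric.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_points):
    return len(neighbors(points, i, eps)) >= min_points

def directly_density_reachable(points, x, y, eps, min_points):
    """True if point x is directly density-reachable from point y."""
    return math.dist(points[x], points[y]) <= eps and is_core(points, y, eps, min_points)

def reachable_from(points, y, eps, min_points):
    """Set of indices density-reachable from y, found by breadth-first search."""
    reached, queue = set(), deque([y])
    while queue:
        p = queue.popleft()
        if not is_core(points, p, eps, min_points):
            continue  # a chain can only continue through core points
        for q in neighbors(points, p, eps):
            if q not in reached:
                reached.add(q)
                queue.append(q)
    return reached

def density_connected(points, x, y, eps, min_points):
    """True if some point o has both x and y density-reachable from it."""
    for o in range(len(points)):
        reached = reachable_from(points, o, eps, min_points)
        if x in reached and y in reached:
            return True
    return False

# Hypothetical sample: five points on a line, one unit apart.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
eps, min_points = 1.0, 3
ddr_forward = directly_density_reachable(points, 0, 1, eps, min_points)
ddr_reverse = directly_density_reachable(points, 1, 0, eps, min_points)
reach_forward = 4 in reachable_from(points, 1, eps, min_points)
reach_reverse = 1 in reachable_from(points, 4, eps, min_points)
connected = density_connected(points, 0, 4, eps, min_points)
print(ddr_forward, ddr_reverse, reach_forward, reach_reverse, connected)
# True False True False True
```

In this sample the two end points are not core points, so nothing is reachable from them, yet both are reachable from the middle points. That makes the end points density-connected to each other even though neither is density-reachable from the other.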

**Parameter Selection in DBSCAN Clustering:**

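The **value of minPoints** should be at least one greater than the number of dimensions of the dataset, i.e., **minPoints >= Dimensions + 1**. It does not make sense to take minPoints as 1 because every point would then be a core point by definition, so even an isolated point would form its own cluster. With minPoints = 2, the result is the same as single-linkage hierarchical clustering cut at distance epsilon. Therefore, it must be at least 3.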

The **value of epsilon** can be decided from the K-distance graph, which plots, in sorted order, the distance from each point to its k-th nearest neighbor. The point of maximum curvature (elbow) in this graph tells us about the value of epsilon. If the **value of epsilon** chosen is **too small**, then a **higher number of clusters** will be created, and **more data points** will be taken as **noise**. If the **value of epsilon** chosen is **too big**, then various **small clusters** will merge into a big cluster, and we will **lose details**.
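As a rough sketch of how the values behind a K-distance graph can be computed, the plain-Python snippet below sorts each point's distance to its k-th nearest neighbor. The function name `k_distances`, the choice of k, and the sample points are assumptions for illustration; in practice the sorted values would be plotted and the elbow read off by eye.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical sample: four corners of a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curve = k_distances(points, k=2)
print(curve)
```

Here the first four values are all 1.0 and the last one is much larger, so the jump between them plays the role of the elbow. A value of epsilon slightly above 1.0 would keep the square together and leave the far point as noise.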

**Advantages**

- DBSCAN doesn’t require users to specify the **number of clusters.**
- DBSCAN is **NOT** sensitive to **outliers.**
- The clusters formed by DBSCAN can be **any shape**, which makes it robust to different types of data.

**Disadvantages**

- It would be a big concern to use DBSCAN if the data has a **very large** variation in densities across clusters, because you can only use one pair of parameters, eps and MinPts, on one dataset.
- In addition, it can be very hard to choose eps without domain knowledge of the data.

**Python Implementation for DBSCAN Clustering**

```
# import libraries
from sklearn.cluster import DBSCAN
import numpy as np
```

```
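# Creating data
# Assumption: data1 is a pandas DataFrame loaded earlier; the last column is
# dropped and the remaining columns are used as features.
X1 = data1.iloc[:,0:-1]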
```

```
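# Creating Model
# eps is the neighborhood radius (epsilon); min_samples plays the role of minPoints
model2 = DBSCAN(eps=0.5, min_samples=5)
# fit_predict returns one cluster label per row; noise points are labeled -1
label2 = model2.fit_predict(X1)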
```

```
# Evaluation
from sklearn.metrics import silhouette_score
# Score the same feature matrix that was clustered (X1)
score = silhouette_score(X1, label2)
print(score)
```

Output: 0.4860341970345691
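Because scikit-learn's DBSCAN marks noise points with the label -1, it can help to count clusters and noise points before reading too much into a single score. The sketch below is one possible way to do that; the helper name `summarize_labels` and the example label list are hypothetical stand-ins for `label2`.

```python
def summarize_labels(labels):
    """Return (number of clusters, number of noise points) for DBSCAN-style labels."""
    labels = [int(label) for label in labels]
    n_noise = labels.count(-1)            # -1 marks noise
    n_clusters = len(set(labels) - {-1})  # every other distinct label is a cluster
    return n_clusters, n_noise

# Hypothetical labels standing in for the output of fit_predict.
example_labels = [0, 0, 1, -1, 1, 0, -1]
n_clusters, n_noise = summarize_labels(example_labels)
print(n_clusters, n_noise)  # 2 2
```

Note that a silhouette score computed on all labels treats the -1 group as if it were one more cluster, so a large noise count may be worth checking alongside the score.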