K-Nearest Neighbors (KNN):
Supervised machine learning algorithm
Used for both classification and regression problems
For classification, KNN predicts the class of a test point by computing the distance between the test point and the training points, selecting the K closest training points, estimating the probability of the test point belonging to each class among those K neighbors, and assigning the class with the highest probability.
For regression, KNN predicts the value as the mean of the y-values of the K selected training points.
For a parametric setting (a specific assumed distribution) or a linear problem, linear regression tends to work better than KNN.
For a non-parametric setting (no specific assumed distribution) or a non-linear problem, KNN tends to work better than linear regression.
How KNN works for classification problems:
We have a new (test) data point and 2 classes.
Choose a K value
Calculate distances and select the K training points nearest to the test point
Find the probability of each class among these K neighbors
Select the class with the highest probability
Example:
2 classes highlighted in Blue & Red
Test data highlighted in Yellow
From the figure below, if we take K = 1, the yellow sample is closest to the red triangle class.
If we take K = 3 ,
- Number of red class values = 1
- Number of blue class values = 2
- Probability(red) = 1/3
- Probability(blue) = 2/3
Since blue has the highest probability, the yellow sample is assigned to the blue star class.
Different values of K give different results.
The best K value is determined by cross-validation, choosing the K that minimizes the error.
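A minimal sketch of the voting procedure described above (the 2-D coordinates below are made-up stand-ins for the blue/red points in the figure, not values from the original example):
import numpy as np
from collections import Counter

def knn_classify(points, labels, query, k):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(points - query, axis=1)
    # Indices of the K nearest training points
    nearest = np.argsort(dists)[:k]
    # Class with the highest probability (majority vote) among the K neighbors
    return Counter(labels[nearest]).most_common(1)[0][0]

# Hypothetical coordinates standing in for the blue and red classes
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # blue
                   [5.0, 5.0], [5.5, 4.5], [6.0, 5.5]])  # red
labels = np.array(['blue', 'blue', 'blue', 'red', 'red', 'red'])
print(knn_classify(points, labels, np.array([2.0, 2.0]), k=3))  # -> 'blue'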
How KNN works for regression problems:
Take a sample (the blue/yellow star) whose output we want to predict
Choose a K value and find the K nearest points along x
Take the y-values of those K neighbors and compute their arithmetic mean
This arithmetic mean is the predicted y-value
- For K = 1, the y-value of the star is the y-value of the red point, which is 7
- For K = 2, the y-values of the red and orange points are 7 and 8, so the y-value of the star is the arithmetic mean (7 + 8)/2 = 7.5
- For K = 3, the y-values of the red, orange, and green points are 7, 8, and 4, so the y-value of the star is the arithmetic mean (7 + 8 + 4)/3 ≈ 6.33
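A tiny sketch reproducing the arithmetic above; only the y-values (7, 8, 4) come from the example, while the x-positions below are assumptions for illustration:
import numpy as np

def knn_regress(x_vals, y_vals, x_query, k):
    # Pick the K training points closest to the query along x
    nearest = np.argsort(np.abs(x_vals - x_query))[:k]
    # The prediction is the arithmetic mean of the neighbors' y-values
    return y_vals[nearest].mean()

x_vals = np.array([3.0, 3.5, 4.5])   # hypothetical positions of the red, orange, green points
y_vals = np.array([7.0, 8.0, 4.0])   # their y-values from the example
for k in (1, 2, 3):
    print(k, knn_regress(x_vals, y_vals, x_query=3.1, k=k))   # 7.0, 7.5, 6.33...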
Distances Metrics:
The Minkowski metric is most commonly used to calculate distance.
For two dimensions:
Minkowski Distance :
d= (|X1 – X2|^p + |Y1 – Y2|^p) ^(1/p), p=power
Manhattan Distance :
if p=1, it is Manhattan distance d= |X1 – X2| + |Y1 – Y2|
Euclidean Distance :
if p=2, it is Euclidean Distance
d= (|X1 – X2|^2 + |Y1 – Y2|^2) ^(1/2)
or = √ ((X1 – X2)^2 + (Y1 – Y2)^2)
This is just the Pythagorean theorem applied to the coordinate differences.
For three dimensions:
Minkowski Distance :
d= (|X1 – X2|^p + |Y1 – Y2|^p + |Z1-Z2|^p) ^(1/p), p=power
Manhattan Distance :
if p=1, it is the Manhattan distance
d= |X1 – X2| + |Y1 – Y2| + |Z1 – Z2|
Euclidean Distance :
if p=2, it is Euclidean Distance
d= (|X1 – X2|^2 + |Y1 – Y2|^2 + |Z1 – Z2|^2) ^(1/2)
or = √ ((X1 – X2)^2 + (Y1 – Y2)^2 + (Z1 – Z2)^2)
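A small NumPy sketch checking that the Minkowski distance with p = 1 and p = 2 reduces to the Manhattan and Euclidean formulas above (the two points are arbitrary):
import numpy as np

def minkowski(a, b, p):
    # d = (|x1 - x2|^p + |y1 - y2|^p + ...)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(minkowski(a, b, p=1))           # Manhattan: |1-4| + |2-6| + |3-3| = 7
print(minkowski(a, b, p=2))           # Euclidean: sqrt(9 + 16 + 0) = 5
print(np.linalg.norm(a - b, ord=1))   # same result as p=1
print(np.linalg.norm(a - b))          # same result as p=2
Note that scikit-learn's KNeighborsClassifier and KNeighborsRegressor use the Minkowski metric with p=2 (Euclidean) by default; the power can be changed through the p parameter.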
Pros:
- Intuitive and easy to understand
- Easy to implement
- Fitting (training) is very time efficient
- Non-parametric, so it adapts easily to new data
- Hyperparameter tuning is straightforward
Cons:
- Not a preferable choice for extrapolation tasks
- Needs more data than parametric models to make good predictions
- The fitted model can take up a lot of memory, since all training data is stored
- Prediction can be slow for big datasets
- Can suffer from the curse of dimensionality
- Not preferable for datasets with many categorical features
- Sensitive to outliers and imbalanced data
Applications of KNN:
- Recommending ads (YouTube) or products (Amazon) to a user
- Estimating an individual's credit rating
- Video recognition, image recognition, and text detection with advanced KNN variants
Why is KNN called Lazy Learners?
k-NN algorithms are often termed lazy learners. Let's understand why.
Most algorithms, such as Bayesian classification, logistic regression, and SVM, are called eager learners.
These algorithms generalize over the training set before receiving the test data, i.e. they build a model from the training data first and then perform prediction/classification on the test data.
This is not the case with the k-NN algorithm. It does not build a generalized model from the training set; it waits for the test data.
Only once the test data is provided does it start working through the training data to classify the test points.
So a lazy learner just stores the training data and waits for the test set. Such algorithms do little work while training and more work while classifying a given test dataset.
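A quick illustration of this behaviour with scikit-learn (a sketch on synthetic data; the timings are machine-dependent, and fit mostly stores/indexes the data while predict does the distance work):
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 10))    # synthetic data: 20000 points, 10 features
y = (X[:, 0] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=5)
start = time.time()
model.fit(X, y)                     # "training": essentially just stores/indexes the data
print('fit time:', round(time.time() - start, 4), 's')
start = time.time()
model.predict(X[:2000])             # prediction: distances are computed only now
print('predict time:', round(time.time() - start, 4), 's')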
Python Implementation of KNN for Classification Problem:
Business case: to predict whether a person will have diabetes or not
As the answer is yes or no, this is a classification problem
We will use the KNN classifier in this case
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Load dataset
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /05_KNN/KNN Class/diabetes.csv')
data.head()
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            2      138             62             35        0  33.6                     0.127   47        1
1            0       84             82             31      125  38.2                     0.233   23        0
2            0      145              0              0        0  44.2                     0.630   31        1
3            0      135             68             42      250  42.3                     0.365   24        1
4            1      139             62             41      480  40.7                     0.536   21        0
Basic Checks:
data.describe()
Output:
       Pregnancies      Glucose  BloodPressure  SkinThickness      Insulin          BMI  DiabetesPedigreeFunction          Age      Outcome
count  2000.000000  2000.000000    2000.000000    2000.000000  2000.000000  2000.000000               2000.000000  2000.000000  2000.000000
mean      3.703500   121.182500      69.145500      20.935000    80.254000    32.193000                  0.470930    33.090500     0.342000
std       3.306063    32.068636      19.188315      16.103243   111.180534     8.149901                  0.323553    11.786423     0.474498
min       0.000000     0.000000       0.000000       0.000000     0.000000     0.000000                  0.078000    21.000000     0.000000
25%       1.000000    99.000000      63.500000       0.000000     0.000000    27.375000                  0.244000    24.000000     0.000000
50%       3.000000   117.000000      72.000000      23.000000    40.000000    32.300000                  0.376000    29.000000     0.000000
75%       6.000000   141.000000      80.000000      32.000000   130.000000    36.800000                  0.624000    40.000000     1.000000
max      17.000000   199.000000     122.000000     110.000000   744.000000    80.600000                  2.420000    81.000000     1.000000
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               2000 non-null   int64
 1   Glucose                   2000 non-null   int64
 2   BloodPressure             2000 non-null   int64
 3   SkinThickness             2000 non-null   int64
 4   Insulin                   2000 non-null   int64
 5   BMI                       2000 non-null   float64
 6   DiabetesPedigreeFunction  2000 non-null   float64
 7   Age                       2000 non-null   int64
 8   Outcome                   2000 non-null   int64
dtypes: float64(2), int64(7)
memory usage: 140.8 KB
data.isnull().sum()
Output:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
Observations from basic checks
8 independent features and 1 dependent feature (Outcome) are present in the dataset
The number of rows is 2000
The features are not all on the same scale, so scaling is required
No categorical features are present in the dataset
Glucose, BloodPressure, SkinThickness, Insulin, and BMI cannot be zero, so invalid (placeholder) values are present
No null values are present in the dataset
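A quick check of the "zeros are not physically possible" observation (a sketch using the columns of the dataset loaded above):
# Count zero entries in columns where zero is not a meaningful value
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((data[zero_cols] == 0).sum())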
Exploratory Data Analysis
# Univariate analysis
plt.figure(figsize=(8,7))
plotnumber =1
for column in data:
    if plotnumber<=9:
        ax = plt.subplot(3,3,plotnumber)
        sns.histplot(data[column],kde=True)
        plt.xlabel(column)
    plotnumber+=1
plt.show()
Observations from univariate analysis
Pregnancies, DiabetesPedigreeFunction, and Age are right-skewed
Glucose, BloodPressure, SkinThickness, Insulin, and BMI are roughly normally distributed but have outliers; also, BloodPressure cannot be zero
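The outliers mentioned above can be visualized with boxplots, for example (a sketch; any subset of columns can be plotted):
# Boxplots of the features noted above to visualize outliers
plt.figure(figsize=(8,4))
sns.boxplot(data=data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']])
plt.xticks(rotation=45)
plt.show()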
Bivariate analysis & Multivariate analysis : Skipping for now
Data Preprocesssing:
# Replacing zero values with the median, since these are continuous numerical features with outliers
data['Glucose'].replace(0,data['Glucose'].median(),inplace=True)
data['BloodPressure'].replace(0,data['BloodPressure'].median(),inplace=True)
data['SkinThickness'].replace(0,data['SkinThickness'].median(),inplace=True)
data['Insulin'].replace(0,data['Insulin'].median(),inplace=True)
data['BMI'].replace(0,data['BMI'].median(),inplace=True)
# Univariate analysis after Replacing Zero
plt.figure(figsize=(8,7))
plotnumber =1
for column in data:
    if plotnumber<=9:
        ax = plt.subplot(3,3,plotnumber)
        sns.histplot(data[column],kde=True)
        plt.xlabel(column)
    plotnumber+=1
plt.show()
Feature Engineering & Selection:
# Creating independent & dependent variables
X = data.drop('Outcome',axis=1)
y = data['Outcome']
# Creating training & testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=24)
# Scaling down data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_scaled_train = sc.fit_transform(X_train)
X_scaled_test = sc.transform(X_test)
Model Creation
# Creating KNN Classification model
from sklearn.neighbors import KNeighborsClassifier
# Create a list to store error values for each K
error_rate = []
accuracy_rate = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors = i)
    model.fit(X_scaled_train,y_train)
    y_predict = model.predict(X_scaled_test)
    error_rate.append(np.mean(y_predict!=y_test))
    accuracy_rate.append(1-np.mean(y_predict!=y_test))
# Using Cross validation score
from sklearn.model_selection import cross_val_score
accuracy_rate1 = []
error_rate1 = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    score = cross_val_score(model,X_scaled_train,y_train,cv=10)
    accuracy_rate1.append(score.mean())
    error_rate1.append(1-score.mean())
# Plot K-value and error rate
plt.figure(figsize=(8,5))
plt.plot(range(1,11),error_rate1,color='blue',marker='o',linestyle='-.',label='CV error rate')
plt.plot(range(1,11),accuracy_rate1,color='red',marker='^',linestyle='-.',label='CV accuracy')
plt.plot(range(1,11),error_rate,color='green',marker='o',linestyle='-.',label='Test error rate')
plt.plot(range(1,11),accuracy_rate,color='cyan',marker='^',linestyle='-.',label='Test accuracy')
plt.title('Error & Accuracy Rate vs. K Value')
plt.xlabel('K-value')
plt.ylabel('Error & Accuracy Rate')
plt.legend()
plt.show()
Note: in practice, use either the accuracy or the error rate from just one of the two methods; all four curves are plotted here only for comparison.
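For example, the K with the lowest cross-validated error can be read off programmatically (a sketch; K = 4 below was chosen from the plot in the same spirit):
# Pick the K value with the lowest cross-validated error rate
best_k = int(np.argmin(error_rate1)) + 1   # +1 because the K values start at 1
print('Best K by CV error:', best_k)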
# Final model with best K value
model1 = KNeighborsClassifier(n_neighbors = 4)
model1.fit(X_scaled_train,y_train)
y_predict1 = model1.predict(X_scaled_test)
Model Evaluation:
from sklearn.metrics import accuracy_score,classification_report
accuracy_score(y_test,y_predict1)
Output: 0.782
print(classification_report(y_test,y_predict1))
Output:
              precision    recall  f1-score   support

           0       0.78      0.92      0.85       328
           1       0.78      0.51      0.62       172

    accuracy                           0.78       500
   macro avg       0.78      0.72      0.73       500
weighted avg       0.78      0.78      0.77       500
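Since the classes are imbalanced (328 vs. 172 in the test set) and KNN is sensitive to imbalance, a confusion matrix is a useful complement to the report above; a minimal sketch:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_predict1)
ConfusionMatrixDisplay(cm, display_labels=model1.classes_).plot()
plt.show()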
Python Implementation of KNN for Regression Problem:
# import libraries
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Creating dataset
inputs, target = make_regression(n_samples = 300,
n_features = 1,
noise = 15,
random_state = 365)
target = target/100 # not strictly necessary; just scales the target values down
# Creating training & testing data
x_train, x_test, y_train, y_test = train_test_split(inputs,
target,
test_size = 0.2,
random_state = 365)
# Create a list where all predictions from the KNN regressions will be stored
y_pred_knn = []
# Create KNN regression models with K = 1, 10, and 40
for i in [1, 10, 40]:
    # Create an instance of the KNN regressor with the specified number of neighbors
    reg_knn = KNeighborsRegressor(n_neighbors = i)
    # Fit the model to the training data
    reg_knn.fit(x_train, y_train)
    # Make predictions on the test data and store them in y_pred_knn
    y_pred_knn.append(reg_knn.predict(x_test))
# Selecting K value
mse_knn = []
for i in range(1, 41):
    reg_knn = KNeighborsRegressor(n_neighbors = i)
    reg_knn.fit(x_train, y_train)
    y_pred_knn = reg_knn.predict(x_test)
    mse_knn.append(mean_squared_error(y_test, y_pred_knn))
sns.set()
fig, ax = plt.subplots()
ax.plot(list(range(1, 41)),
mse_knn,
color = 'red',
marker = 'o',
markerfacecolor = '#000C1F',
label = 'KNN')
ax.legend(loc='lower right')
ax.set_title('Mean-Squared Error (MSE)')
ax.set_xlabel('K')
ax.set_ylabel('MSE')
#Final Model
reg_knn1 = KNeighborsRegressor(n_neighbors = 10)
reg_knn1.fit(x_train, y_train)
y_pred_knn1 = reg_knn1.predict(x_test)
# Inspect the 10 nearest training neighbors of a query point x = 0.5
neighbors = reg_knn1.kneighbors([[0.5]])
neighbors # returns the distances to the neighbors and their indices in x_train
Output: (array([[0.00343521, 0.01076282, 0.02350518, 0.02668283, 0.03571938, 0.0358432 , 0.03606684, 0.04745948, 0.05127815, 0.05897151]]), array([[203, 108, 113, 8, 10, 22, 133, 45, 26, 131]]))
# Evaluation
MSE = mean_squared_error(y_test, y_pred_knn1)
MSE
Output: 0.03106708890252071
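LinearRegression was imported above but not used. Since make_regression produces data that is linear by construction, a quick comparison on the same split (a sketch) ties back to the parametric vs. non-parametric note at the top:
# Fit a parametric baseline on the same training data
reg_lin = LinearRegression()
reg_lin.fit(x_train, y_train)
y_pred_lin = reg_lin.predict(x_test)
print('Linear regression MSE:', mean_squared_error(y_test, y_pred_lin))
print('KNN (K=10) MSE:', MSE)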