K-Nearest Neighbors (KNN):
Supervised machine learning algorithm
Used for both classification and regression problems
For classification, KNN predicts the class of a test point by computing the distance between the test point and the training points, selecting the K closest training points, estimating the probability of the test point belonging to each class among those K neighbors, and assigning the class with the highest probability.
For regression, KNN predicts the value as the mean of the y-values of the K selected training points.
For a parametric setting (a specific assumed distribution) or a linear problem, linear regression tends to work better than KNN.
For a non-parametric setting (no specific assumed distribution) or a non-linear problem, KNN tends to work better than linear regression.
How KNN works for classification problems:
We have a new (test) data point and 2 classes.
Choose a K value
Calculate distances and select the K training points nearest to the test point
Find the probability of each class among these K neighbors
Select the class with the highest probability
Example:
2 classes highlighted in Blue & Red
Test data highlighted in Yellow
From the figure below, if we take K = 1, the yellow sample is closest to the red triangle class.
If we take K = 3 ,
- Number of red class values = 1
- Number of blue class values = 2
- Probability(red) = 1/3
- Probability(blue) = 2/3
Since blue has the highest probability, the yellow sample is assigned to the blue star class.
Different values of K give different results.
The best K value is determined by cross-validation, choosing the K that minimizes the error.
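A minimal sketch of the voting procedure described above (the 2-D coordinates below are made-up stand-ins for the blue/red points in the figure, not values from the original example):
import numpy as np
from collections import Counter

def knn_classify(points, labels, query, k):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(points - query, axis=1)
    # Indices of the K nearest training points
    nearest = np.argsort(dists)[:k]
    # Class with the highest probability (majority vote) among the K neighbors
    return Counter(labels[nearest]).most_common(1)[0][0]

# Hypothetical coordinates standing in for the blue and red classes
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # blue
                   [5.0, 5.0], [5.5, 4.5], [6.0, 5.5]])  # red
labels = np.array(['blue', 'blue', 'blue', 'red', 'red', 'red'])
print(knn_classify(points, labels, np.array([2.0, 2.0]), k=3))  # -> 'blue'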
How KNN works for regression problems:
Take a sample (the blue/yellow star) whose output we want to predict
Choose a K value and find the K nearest points along x
Take the y-values of those K neighbors and compute their arithmetic mean
This arithmetic mean is the predicted y-value
- For K = 1, the y-value of the star is the y-value of the red point, which is 7
- For K = 2, the y-values of the red and orange points are 7 and 8, so the y-value of the star is the arithmetic mean (7 + 8)/2 = 7.5
- For K = 3, the y-values of the red, orange, and green points are 7, 8, and 4, so the y-value of the star is the arithmetic mean (7 + 8 + 4)/3 ≈ 6.33
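A tiny sketch reproducing the arithmetic above; only the y-values (7, 8, 4) come from the example, while the x-positions below are assumptions for illustration:
import numpy as np

def knn_regress(x_vals, y_vals, x_query, k):
    # Pick the K training points closest to the query along x
    nearest = np.argsort(np.abs(x_vals - x_query))[:k]
    # The prediction is the arithmetic mean of the neighbors' y-values
    return y_vals[nearest].mean()

x_vals = np.array([3.0, 3.5, 4.5])   # hypothetical positions of the red, orange, green points
y_vals = np.array([7.0, 8.0, 4.0])   # their y-values from the example
for k in (1, 2, 3):
    print(k, knn_regress(x_vals, y_vals, x_query=3.1, k=k))   # 7.0, 7.5, 6.33...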
Distances Metrics:
The Minkowski metric is most commonly used to calculate distance.
For two dimensions:
Minkowski Distance :
d= (|X1 – X2|^p + |Y1 – Y2|^p) ^(1/p), p=power
Manhattan Distance :
if p=1, it is Manhattan distance d= |X1 – X2| + |Y1 – Y2|
Euclidean Distance :
if p=2, it is Euclidean Distance
d= (|X1 – X2|^2 + |Y1 – Y2|^2) ^(1/2)
or = √ ((X1 – X2)^2 + (Y1 – Y2)^2)
This is just the Pythagorean theorem applied to the coordinate differences.
For three dimensions:
Minkowski Distance :
d= (|X1 – X2|^p + |Y1 – Y2|^p + |Z1-Z2|^p) ^(1/p), p=power
Manhattan Distance :
if p=1, it is the Manhattan distance
d= |X1 – X2| + |Y1 – Y2| + |Z1 – Z2|
Euclidean Distance :
if p=2, it is Euclidean Distance
d= (|X1 – X2|^2 + |Y1 – Y2|^2 + |Z1 – Z2|^2) ^(1/2)
or = √ ((X1 – X2)^2 + (Y1 – Y2)^2 + (Z1 – Z2)^2)
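A small NumPy sketch checking that the Minkowski distance with p = 1 and p = 2 reduces to the Manhattan and Euclidean formulas above (the two points are arbitrary):
import numpy as np

def minkowski(a, b, p):
    # d = (|x1 - x2|^p + |y1 - y2|^p + ...)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(minkowski(a, b, p=1))           # Manhattan: |1-4| + |2-6| + |3-3| = 7
print(minkowski(a, b, p=2))           # Euclidean: sqrt(9 + 16 + 0) = 5
print(np.linalg.norm(a - b, ord=1))   # same result as p=1
print(np.linalg.norm(a - b))          # same result as p=2
Note that scikit-learn's KNeighborsClassifier and KNeighborsRegressor use the Minkowski metric with p=2 (Euclidean) by default; the power can be changed through the p parameter.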
Pros:
- Intuitive and easy to understand
- Easy to implement
- Fitting (training) is very time efficient
- Non-parametric, so it adapts easily to new data
- Hyperparameter tuning is straightforward
Cons:
- Not a preferable choice for extrapolation tasks
- Needs more data than parametric models to make good predictions
- The fitted model can take up a lot of memory, since all training data is stored
- Prediction can be slow for big datasets
- Can suffer from the curse of dimensionality
- Not preferable for datasets with many categorical features
- Sensitive to outliers and imbalanced data
Applications of KNN:
- Recommending ads (YouTube) or products (Amazon) to a user
- Estimating an individual's credit rating
- Video recognition, image recognition, and text detection with advanced KNN variants
Why is KNN called Lazy Learners?
k-NN algorithms are often termed lazy learners. Let's understand why.
Most algorithms, such as Bayesian classification, logistic regression, and SVM, are called eager learners.
These algorithms generalize over the training set before receiving the test data, i.e. they build a model from the training data first and then perform prediction/classification on the test data.
This is not the case with the k-NN algorithm. It does not build a generalized model from the training set; it waits for the test data.
Only once the test data is provided does it start working through the training data to classify the test points.
So a lazy learner just stores the training data and waits for the test set. Such algorithms do little work while training and more work while classifying a given test dataset.
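A quick illustration of this behaviour with scikit-learn (a sketch on synthetic data; the timings are machine-dependent, and fit mostly stores/indexes the data while predict does the distance work):
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 10))    # synthetic data: 20000 points, 10 features
y = (X[:, 0] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=5)
start = time.time()
model.fit(X, y)                     # "training": essentially just stores/indexes the data
print('fit time:', round(time.time() - start, 4), 's')
start = time.time()
model.predict(X[:2000])             # prediction: distances are computed only now
print('predict time:', round(time.time() - start, 4), 's')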
Python Implementation of KNN for Classification Problem:
Business case: to predict whether a person will have diabetes or not
As the answer is yes or no, this is a classification problem
We will use the KNN classifier in this case
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Load dataset
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /05_KNN/KNN Class/diabetes.csv')
data.head()
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            2      138             62             35        0  33.6                     0.127   47        1
1            0       84             82             31      125  38.2                     0.233   23        0
2            0      145              0              0        0  44.2                     0.630   31        1
3            0      135             68             42      250  42.3                     0.365   24        1
4            1      139             62             41      480  40.7                     0.536   21        0
Basic Checks:
data.describe()
Output:
       Pregnancies      Glucose  BloodPressure  SkinThickness      Insulin          BMI  DiabetesPedigreeFunction          Age      Outcome
count  2000.000000  2000.000000    2000.000000    2000.000000  2000.000000  2000.000000               2000.000000  2000.000000  2000.000000
mean      3.703500   121.182500      69.145500      20.935000    80.254000    32.193000                  0.470930    33.090500     0.342000
std       3.306063    32.068636      19.188315      16.103243   111.180534     8.149901                  0.323553    11.786423     0.474498
min       0.000000     0.000000       0.000000       0.000000     0.000000     0.000000                  0.078000    21.000000     0.000000
25%       1.000000    99.000000      63.500000       0.000000     0.000000    27.375000                  0.244000    24.000000     0.000000
50%       3.000000   117.000000      72.000000      23.000000    40.000000    32.300000                  0.376000    29.000000     0.000000
75%       6.000000   141.000000      80.000000      32.000000   130.000000    36.800000                  0.624000    40.000000     1.000000
max      17.000000   199.000000     122.000000     110.000000   744.000000    80.600000                  2.420000    81.000000     1.000000
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               2000 non-null   int64
 1   Glucose                   2000 non-null   int64
 2   BloodPressure             2000 non-null   int64
 3   SkinThickness             2000 non-null   int64
 4   Insulin                   2000 non-null   int64
 5   BMI                       2000 non-null   float64
 6   DiabetesPedigreeFunction  2000 non-null   float64
 7   Age                       2000 non-null   int64
 8   Outcome                   2000 non-null   int64
dtypes: float64(2), int64(7)
memory usage: 140.8 KB
data.isnull().sum()
Output:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
Observations from basic checks
8 independent features and 1 dependent feature (Outcome) are present in the dataset
The number of rows is 2000
The features are not all on the same scale, so scaling is required
No categorical features are present in the dataset
Glucose, BloodPressure, SkinThickness, Insulin, and BMI cannot be zero, so invalid (placeholder) values are present
No null values are present in the dataset
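A quick check of the "zeros are not physically possible" observation (a sketch using the columns of the dataset loaded above):
# Count zero entries in columns where zero is not a meaningful value
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((data[zero_cols] == 0).sum())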
Exploratory Data Analysis
# Univariate analysis
plt.figure(figsize=(8,7))
plotnumber =1
for column in data:
    if plotnumber<=9:
        ax = plt.subplot(3,3,plotnumber)
        sns.histplot(data[column],kde=True)
        plt.xlabel(column)
    plotnumber+=1
plt.show()
Observations from univariate analysis
Pregnancies, DiabetesPedigreeFunction, and Age are right-skewed
Glucose, BloodPressure, SkinThickness, Insulin, and BMI are roughly normally distributed but have outliers; also, BloodPressure cannot be zero
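The outliers mentioned above can be visualized with boxplots, for example (a sketch; any subset of columns can be plotted):
# Boxplots of the features noted above to visualize outliers
plt.figure(figsize=(8,4))
sns.boxplot(data=data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']])
plt.xticks(rotation=45)
plt.show()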
Bivariate analysis & Multivariate analysis : Skipping for now
Data Preprocesssing:
# Replacing zero values with the median, since these are continuous numerical features with outliers
data['Glucose'].replace(0,data['Glucose'].median(),inplace=True)
data['BloodPressure'].replace(0,data['BloodPressure'].median(),inplace=True)
data['SkinThickness'].replace(0,data['SkinThickness'].median(),inplace=True)
data['Insulin'].replace(0,data['Insulin'].median(),inplace=True)
data['BMI'].replace(0,data['BMI'].median(),inplace=True)
# Univariate analysis after Replacing Zero
plt.figure(figsize=(8,7))
plotnumber =1
for column in data:
    if plotnumber<=9:
        ax = plt.subplot(3,3,plotnumber)
        sns.histplot(data[column],kde=True)
        plt.xlabel(column)
    plotnumber+=1
plt.show()
Feature Engineering & Selection:
# Creating independent & dependent variables
X = data.drop('Outcome',axis=1)
y = data['Outcome']
# Creating training & testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=24)
# Scaling down data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_scaled_train = sc.fit_transform(X_train)
X_scaled_test = sc.transform(X_test)
Model Creation
# Creating KNN Classification model
from sklearn.neighbors import KNeighborsClassifier
# Create a list to store error values for each K
error_rate = []
accuracy_rate = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors = i)
    model.fit(X_scaled_train,y_train)
    y_predict = model.predict(X_scaled_test)
    error_rate.append(np.mean(y_predict!=y_test))
    accuracy_rate.append(1-np.mean(y_predict!=y_test))
# Using Cross validation score
from sklearn.model_selection import cross_val_score
accuracy_rate1 = []
error_rate1 = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    score = cross_val_score(model,X_scaled_train,y_train,cv=10)
    accuracy_rate1.append(score.mean())
    error_rate1.append(1-score.mean())
# Plot K-value and error rate
plt.figure(figsize=(8,5))
plt.plot(range(1,11),error_rate1,color='blue',marker='o',linestyle='-.',label='CV error rate')
plt.plot(range(1,11),accuracy_rate1,color='red',marker='^',linestyle='-.',label='CV accuracy')
plt.plot(range(1,11),error_rate,color='green',marker='o',linestyle='-.',label='Test error rate')
plt.plot(range(1,11),accuracy_rate,color='cyan',marker='^',linestyle='-.',label='Test accuracy')
plt.title('Error & Accuracy Rate vs. K Value')
plt.xlabel('K-value')
plt.ylabel('Error & Accuracy Rate')
plt.legend()
plt.show()
Note: in practice, use either the accuracy or the error rate from just one of the two methods; all four curves are plotted here only for comparison.
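For example, the K with the lowest cross-validated error can be read off programmatically (a sketch; K = 4 below was chosen from the plot in the same spirit):
# Pick the K value with the lowest cross-validated error rate
best_k = int(np.argmin(error_rate1)) + 1   # +1 because the K values start at 1
print('Best K by CV error:', best_k)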
# Final model with best K value
model1 = KNeighborsClassifier(n_neighbors = 4)
model1.fit(X_scaled_train,y_train)
y_predict1 = model1.predict(X_scaled_test)
Model Evaluation:
from sklearn.metrics import accuracy_score,classification_report
accuracy_score(y_test,y_predict1)
Output: 0.782
print(classification_report(y_test,y_predict1))
Output:
              precision    recall  f1-score   support

           0       0.78      0.92      0.85       328
           1       0.78      0.51      0.62       172

    accuracy                           0.78       500
   macro avg       0.78      0.72      0.73       500
weighted avg       0.78      0.78      0.77       500
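Since the classes are imbalanced (328 vs. 172 in the test set) and KNN is sensitive to imbalance, a confusion matrix is a useful complement to the report above; a minimal sketch:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_predict1)
ConfusionMatrixDisplay(cm, display_labels=model1.classes_).plot()
plt.show()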
Python Implementation of KNN for Regression Problem:
# import libraries
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Creating dataset
inputs, target = make_regression(n_samples = 300,
n_features = 1,
noise = 15,
random_state = 365)
target = target/100 # not strictly necessary; just scales the target values down
# Creating training & testing data
x_train, x_test, y_train, y_test = train_test_split(inputs,
target,
test_size = 0.2,
random_state = 365)
# Create a list where all predictions from the KNN regressions will be stored
y_pred_knn = []
# Create KNN regression models with K = 1, 10, and 40
for i in [1, 10, 40]:
    # Create an instance of the KNN regressor with the specified number of neighbors
    reg_knn = KNeighborsRegressor(n_neighbors = i)
    # Fit the model to the training data
    reg_knn.fit(x_train, y_train)
    # Make predictions on the test data and store them in y_pred_knn
    y_pred_knn.append(reg_knn.predict(x_test))
# Selecting K value
mse_knn = []
for i in range(1, 41):
    reg_knn = KNeighborsRegressor(n_neighbors = i)
    reg_knn.fit(x_train, y_train)
    y_pred_knn = reg_knn.predict(x_test)
    mse_knn.append(mean_squared_error(y_test, y_pred_knn))
sns.set()
fig, ax = plt.subplots()
ax.plot(list(range(1, 41)),
mse_knn,
color = 'red',
marker = 'o',
markerfacecolor = '#000C1F',
label = 'KNN')
ax.legend(loc='lower right')
ax.set_title('Mean-Squared Error (MSE)')
ax.set_xlabel('K')
ax.set_ylabel('MSE')
#Final Model
reg_knn1 = KNeighborsRegressor(n_neighbors = 10)
reg_knn1.fit(x_train, y_train)
y_pred_knn1 = reg_knn1.predict(x_test)
# Inspect the 10 nearest training neighbors of a query point x = 0.5
neighbors = reg_knn1.kneighbors([[0.5]])
neighbors # returns the distances to the neighbors and their indices in x_train
Output: (array([[0.00343521, 0.01076282, 0.02350518, 0.02668283, 0.03571938, 0.0358432 , 0.03606684, 0.04745948, 0.05127815, 0.05897151]]), array([[203, 108, 113, 8, 10, 22, 133, 45, 26, 131]]))
# Evaluation
MSE = mean_squared_error(y_test, y_pred_knn1)
MSE
Output: 0.03106708890252071
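LinearRegression was imported above but not used. Since make_regression produces data that is linear by construction, a quick comparison on the same split (a sketch) ties back to the parametric vs. non-parametric note at the top:
# Fit a parametric baseline on the same training data
reg_lin = LinearRegression()
reg_lin.fit(x_train, y_train)
y_pred_lin = reg_lin.predict(x_test)
print('Linear regression MSE:', mean_squared_error(y_test, y_pred_lin))
print('KNN (K=10) MSE:', MSE)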