Support Vector Machine (SVM):
A supervised machine learning algorithm used for both classification and regression.
It tries to divide the data using hyperplanes and then makes predictions.
It is a non-probabilistic linear classifier.
While many classifiers predict the probability that a data point belongs to one group or another, SVM directly states which group the data point belongs to, without any probability calculation.
How it works for classification:
Hyperplane:
SVM constructs a best-fit decision boundary called a hyperplane.
It can be used for classification, regression, or outlier detection.
The dimension of the hyperplane depends on the number of features.
If the number of features is 2, the hyperplane is just a line.
If the number of features is 3, the hyperplane becomes a two-dimensional plane.
Marginal distance:
The hyperplane has two margin lines parallel to it, placed at some distance on either side, so that the data points can be distinctly classified.
The distance between the two margin lines is called the marginal distance.
Support Vectors:
These two margin lines pass through the nearest positive (+ve) points and the nearest negative (-ve) points.
The points through which the margin lines pass are called support vectors.
They determine the maximum width of the margin.
Mathematical Intuition:
In the image below, there are two groups of data.
To divide these points into two groups, any of the three lines shown could be used.
In fact, there are infinitely many straight lines that can divide these points into two classes.
So which line should we choose?
SVM solves this problem by using the maximum margin, as shown below.
The black line in the middle is the optimum classifier.
This line is drawn to maximize its distance from the nearest points of the two classes; in SVM terms it is called the hyperplane.
A hyperplane in n-dimensional space is an (n-1)-dimensional flat subspace that optimally divides the data.
Here the data is 2-D, so the hyperplane needs only one dimension; hence the hyperplane is a line.
The two points (highlighted with circles) lying on the yellow lines are called the support vectors.
Since this is a 2-D figure, they are points; in a higher-dimensional space they would be vectors.
As the name "support vector machine" suggests, the algorithm creates the optimum classification line by maximizing its distance from the support vectors.
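To make this concrete, here is a minimal sketch (separate from the loan example later in this section) using scikit-learn's SVC with a linear kernel on synthetic 2-D data; the dataset and variable names are illustrative.
# Sketch: maximum-margin line and support vectors on synthetic 2-D data
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable clusters in 2-D (illustrative data only)
X_toy, y_toy = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

svc_linear = SVC(kernel='linear', C=1.0)
svc_linear.fit(X_toy, y_toy)

print("Hyperplane coefficients (w):", svc_linear.coef_)    # normal vector of the separating line
print("Intercept (b):", svc_linear.intercept_)
print("Support vectors:\n", svc_linear.support_vectors_)   # the points the margin lines pass through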
For Higher Dimensions
When the data is not linearly separable, the SVM algorithm needs to perform computations in a higher-dimensional space in order to create a hyperplane that separates the data into different groups.
But introducing new dimensions makes the computations more intensive, which hurts the algorithm's performance.
To rectify this, mathematicians came up with kernel methods, which use kernel functions.
The unique feature of a kernel function is that it computes in a higher-dimensional space without calculating the new coordinates in that higher dimension.
It implicitly applies predefined mathematical functions to the existing points to mimic the computation in a higher-dimensional space; since the higher-dimensional coordinates are never actually calculated, there is no added cost for computing them or for measuring distances between newly computed points.
This is called the kernel trick.
In the left diagram below, we have a non-linear distribution of data, so we cannot classify the data using a linear equation.
To solve this problem, we can project the points into a 3-dimensional space and then derive a plane that divides the data into two parts.
In effect, that is what a kernel function does, without computing the additional coordinates for the higher dimension.
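As a rough sketch of the kernel trick with scikit-learn (assuming a synthetic, circularly distributed dataset, not the loan data used later), an RBF kernel separates data that a linear kernel cannot:
# Sketch: linear vs RBF kernel on data that is not linearly separable
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_c, y_c, random_state=0)

linear_svc = SVC(kernel='linear').fit(Xc_train, yc_train)            # restricted to the original 2-D space
rbf_svc = SVC(kernel='rbf', gamma='scale').fit(Xc_train, yc_train)   # kernel trick: implicit higher-dimensional mapping

print("Linear kernel accuracy:", linear_svc.score(Xc_test, yc_test))
print("RBF kernel accuracy:", rbf_svc.score(Xc_test, yc_test))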
How it works for Regression:
For regression, the goal is to determine the best-fit line.
In ordinary regression, the idea is to create a line that minimises the total residual error.
The SVR approach is a bit different.
Instead of trying to minimise the error, SVR focuses on keeping the error within a fixed range.
This approach can be explained using three lines.
The first line is the best-fit regression line, and the other two are the boundary lines that denote the range of error.
It means that only the points inside this ± error boundary are considered when preparing the model.
In other words, the best-fit line (or hyperplane) is the line that passes through the maximum number of data points, and the error boundaries are chosen to ensure maximum inclusion.
This error range can be customised using the epsilon parameter of the scikit-learn SVR implementation.
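A minimal sketch of this with scikit-learn's SVR, where epsilon sets the width of the error tube (synthetic data; the names are illustrative):
# Sketch: Support Vector Regression with an epsilon-insensitive error tube
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X_r = np.sort(5 * rng.rand(80, 1), axis=0)         # a single feature
y_r = np.sin(X_r).ravel() + 0.1 * rng.randn(80)    # noisy target

# Points lying inside the +/- epsilon tube contribute no error to the loss
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_r, y_r)
print("Number of support vectors:", len(svr.support_vectors_))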
Python Implementation for an SVM Classification Task (Support Vector Classifier)
Business Case: To find out, based on the given features, whether a loan will get approved or not.
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Load Dataset
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /loan_approved.csv')
data.head()
Output:
    Loan_ID  Gender Married Dependents     Education Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History Property_Area Loan_Status (Approved)
0  LP001002    Male      No          0      Graduate            No             5849                0.0         NaN             360.0             1.0         Urban                      Y
1  LP001003    Male     Yes          1      Graduate            No             4583             1508.0       128.0             360.0             1.0         Rural                      N
2  LP001005    Male     Yes          0      Graduate           Yes             3000                0.0        66.0             360.0             1.0         Urban                      Y
3  LP001006    Male     Yes          0  Not Graduate            No             2583             2358.0       120.0             360.0             1.0         Urban                      Y
4  LP001008    Male      No          0      Graduate            No             6000                0.0       141.0             360.0             1.0         Urban                      Y
Basic Checks
data.describe()
Output:
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count       614.000000         614.000000  592.000000         600.00000      564.000000
mean       5403.459283        1621.245798  146.412162         342.00000        0.842199
std        6109.041673        2926.248369   85.587325          65.12041        0.364878
min         150.000000           0.000000    9.000000          12.00000        0.000000
25%        2877.500000           0.000000  100.000000         360.00000        1.000000
50%        3812.500000        1188.500000  128.000000         360.00000        1.000000
75%        5795.000000        2297.250000  168.000000         360.00000        1.000000
max       81000.000000       41667.000000  700.000000         480.00000        1.000000
data.describe(include="O")
Output:
         Loan_ID Gender Married Dependents Education Self_Employed Property_Area Loan_Status (Approved)
count        614    601     611        599       614           582           614                    614
unique       614      2       2          4         2             2             3                      2
top     LP001002   Male     Yes          0  Graduate            No     Semiurban                      Y
freq           1    489     398        345       480           500           233                    422
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Loan_ID                 614 non-null    object
 1   Gender                  601 non-null    object
 2   Married                 611 non-null    object
 3   Dependents              599 non-null    object
 4   Education               614 non-null    object
 5   Self_Employed           582 non-null    object
 6   ApplicantIncome         614 non-null    int64
 7   CoapplicantIncome       614 non-null    float64
 8   LoanAmount              592 non-null    float64
 9   Loan_Amount_Term        600 non-null    float64
 10  Credit_History          564 non-null    float64
 11  Property_Area           614 non-null    object
 12  Loan_Status (Approved)  614 non-null    object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
Exploratory Data Analysis:
# Univariate Analysis by sweetviz
!pip install sweetviz
Output:
Collecting sweetviz
  Downloading sweetviz-2.3.1-py3-none-any.whl (15.1 MB)
Installing collected packages: sweetviz
Successfully installed sweetviz-2.3.1
import sweetviz as sv
univariate_report = sv.analyze(data) ## pass the original dataframe
univariate_report.show_html()
Findings from univariate analysis:
- The Loan_ID column can be dropped
- Missing values are present in Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term
- The Gender distribution is imbalanced
- Categorical data: Gender, Married, Dependents, Self_Employed
- Numerical data: ApplicantIncome, LoanAmount
!pip install autoviz
Output:
Collecting autoviz
  Downloading autoviz-0.1.905-py3-none-any.whl.metadata (14 kB)
Requirement already satisfied: xlrd in /usr/local/lib/python3.10/dist-packages (from autoviz) (2.0.1)
Requirement already satisfied: wordcloud in /usr/local/lib/python3.10/dist-packages (from autoviz) (1.9.3)
Collecting emoji (from autoviz)
  Downloading emoji-2.12.1-py3-none-any.whl.metadata (5.4 kB)
Collecting pyamg (from autoviz)
from autoviz import AutoViz_Class
AV = AutoViz_Class()
bivariate_report = AV.AutoViz('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /loan_approved.csv',verbose=1)
Output:
Shape of your Data Set loaded: (614, 13)
######################## C L A S S I F Y I N G  V A R I A B L E S ####################
Classifying variables in data set...
    Number of Numeric Columns = 3
    Number of Integer-Categorical Columns = 1
    Number of String-Categorical Columns = 2
    Number of Factor-Categorical Columns = 0
    Number of String-Boolean Columns = 5
    Number of Numeric-Boolean Columns = 1
    Number of Discrete String Columns = 0
    Number of NLP String Columns = 0
    Number of Date Time Columns = 0
    Number of ID Columns = 1
    Number of Columns to Delete = 0
    13 Predictors classified...
    1 variable(s) removed since they were ID or low-information variables
    List of variables removed: ['Loan_ID']
To fix these data quality issues in the dataset, import FixDQ from autoviz...
All variables classified into correct types.
Number of All Scatter Plots = 6
All Plots done
Time to run AutoViz = 5 seconds
###################### AUTO VISUALIZATION Completed ########################
dataC=data[['Gender','Married','Dependents','Education','Self_Employed','Property_Area']]
dataN=data[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']]
# For categorical data
plt.figure(figsize=(12,10), facecolor='white')  # set up the canvas
plotnumber = 1  # subplot counter
for column in dataC:  # loop over the categorical columns
    ax = plt.subplot(3,3,plotnumber)
    sns.countplot(x=dataC[column], hue=data['Loan_Status (Approved)'])
    plt.xlabel(column, fontsize=10)   # label the x-axis with the column name
    plt.ylabel('Count', fontsize=10)
    plotnumber += 1  # increment the counter
plt.tight_layout()
# For numerical data
plt.figure(figsize=(12,10), facecolor='white')  # set up the canvas
plotnumber = 1  # subplot counter
for column in dataN:  # loop over the numerical columns
    if plotnumber <= 16:
        ax = plt.subplot(4,4,plotnumber)
        sns.histplot(x=dataN[column], hue=data['Loan_Status (Approved)'])
        plt.xlabel(column, fontsize=10)   # label the x-axis with the column name
        plt.ylabel('Count', fontsize=10)
    plotnumber += 1  # increment the counter
plt.tight_layout()
Data Preprocessing:
Handling Missing Value
## Checking missing values
data.isnull().sum()
print(data.isnull().sum())
Output:
Loan_ID                    0
Gender                    13
Married                    3
Dependents                15
Education                  0
Self_Employed             32
ApplicantIncome            0
CoapplicantIncome          0
LoanAmount                22
Loan_Amount_Term          14
Credit_History            50
Property_Area              0
Loan_Status (Approved)     0
dtype: int64
Findings:
Categorical: Gender, Married, Dependents, Self_Employed need to be imputed with the mode.
Numerical: LoanAmount, Loan_Amount_Term, Credit_History need to be imputed with the median, since their distributions are skewed.
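The cells below impute each column individually. As a compact alternative sketch using pandas fillna (equivalent for these particular columns; note that the notebook deliberately uses '3+' for Dependents and 0.0 for Credit_History rather than the mode/median, so those columns are left out here):
# Alternative sketch: mode for categorical columns, median for numeric ones
for col in ['Gender', 'Married', 'Self_Employed']:
    data[col] = data[col].fillna(data[col].mode()[0])
for col in ['LoanAmount', 'Loan_Amount_Term']:
    data[col] = data[col].fillna(data[col].median())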
data["Gender"].mode()
Output:
0    Male
Name: Gender, dtype: object
## Imputing the missing values with mode
data.loc[data['Gender'].isnull(),'Gender']='Male'
sns.countplot(x='Dependents',data=data,hue='Loan_Status (Approved)')
From the plot it can be seen that as the number of dependents increases, the chance of approval decreases. Since these values are missing, approving loans for these applicants could carry a high chance of major loss, so we substitute the missing values with '3+'.
data.Dependents.value_counts()
Output:
Dependents
0     345
1     102
2     101
3+     51
Name: count, dtype: int64
## Imputing the missing values with mode
data.loc[data['Dependents'].isnull(),'Dependents']='3+'
## getting the counts
data.Married.mode()
print(data.Married.mode())
Output:
0    Yes
Name: Married, dtype: object
## Imputing with yes i.e mode
data.loc[data['Married'].isnull()==True,'Married']='Yes'
## getting the counts
data.Self_Employed.mode()
print(data.Self_Employed.mode())
Output:
0    No
Name: Self_Employed, dtype: object
# Replace the nan values with mode
data.loc[data['Self_Employed'].isnull()==True,'Self_Employed']='No'
## Histogram since it has numerical value
data["LoanAmount"].hist()
plt.show()
Since the data is skewed, we use the median to replace the NaN values. The mean is recommended only for symmetric data distributions.
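The skew can also be checked numerically (a quick sketch; a value well above 0 indicates a right-skewed distribution):
# Quick numeric check of the skew of LoanAmount
print(data["LoanAmount"].skew())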
data["LoanAmount"].median()
Output: 128.0
# Replace the nan values in LoanAmount column with median value
data.loc[data['LoanAmount'].isnull()==True,'LoanAmount']=data["LoanAmount"].median()
data.Loan_Amount_Term.hist()
# replace the nan values in Loan_Amount_Term with the median value
data.loc[data['Loan_Amount_Term'].isnull()==True,'Loan_Amount_Term']=np.median(data.Loan_Amount_Term.dropna(axis=0))
# Credit_History
data.Credit_History.value_counts()
print(data.Credit_History.value_counts())
Output:
Credit_History
1.0    475
0.0     89
Name: count, dtype: int64
# Although the mode is 1.0, these values are missing; if any of them were actually 0, imputing with 1
# could lead to wrong predictions and loans being approved for the wrong people, so we impute with 0.0
data.loc[data['Credit_History'].isnull()==True,'Credit_History']=0.0
data.isnull().sum()
Output:
Loan_ID                   0
Gender                    0
Married                   0
Dependents                0
Education                 0
Self_Employed             0
ApplicantIncome           0
CoapplicantIncome         0
LoanAmount                0
Loan_Amount_Term          0
Credit_History            0
Property_Area             0
Loan_Status (Approved)    0
dtype: int64
Handling Categorical Data
data.info()
Categorical Data:
- Gender ,Married,Dependents,Education,Self_Employed,Property_Area,Loan_Status (Approved)
## Using LabelEncoder to convert the categorical data to numerical data
## Note: this cell demonstrates label encoding; the encoded columns are one-hot encoded below, since the categories have no ordinal relationship
from sklearn.preprocessing import LabelEncoder
lc=LabelEncoder()
data.Gender=lc.fit_transform(data.Gender)
data.Married=lc.fit_transform(data.Married)
data.Education=lc.fit_transform(data.Education)
data.Property_Area=lc.fit_transform(data.Property_Area)
data['Loan_Status (Approved)']=lc.fit_transform(data['Loan_Status (Approved)'])
data.Dependents=lc.fit_transform(data.Dependents)
data.Self_Employed=lc.fit_transform(data.Self_Employed)
## One hot encoding
df1=pd.get_dummies(data['Gender'],prefix='Gender',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Gender'],axis=1)
df1=pd.get_dummies(data['Married'],prefix='Married',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Married'],axis=1)
df1=pd.get_dummies(data['Education'],prefix='Education',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Education'],axis=1)
df1=pd.get_dummies(data['Property_Area'],prefix='Property_Area',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Property_Area'],axis=1)
df1=pd.get_dummies(data['Dependents'],prefix='Dependents',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Dependents'],axis=1)
df1=pd.get_dummies(data['Self_Employed'],prefix='Self_Employed',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Self_Employed'],axis=1)
Scaling down data
from sklearn.preprocessing import MinMaxScaler
scale=MinMaxScaler()
data[['ApplicantIncome','CoapplicantIncome','LoanAmount']]=scale.fit_transform(data[['ApplicantIncome','CoapplicantIncome','LoanAmount']])
Checking for duplicate rows
data.duplicated().sum()
Output: 0
Saving the preprocessed data
data.to_csv('Preprocessed_data.csv')
## Loading the data
preprcessed_data=pd.read_csv('Preprocessed_data.csv')
preprcessed_data.head()
Output:
   Unnamed: 0   Loan_ID  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Loan_Status (Approved)  Gender_1  Married_1  Education_1  Property_Area_1  Property_Area_2  Dependents_1  Dependents_2  Dependents_3  Self_Employed_1
0           0  LP001002         0.070489           0.000000    0.172214             360.0             1.0                       1      True      False        False            False             True         False         False         False            False
1           1  LP001003         0.054830           0.036192    0.172214             360.0             1.0                       0      True       True        False            False            False          True         False         False            False
2           2  LP001005         0.035250           0.000000    0.082489             360.0             1.0                       1      True       True        False            False             True         False         False         False             True
3           3  LP001006         0.030093           0.056592    0.160637             360.0             1.0                       1      True       True         True            False             True         False         False         False            False
4           4  LP001008         0.072356           0.000000    0.191027             360.0             1.0                       1      True      False        False            False             True         False         False         False            False
Feature Selection
# Removing redundant columns
# We can drop Loan_ID and the extra index column ('Unnamed: 0')
l1=['Unnamed: 0','Loan_ID']
preprcessed_data.drop(l1,axis=1,inplace=True)
## checking correlation
corr_data=preprcessed_data[['ApplicantIncome','CoapplicantIncome','LoanAmount']]
corr_data.describe() ## no constant features
Output:
       ApplicantIncome  CoapplicantIncome  LoanAmount
count       614.000000         614.000000  614.000000
mean          0.064978           0.038910    0.197905
std           0.075560           0.070229    0.121718
min           0.000000           0.000000    0.000000
25%           0.033735           0.000000    0.132055
50%           0.045300           0.028524    0.172214
75%           0.069821           0.055134    0.225398
max           1.000000           1.000000    1.000000
Model Creation:
preprcessed_data.keys()
Output: Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Loan_Status (Approved)', 'Gender_1', 'Married_1', 'Education_1', 'Property_Area_1', 'Property_Area_2', 'Dependents_1', 'Dependents_2', 'Dependents_3', 'Self_Employed_1'], dtype='object')
# Creating independent & dependent variable
X = preprcessed_data.loc[:,['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term', 'Credit_History',
'Gender_1', 'Married_1', 'Education_1',
'Property_Area_1', 'Property_Area_2', 'Dependents_1', 'Dependents_2',
'Dependents_3', 'Self_Employed_1']]
y = preprcessed_data['Loan_Status (Approved)']
# Creating Training & Testing Data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 24)
Balancing data
# Install imblearn package - pip install imblearn
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X_train,y_train)
from collections import Counter
print("Actual Classes",Counter(y_train))
print("SMOTE Classes",Counter(y_smote))
Actual Classes Counter({1: 309, 0: 151})
SMOTE Classes Counter({1: 309, 0: 309})
Note: Counter is a container that keeps track of how many times equivalent values are added. The Python Counter class is part of the collections module and is a subclass of dict.
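For example:
from collections import Counter
print(Counter(['Y', 'N', 'Y', 'Y']))   # Counter({'Y': 3, 'N': 1})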
# Creating Model
from sklearn.svm import SVC
model = SVC()
model.fit(X_smote,y_smote)
y_predict = model.predict(X_test)
Evaluation
from sklearn.metrics import accuracy_score,recall_score,precision_score,classification_report, f1_score
accuracy_score(y_test,y_predict)
Output: 0.7272727272727273
print(classification_report(y_test,y_predict))
Output:
              precision    recall  f1-score   support

           0       0.33      0.02      0.05        41
           1       0.74      0.98      0.84       113

    accuracy                           0.73       154
   macro avg       0.53      0.50      0.44       154
weighted avg       0.63      0.73      0.63       154
pd.crosstab(y_test,y_predict)
print(pd.crosstab(y_test,y_predict))
Output:
col_0                   0    1
Loan_Status (Approved)
0                       1   40
1                       2  111
f1_score(y_test,y_predict)
Output: 0.8409090909090909
Cross Validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model,X,y,cv=3,scoring='f1')
print("Cross validation Score:",scores.mean())
print("Std :",scores.std())
#std of < 0.05 is good.
Output:
Cross validation Score: 0.8146704306134337
Std : 0.0005069547205710685
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Defining Parameter range
param_grid = {'C' : [0.1,5,10,50,60,70],
'gamma' : [1,0.1,0.01,0.001,0.0001],
'random_state':(list(range(1,20)))}
model2 = SVC()
grid = GridSearchCV(model2,param_grid,refit = True, verbose = 2,scoring = 'f1',cv=5)
grid.fit(X,y)
grid.best_params_
Output: {'C': 5, 'gamma': 0.1, 'random_state': 1}
# Creating Model
from sklearn.svm import SVC
model3 = SVC(C=5, gamma=0.1,random_state=1)
model3.fit(X_smote,y_smote)
y_predict3 = model3.predict(X_test)   # predict with the tuned model
print(classification_report(y_test,y_predict3))
Output:
              precision    recall  f1-score   support

           0       0.33      0.02      0.05        41
           1       0.74      0.98      0.84       113

    accuracy                           0.73       154
   macro avg       0.53      0.50      0.44       154
weighted avg       0.63      0.73      0.63       154
print(pd.crosstab(y_test,y_predict3))
Output:
col_0                   0    1
Loan_Status (Approved)
0                       1   40
1                       2  111
print(f1_score(y_test,y_predict3))
Output: 0.8409090909090909
from sklearn.model_selection import cross_val_score
scores2 = cross_val_score(model3,X,y,cv=3,scoring='f1')
print("Cross validation Score:",scores2.mean())
print("Std :",scores2.std())
#std of < 0.05 is good.
Output:
Cross validation Score: 0.8483770612898426
Std : 0.009742740208838268