Support Vector Machine (SVM):

  • Supervised Machine Learning algorithm used for both classification and regression.

  • It divides the data using hyperplanes and then predicts based on which side of the hyperplane a data point falls.

  • It is a non-probabilistic linear classifier.

  • While many other classifiers predict the probability of a data point belonging to one group or another, SVM directly states which group the data point belongs to, without any probability calculation, as the short sketch below illustrates.
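A minimal sketch of this non-probabilistic behaviour using scikit-learn's SVC (the toy data here is my own illustrative example, not from the dataset below):

from sklearn.svm import SVC

# Two small, well-separated groups
X = [[0, 0], [1, 1], [2, 0], [8, 8], [9, 9], [8, 10]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear').fit(X, y)
print(clf.predict([[1, 0], [9, 8]]))            # hard class labels: [0 1]
print(clf.decision_function([[1, 0], [9, 8]]))  # signed distances to the hyperplane, not probabilities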

How it works for classification:

Hyperplane:

  • SVM constructs an optimal decision boundary, called a hyperplane, that best separates the classes.

  • It can be used for classification or regression or outlier detection.

  • The dimension of the hyperplane depends upon the number of features.

  • If the number of features is 2, then the hyperplane is just a line.

  • If the number of features is 3, then the hyperplane becomes a two-dimensional plane.

Marginal distance:

  • Two margin lines are constructed parallel to the hyperplane, at some distance on either side, so that the data points are distinctly separated.

  • The distance between the 2 margin lines is called the marginal distance.

Support Vectors:

  • These 2 margin lines pass through the nearest positive (+ve) points and the nearest negative (−ve) points.

  • Those points through which the margin lines pass are called support vectors.

  • The support vectors thus determine the maximum width of the marginal plane.

Mathematical Intuition:

  • In the image below, there are two groups of data.

  • Three candidate lines are drawn that divide these points into two groups.

  • In fact, an infinite number of straight lines can divide these points into two classes.

    Now which line to choose?

    SVM solves this problem using the maximum margin, as shown below.

 

  • The Black Line in the middle is the optimum classifier.

  • This line is drawn to maximize the distance of the classifier from the nearest points of the two classes. In SVM terms, it is called the hyperplane.

  • For data in n dimensions, the hyperplane is an (n−1)-dimensional flat surface that optimally divides the data.

  • Here, since we have 2-D data, the hyperplane needs only one dimension; hence, the hyperplane is a line.

  • The two points (highlighted with circles) that lie on the yellow margin lines are called the support vectors.

  • As this is a 2-D figure, they are points; in higher-dimensional space, they will be vectors.

  • As the name "support vector machine" suggests, the algorithm creates the optimal classification line by maximizing its distance from the support vectors; the optimization below formalizes this.
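As a brief formalization (this is the standard hard-margin formulation, given here as background): for labels $y_i \in \{-1, +1\}$ and a hyperplane $w \cdot x + b = 0$, SVM solves

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{ for all } i$$

The resulting margin width is $2/\lVert w \rVert$, so minimizing $\lVert w \rVert$ is exactly what maximizes the margin.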

For Higher Dimension

  • When the data is not linearly separable, the SVM algorithm needs to perform computations in a higher-dimensional space in order to construct a hyperplane that separates the data into different groups.

  • But the introduction of new dimensions makes the computations for the SVMs more intensive, which impacts the algorithm performance.

  • To rectify this, mathematicians came up with the approach of Kernel methods.

  • It uses Kernel functions.

  • The unique feature of a kernel function is that it computes values in a higher-dimensional space without ever calculating the new coordinates in that higher dimension.

  • It implicitly uses predefined mathematical functions that operate on the existing points and mimic the computation in a higher-dimensional space. Because the higher-dimensional coordinates are never actually calculated, the cost of computing distances between newly projected points is avoided.

  • This is called the kernel trick.

  • In the left diagram below, the data has a non-linear distribution: it cannot be classified using a linear equation.

  • To solve this problem, we can project the points in a 3-dimensional space and then derive a plane that divides the data into two parts.

  • In theory, that’s what a kernel function does without computing the additional coordinates for the higher dimension.
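A small sketch of the kernel trick (the feature map phi and the toy vectors are my own illustrative example): for 2-D points, the degree-2 polynomial kernel K(x, z) = (x·z)² equals the dot product of an explicit 3-D feature map, so the 3-D coordinates never have to be computed.

import numpy as np

def phi(x):
    # Explicit 3-D feature map: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

def poly_kernel(x, z):
    # Kernel trick: same value, computed entirely in the original 2-D space
    return np.dot(x, z)**2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))  # 16.0, via the explicit 3-D coordinates
print(poly_kernel(x, z))       # 16.0, without ever computing them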

How it works for Regression:

  • For regression, the task is to determine the best-fit line.

  • In ordinary regression, the idea is to create a line which minimises the total residual error.

  • The SVR approach is a bit different.

  • Instead of trying to minimise the error, SVR focuses on keeping the error in a fixed range.

  • This approach can be explained using three lines.

  • The first line is the best fit regressor line, and the other two lines are the bordering lines which denote the range of error.

  • It means that we are going to consider the points inside this ± error boundary only for preparing our model.

  • In other words, the best-fit line (or the hyperplane) will be the line which goes through the maximum number of data points, and the error boundaries are chosen to ensure maximum inclusion.

  • This error term can be customized using the epsilon parameter defined for the scikit-learn SVR implementation.
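A minimal sketch with scikit-learn's SVR on synthetic data (the data and parameter values are illustrative assumptions, not from the case study below):

import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression data (illustrative only)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# epsilon sets the half-width of the error tube: points with
# |prediction - target| <= epsilon contribute no loss
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict(X[:5]))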

Python Implementation for SVM Classification task (support vector classifier)

Business Case: To find out, based on the given features, whether a loan will be approved or not.

# importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Load Dataset

data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /loan_approved.csv')

data.head()
Output:
  Loan_ID  Gender Married Dependents   Education   Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History Property_Area Loan_Status (Approved)
0  LP001002  Male      No       0          Graduate        No           5849                0.0             NaN          360.0             1.0           Urban               Y          
1  LP001003  Male     Yes       1          Graduate        No           4583             1508.0           128.0          360.0             1.0           Rural               N          
2  LP001005  Male     Yes       0          Graduate       Yes           3000                0.0            66.0          360.0             1.0           Urban               Y          
3  LP001006  Male     Yes       0      Not Graduate        No           2583             2358.0           120.0          360.0             1.0           Urban               Y          
4  LP001008  Male      No       0          Graduate        No           6000                0.0           141.0          360.0             1.0           Urban               Y          

Basic Checks

data.describe()
Output:
     
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count     614.000000        614.000000     592.000000      600.00000       564.000000  
mean     5403.459283       1621.245798     146.412162      342.00000         0.842199  
std      6109.041673       2926.248369      85.587325       65.12041         0.364878  
min       150.000000          0.000000       9.000000       12.00000         0.000000  
25%      2877.500000          0.000000     100.000000      360.00000         1.000000  
50%      3812.500000       1188.500000     128.000000      360.00000         1.000000  
75%      5795.000000       2297.250000     168.000000      360.00000         1.000000  
max     81000.000000      41667.000000     700.000000      480.00000         1.000000  
data.describe(include="O")
Output:
      Loan_ID  Gender Married Dependents Education Self_Employed Property_Area Loan_Status (Approved)
count        614   601     611      599          614       582              614            614         
unique       614     2       2        4            2         2                3              2         
top     LP001002  Male     Yes        0     Graduate        No        Semiurban              Y         
freq           1   489     398      345          480       500              233            422         
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Loan_ID                 614 non-null    object 
 1   Gender                  601 non-null    object 
 2   Married                 611 non-null    object 
 3   Dependents              599 non-null    object 
 4   Education               614 non-null    object 
 5   Self_Employed           582 non-null    object 
 6   ApplicantIncome         614 non-null    int64  
 7   CoapplicantIncome       614 non-null    float64
 8   LoanAmount              592 non-null    float64
 9   Loan_Amount_Term        600 non-null    float64
 10  Credit_History          564 non-null    float64
 11  Property_Area           614 non-null    object 
 12  Loan_Status (Approved)  614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

Exploratory Data Analysis:

# Univariate Analysis by sweetviz

!pip install sweetviz
Output:
Collecting sweetviz
  Downloading sweetviz-2.3.1-py3-none-any.whl.metadata (24 kB)

Downloading sweetviz-2.3.1-py3-none-any.whl (15.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.1/15.1 MB 82.1 MB/s eta 0:00:00
Installing collected packages: sweetviz
Successfully installed sweetviz-2.3.1
import sweetviz as sv

univariate_report = sv.analyze(data) ## pass the original dataframe

univariate_report.show_html()

Findings from univariate Analysis:

  • Loan_ID column can be dropped
  • Missing values present in Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History
  • Imbalanced data set for Gender
  • Categorical data: Gender, Married, Dependents, Self_Employed
  • Numerical data: ApplicantIncome, CoapplicantIncome, LoanAmount
# Bivariate Analysis by Autoviz
!pip install autoviz
Output:
Collecting autoviz
  Downloading autoviz-0.1.905-py3-none-any.whl.metadata (14 kB)
Requirement already satisfied: xlrd in /usr/local/lib/python3.10/dist-packages (from autoviz) (2.0.1)
Requirement already satisfied: wordcloud in /usr/local/lib/python3.10/dist-packages (from autoviz) (1.9.3)
Collecting emoji (from autoviz)
  Downloading emoji-2.12.1-py3-none-any.whl.metadata (5.4 kB)
Collecting pyamg (from autoviz)
from autoviz import AutoViz_Class
AV = AutoViz_Class()

bivariate_report = AV.AutoViz('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /loan_approved.csv',verbose=1)
Output:
Shape of your Data Set loaded: (614, 13)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
    Number of Numeric Columns =  3
    Number of Integer-Categorical Columns =  1
    Number of String-Categorical Columns =  2
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  5
    Number of Numeric-Boolean Columns =  1
    Number of Discrete String Columns =  0
    Number of NLP String Columns =  0
    Number of Date Time Columns =  0
    Number of ID Columns =  1
    Number of Columns to Delete =  0
    13 Predictors classified...
        1 variable(s) removed since they were ID or low-information variables
        List of variables removed: ['Loan_ID']
To fix these data quality issues in the dataset, import FixDQ from autoviz...
    All variables classified into correct types.

Number of All Scatter Plots = 6
All Plots done
Time to run AutoViz = 5 seconds
###################### AUTO VISUALIZATION Completed ########################

# Bivariate analysis manual process
dataC=data[['Gender','Married','Dependents','Education','Self_Employed','Property_Area']]
dataN=data[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']]
# For Categorical data

plt.figure(figsize=(12,10), facecolor='white')#To set canvas
plotnumber = 1#counter

for column in dataC:#accessing the columns
    ax = plt.subplot(3,3,plotnumber)
    sns.countplot(x=dataC[column],hue=data['Loan_Status (Approved)'])
    plt.xlabel(column,fontsize=10)#assign name to x-axis and set font size to 10
    plt.ylabel('Loan Status',fontsize=10)
    plotnumber+=1#counter increment
plt.tight_layout()

#For Numerical data

plt.figure(figsize=(12,10), facecolor='white')#To set canvas
plotnumber = 1#counter

for column in dataN:#accessing the columns
    if plotnumber<=16 :
        ax = plt.subplot(4,4,plotnumber)
        sns.histplot(x=dataN[column],hue=data['Loan_Status (Approved)'])
        plt.xlabel(column,fontsize=10)#assign name to x-axis and set font size to 10
        plt.ylabel('Loan Status',fontsize=10)
    plotnumber+=1#counter increment
plt.tight_layout()

Data Preprocessing:

Handling Missing Value

## Checking missing values
data.isnull().sum()

print(data.isnull().sum())
Output:
Loan_ID                    0
Gender                    13
Married                    3
Dependents                15
Education                  0
Self_Employed             32
ApplicantIncome            0
CoapplicantIncome          0
LoanAmount                22
Loan_Amount_Term          14
Credit_History            50
Property_Area              0
Loan_Status (Approved)     0
dtype: int64

Findings:

  1. Categorical: Gender, Married, Dependents, Self_Employed. Impute with the mode.

  2. Numerical: LoanAmount, Loan_Amount_Term, Credit_History. Impute with the median, as the distributions are skewed.

data["Gender"].mode()
Output:
0    Male
Name: Gender, dtype: object
## Imputing the missing values with mode
data.loc[data['Gender'].isnull(),'Gender']='Male'
sns.countplot(x='Dependents',data=data,hue='Loan_Status (Approved)')

From the graphical representation it can be seen that as the number of dependents increases, the chance of approval decreases. Since these values are missing, approving a loan for a wrongly imputed applicant could (with high chance) turn into a major loss. Hence we take the conservative route and substitute the missing values with '3+'.

data.Dependents.value_counts()
Output:
Dependents
0     345
1     102
2     101
3+     51
Name: count, dtype: int64
## Imputing the missing values with '3+' (conservative choice explained above)
data.loc[data['Dependents'].isnull(),'Dependents']='3+'
## getting the counts
data.Married.mode()

print(data.Married.mode())
Output:
0    Yes
Name: Married, dtype: object
## Imputing with 'Yes', i.e. the mode
data.loc[data['Married'].isnull()==True,'Married']='Yes'
## getting the counts
data.Self_Employed.mode()

print(data.Self_Employed.mode())
Output:
0    No
Name: Self_Employed, dtype: object
# Replace the nan values with mode
data.loc[data['Self_Employed'].isnull()==True,'Self_Employed']='No'
## Histogram since it has numerical value
data["LoanAmount"].hist()
plt.show()

Since the data is skewed, we use the median to replace the NaN values. The mean is recommended only for symmetric data distributions.

data["LoanAmount"].median()
Output:
128.0
# Replace the nan values in LoanAmount column with median value
data.loc[data['LoanAmount'].isnull()==True,'LoanAmount']=data["LoanAmount"].median()
data.Loan_Amount_Term.hist()

# replace the nan values in Loan_Amount_Term with the median value
data.loc[data['Loan_Amount_Term'].isnull()==True,'Loan_Amount_Term']=np.median(data.Loan_Amount_Term.dropna(axis=0))
# Credit_History
data.Credit_History.value_counts()

print(data.Credit_History.value_counts())
Output:
Credit_History
1.0    475
0.0     89
Name: count, dtype: int64
# Although the mode is 1, these values are missing: if any of them were actually 0, the prediction would be wrong and loans might be approved for the wrong people. So we conservatively impute 0.

data.loc[data['Credit_History'].isnull()==True,'Credit_History']=0.0
data.isnull().sum()
Output:
Loan_ID                   0
Gender                    0
Married                   0
Dependents                0
Education                 0
Self_Employed             0
ApplicantIncome           0
CoapplicantIncome         0
LoanAmount                0
Loan_Amount_Term          0
Credit_History            0
Property_Area             0
Loan_Status (Approved)    0
dtype: int64
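As an aside, the same mode/median imputation can be written more compactly with scikit-learn's SimpleImputer (a sketch; the domain-driven choices made above for Dependents and Credit_History would still be applied manually):

from sklearn.impute import SimpleImputer

# Mode for categorical columns, median for the skewed numeric columns
cat_cols = ['Gender', 'Married', 'Self_Employed']
num_cols = ['LoanAmount', 'Loan_Amount_Term']
data[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(data[cat_cols])
data[num_cols] = SimpleImputer(strategy='median').fit_transform(data[num_cols])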

Handling Categorical Data

data.info()

Categorical Data:

  • Gender ,Married,Dependents,Education,Self_Employed,Property_Area,Loan_Status (Approved)
## Using label encoder to convert the categorical data to numerical data
## Do not run this code. It is shown only to illustrate LabelEncoder; label encoding would impose an artificial ordinal relationship on these categories, so one-hot encoding is used instead below.
from sklearn.preprocessing import LabelEncoder
lc=LabelEncoder()

data.Gender=lc.fit_transform(data.Gender)
data.Married=lc.fit_transform(data.Married)
data.Education=lc.fit_transform(data.Education)
data.Property_Area=lc.fit_transform(data.Property_Area)
data['Loan_Status (Approved)']=lc.fit_transform(data['Loan_Status (Approved)'])
data.Dependents=lc.fit_transform(data.Dependents)
data.Self_Employed=lc.fit_transform(data.Self_Employed)
## One hot encoding
df1=pd.get_dummies(data['Gender'],prefix='Gender',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Gender'],axis=1)

df1=pd.get_dummies(data['Married'],prefix='Married',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Married'],axis=1)

df1=pd.get_dummies(data['Education'],prefix='Education',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Education'],axis=1)

df1=pd.get_dummies(data['Property_Area'],prefix='Property_Area',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Property_Area'],axis=1)

df1=pd.get_dummies(data['Dependents'],prefix='Dependents',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Dependents'],axis=1)

df1=pd.get_dummies(data['Self_Employed'],prefix='Self_Employed',drop_first=True)
data=pd.concat([data,df1],axis=1).drop(['Self_Employed'],axis=1)
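The six get_dummies/concat/drop steps above can be collapsed into a single call, since pd.get_dummies accepts a columns list (an equivalent, more idiomatic form):

data = pd.get_dummies(
    data,
    columns=['Gender', 'Married', 'Education', 'Property_Area',
             'Dependents', 'Self_Employed'],
    drop_first=True)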

Scaling down data

from sklearn.preprocessing import MinMaxScaler
scale=MinMaxScaler()
data[['ApplicantIncome','CoapplicantIncome','LoanAmount']]=scale.fit_transform(
    data[['ApplicantIncome','CoapplicantIncome','LoanAmount']])
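One caveat: the scaler above is fit on the full dataset before the train/test split, which leaks test-set information into training. A leakage-free variant (a sketch, assuming the X_train/X_test split further below has already been made) fits on the training split only:

num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']
scale = MinMaxScaler()
X_train[num_cols] = scale.fit_transform(X_train[num_cols])  # fit on train only
X_test[num_cols] = scale.transform(X_test[num_cols])        # reuse train statistics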

Checking the duplicate rows

data.duplicated().sum()
Output:
0

Saving the preprocessed data

data.to_csv('Preprocessed_data.csv')
## Loading the data
preprcessed_data=pd.read_csv('Preprocessed_data.csv')
preprcessed_data.head()
Output:
  Unnamed: 0  Loan_ID   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Loan_Status (Approved)  Gender_1  Married_1  Education_1  Property_Area_1  Property_Area_2  Dependents_1  Dependents_2  Dependents_3  Self_Employed_1
0       0      LP001002     0.070489          0.000000        0.172214         360.0             1.0                  1              True      False       False          False             True           False         False         False          False     
1       1      LP001003     0.054830          0.036192        0.172214         360.0             1.0                  0              True       True       False          False            False            True         False         False          False     
2       2      LP001005     0.035250          0.000000        0.082489         360.0             1.0                  1              True       True       False          False             True           False         False         False           True     
3       3      LP001006     0.030093          0.056592        0.160637         360.0             1.0                  1              True       True        True          False             True           False         False         False          False     
4       4      LP001008     0.072356          0.000000        0.191027         360.0             1.0                  1              True      False       False          False             True           False         False         False          False    

Feature Selection

# Removing redundant columns
#We can drop loan id.
l1=['Unnamed: 0','Loan_ID']
preprcessed_data.drop(l1,axis=1,inplace=True)
## checking correlation
corr_data=preprcessed_data[['ApplicantIncome','CoapplicantIncome','LoanAmount']]
corr_data.corr()
## The correlations are weak: no strong relationship among the numerical features
corr_data.describe() ## no constant features
Output:
     ApplicantIncome  CoapplicantIncome  LoanAmount
count    614.000000        614.000000      614.000000
mean       0.064978          0.038910        0.197905
std        0.075560          0.070229        0.121718
min        0.000000          0.000000        0.000000
25%        0.033735          0.000000        0.132055
50%        0.045300          0.028524        0.172214
75%        0.069821          0.055134        0.225398
max        1.000000          1.000000        1.000000

Model Creation:

preprcessed_data.keys()
Output:
Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Loan_Status (Approved)',
       'Gender_1', 'Married_1', 'Education_1', 'Property_Area_1',
       'Property_Area_2', 'Dependents_1', 'Dependents_2', 'Dependents_3',
       'Self_Employed_1'],
      dtype='object')
# Creating independent & dependent variable

X = preprcessed_data.loc[:,['ApplicantIncome', 'CoapplicantIncome',
       'LoanAmount', 'Loan_Amount_Term', 'Credit_History',
       'Gender_1', 'Married_1', 'Education_1',
       'Property_Area_1', 'Property_Area_2', 'Dependents_1', 'Dependents_2',
       'Dependents_3', 'Self_Employed_1']]

y = preprcessed_data['Loan_Status (Approved)']
# Creating Training & Testing Data

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 24)
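Since the target classes are imbalanced (see the SMOTE step below), passing stratify=y is a reasonable refinement so both splits keep the same class ratio (a suggested variant, not what the notebook above does):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=24, stratify=y)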

Balancing data

# Install imblearn package - pip install imblearn
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X_train,y_train)

from collections import Counter
print("Actual Classes",Counter(y_train))
print("SMOTE Classes",Counter(y_smote))
Output:
Actual Classes Counter({1: 309, 0: 151})
SMOTE Classes Counter({1: 309, 0: 309})

Note: Counter is a container that keeps track of how many times equivalent values are added. The Python Counter class is part of the collections module and is a subclass of dictionary.

# Creating Model

from sklearn.svm import SVC
model = SVC()
model.fit(X_smote,y_smote)

y_predict = model.predict(X_test)

Evaluation

from sklearn.metrics import accuracy_score,recall_score,precision_score,classification_report, f1_score

accuracy_score(y_test,y_predict)
Output:
0.7272727272727273
print(classification_report(y_test,y_predict))
Output:
              precision    recall  f1-score   support

           0       0.33      0.02      0.05        41
           1       0.74      0.98      0.84       113

    accuracy                           0.73       154
   macro avg       0.53      0.50      0.44       154
weighted avg       0.63      0.73      0.63       154
pd.crosstab(y_test,y_predict)

print(pd.crosstab(y_test,y_predict))
Output:
col_0                   0   1 
Loan_Status (Approved)        
0                       1   40
1                       2  111
f1_score(y_test,y_predict)
Output:
0.8409090909090909

Cross Validation:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model,X,y,cv=3,scoring='f1')

print("Cross validation Score:",scores.mean())
print("Std :",scores.std())
#std of < 0.05 is good.
Output:
Cross validation Score: 0.8146704306134337
Std : 0.0005069547205710685

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Defining Parameter range
param_grid = {'C' : [0.1,5,10,50,60,70],
              'gamma' : [1,0.1,0.01,0.001,0.0001],
              'random_state':(list(range(1,20)))}

model2  = SVC()
grid = GridSearchCV(model2,param_grid,refit = True, verbose = 2,scoring = 'f1',cv=5)

grid.fit(X,y)

grid.best_params_
Output:
{'C': 5, 'gamma': 0.1, 'random_state': 1}
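Since refit=True, the grid object already holds a model refit with the best parameters, so it can be used directly instead of retyping them. (Note also that for SVC, random_state only affects probability estimates when probability=True, so tuning it has little effect here.)

best_model = grid.best_estimator_   # SVC refit on X, y with the best parameters
y_predict_best = best_model.predict(X_test)
print(f1_score(y_test, y_predict_best))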
# Creating Model

from sklearn.svm import SVC
model3 = SVC(C=5, gamma=0.1,random_state=1)
model3.fit(X_smote,y_smote)

y_predict3 = model3.predict(X_test)
print(classification_report(y_test,y_predict3))
Output:
             precision    recall  f1-score   support

           0       0.33      0.02      0.05        41
           1       0.74      0.98      0.84       113

    accuracy                           0.73       154
   macro avg       0.53      0.50      0.44       154
weighted avg       0.63      0.73      0.63       154
print(pd.crosstab(y_test,y_predict3))
Output:
col_0                   0    1
Loan_Status (Approved)        
0                       1   40
1                       2  111
print(f1_score(y_test,y_predict3))
Output:
0.8409090909090909
from sklearn.model_selection import cross_val_score

scores2 = cross_val_score(model3,X,y,cv=3,scoring='f1')

print("Cross validation Score:",scores2.mean())
print("Std :",scores2.std())
#std of < 0.05 is good.
Output:
Cross validation Score: 0.8483770612898426
Std : 0.009742740208838268
