Logistic Regression:
- Logistic regression is a supervised machine learning algorithm used for classification tasks, aiming to predict the probability that an instance belongs to a specific class.
- If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class, labeled '1'); otherwise it predicts that it does not (that is, it belongs to the negative class, labeled '0').
- This makes it a binary classifier.
Logistic Regression Equation:
In linear regression, we predict a continuous outcome using:
y = β0 + β1x1 + ⋯ + βnxn
But for classification, we need to predict probabilities between 0 and 1, not arbitrary real numbers. So we apply the logistic function (also called the sigmoid function) to “squash” the linear output into a probability value constrained between 0 and 1.
The sigmoid function has the following formula:
σ(z) = 1 / (1 + e^(−z))
where z = β0 + β1x1 + ⋯ + βnxn is the linear output and σ(z) is the predicted probability ŷ.
To make a decision:
If ŷ ≥ 0.5, predict 1
If ŷ < 0.5, predict 0
The threshold (0.5) can be tuned based on context (e.g., in imbalanced datasets).
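A minimal NumPy sketch (added here, not part of the original notes) of this decision rule; the coefficient and feature values are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients and a single instance (illustrative values only)
beta = np.array([-1.0, 0.8, 0.5])   # [beta0 (intercept), beta1, beta2]
x = np.array([1.0, 2.0, 0.5])       # [1 for the intercept, x1, x2]

z = beta @ x                             # linear output: beta0 + beta1*x1 + beta2*x2
y_hat = sigmoid(z)                       # predicted probability of the positive class
prediction = 1 if y_hat >= 0.5 else 0    # apply the 0.5 threshold
print(round(y_hat, 3), prediction)       # ~0.701 -> predict 1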
Cost Function:
A cost function tells us how wrong our model is — it’s like a scorecard:
High cost = Bad predictions
Low cost = Good predictions
In logistic regression, we’re predicting probabilities (like 0.82 or 0.09), but the actual outcome is either 1 (yes) or 0 (no).
So we need a cost function that:
- Punishes bad predictions
- Rewards the correct ones
- Works well with probabilities
We use a function called Log Loss or Binary Cross-Entropy for Logistic Regression:
Cost(ŷ, y) = −[ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]
where y is the actual label (0 or 1) and ŷ is the predicted probability.
Let’s break it down:
If y = 1: the cost becomes −log(ŷ) → we want ŷ to be close to 1
If y = 0: the cost becomes −log(1 − ŷ) → we want ŷ to be close to 0
So the cost gets really big if the model predicts the wrong class confidently.
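To make the “confidently wrong is expensive” point concrete, here is a small sketch (added for illustration, not from the original walkthrough) that evaluates the log-loss formula above for a few predicted probabilities when the true label is 1:

import numpy as np

def log_loss(y, y_hat):
    # Binary cross-entropy for a single prediction
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(log_loss(1, 0.90))   # ~0.105 -> confident and correct: low cost
print(log_loss(1, 0.50))   # ~0.693 -> unsure: moderate cost
print(log_loss(1, 0.01))   # ~4.605 -> confident and wrong: very high cost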
To make the model better, we want to reduce the cost function (get better predictions).
To reduce the cost function, we will use Gradient Descent.
Gradient Descent
Imagine we are walking down a hill in fog to find the lowest point (lowest error):
Take a step in the direction of the steepest downward slope (the negative of the gradient).
Keep stepping until the slope flattens out, which means we have reached the minimum.
In logistic regression, this means updating each of the model’s parameters (β) like this:
βj := βj − α · ∂J/∂βj
where α is the learning rate and ∂J/∂βj is the gradient of the cost with respect to βj.
The computer does this automatically. We just need to set the learning rate and let it iterate.
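As a hedged illustration (the toy data and settings below are invented for this sketch, not part of the original example), the update rule can be written out in a few lines of NumPy:

import numpy as np

# Toy data: one feature, four instances (illustrative values only)
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0, 0, 1, 1])
X_b = np.c_[np.ones(len(X)), X]          # add a column of 1s for the intercept

beta = np.zeros(X_b.shape[1])            # start from all-zero parameters
learning_rate = 0.1

for _ in range(1000):
    y_hat = 1 / (1 + np.exp(-X_b @ beta))        # sigmoid of the linear output
    gradient = X_b.T @ (y_hat - y) / len(y)      # gradient of the log loss w.r.t. beta
    beta -= learning_rate * gradient             # step against the gradient

print(beta)   # learned [intercept, slope]; the cost has been driven down step by step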
How the Model Works:
Model starts with random guesses.
Cost is high at first.
Each step via gradient descent makes the predictions better.
Cost keeps going down.
Training stops when cost is low enough or doesn’t improve.
Python Implementation for Logistic Regression:
Business Case: Based on the given features, predict whether a person has diabetes or not.
- Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
- Load the dataset
data=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv')
Basic Checks:
- See first five rows
data.head()
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
- See the last five rows
data.tail()
Output:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0
data.describe()
Output:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
- In ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, certain datapoints are zero even though these parameters cannot be zero, so these are definitely corrupted values.
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Exploratory Data Analysis
sns.countplot(x='Pregnancies',data=data)
# Most patients have had 0 or 1 pregnancies.
plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1
for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.histplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Count',fontsize=8)
    plotnumber+=1
plt.tight_layout()
## Analyzing how pregnancies impact patients with diabetes.
sns.countplot(x='Pregnancies',hue='Outcome',data=data)
plt.show()
## Analyzing the relationship between diabetes and Glucose
sns.histplot(x='Glucose',hue='Outcome',data=data)
## Analyze Glucose with blood pressure
sns.relplot(x='Glucose',y='BloodPressure',hue='Outcome',data=data)
plt.show()
## Analyze Glucose with SkinThickness
sns.relplot(x='Glucose',y='SkinThickness',hue='Outcome',data=data)
plt.show()
## Analyze relationship between BloodPressure and Outcome
sns.histplot(x='BloodPressure',hue='Outcome',data=data)
## Analyze BP with SkinThickness
sns.relplot(x='BloodPressure',y='SkinThickness',hue='Outcome',data=data)
plt.show()
## Analyze BP with Insulin
sns.relplot(x='BloodPressure',y='Insulin',col='Outcome',data=data)
plt.show()
## Analyzing Insulin with target
sns.histplot(x='Insulin',hue='Outcome',data=data)
Data Preprocessing and Feature Engineering:
- Handling the missing values
data.isnull().sum()
Output: Pregnancies 0 Glucose 0 BloodPressure 0 SkinThickness 0 Insulin 0 BMI 0 DiabetesPedigreeFunction 0 Age 0 Outcome 0 dtype: int64
- Handling the corrupted data.
Our corrupted variables are ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, as they contain datapoints that are zero.
# Locate rows where the Glucose data is corrupted
data.loc[data['Glucose']==0]
Output:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
75             1        0             48             20        0  24.7                     0.140   22        0
182            1        0             74             20       23  27.7                     0.299   21        0
342            1        0             68             35        0  32.0                     0.389   22        0
349            5        0             80             32        0  41.0                     0.346   37        1
502            6        0             68             41        0  39.0                     0.727   41        1
# Replace zero values with the mean
data['Glucose']=data['Glucose'].replace(0,np.mean(data['Glucose']))
# Check whether any zero values remain
data.loc[data['Glucose']==0]
Output:
Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Index: []
# Impute the remaining corrupted variables in the same way
data['BloodPressure']=data['BloodPressure'].replace(0,np.mean(data['BloodPressure']))
data['SkinThickness']=data['SkinThickness'].replace(0,np.median(data['SkinThickness']))
data['Insulin']=data['Insulin'].replace(0,np.median(data['Insulin']))
data['BMI']=data['BMI'].replace(0,np.mean(data['BMI']))
- Checking the outliers
plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1
for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.boxplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Count',fontsize=8)
    plotnumber+=1
plt.tight_layout()
- Scaling the data
from sklearn.preprocessing import MinMaxScaler
sc=MinMaxScaler()
dl=['Pregnancies','Outcome'] # We do not need to scale these two variables as they are already of small magnitude
data1=sc.fit_transform(data.drop(dl,axis=1))
data2=pd.DataFrame(data1,columns=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age'])
Small_Magnitude_data=data[['Pregnancies','Outcome']]
final_df=pd.concat([data2,Small_Magnitude_data],axis=1)
final_df
Output:
    Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age  Pregnancies  Outcome
0  0.670968       0.489796       0.304348  0.019832  0.314928                  0.234415  0.483333            6        1
1  0.264516       0.428571       0.239130  0.019832  0.171779                  0.116567  0.166667            1        0
2  0.896774       0.408163       0.173913  0.019832  0.104294                  0.253629  0.183333            8        1
3  0.290323       0.428571       0.173913  0.096154  0.202454                  0.038002  0.000000            1        0
4  0.600000       0.163265       0.304348  0.185096  0.509202                  0.943638  0.200000            0        1
Feature Selection
## No redundant features
## We will check the correlation between features
sns.heatmap(data2.corr(),annot=True)
# No strong correlations between features, hence no features need to be dropped
Model Creation
- Creating the independent and dependent variables.
X=final_df.iloc[:,:-1]
y=final_df.Outcome
- Creating training and testing data.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=45)
- Model creation
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train) ## training
- Prediction
y_pred=clf.predict(X_test)
y_pred
Output: array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
y_pred_prob=clf.predict_proba(X_test)
y_pred_prob
Output: array([[0.74189213, 0.25810787], [0.54002809, 0.45997191], [0.81116822, 0.18883178], [0.85453471, 0.14546529], [0.54602547, 0.45397453], [0.92399497, 0.07600503], [0.36275549, 0.63724451], [0.91507154, 0.08492846], ...])
data.Outcome.value_counts()
Output:
0    500
1    268
Name: Outcome, dtype: int64
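Since the classes are imbalanced (500 non-diabetic vs 268 diabetic), the default 0.5 cut-off mentioned earlier is not sacred. As a hedged sketch (the 0.35 value below is arbitrary, not a tuned threshold), the probabilities from predict_proba can be thresholded manually to trade precision for recall on the positive class:

# Use the probability of class 1 (second column of y_pred_prob) with a custom threshold
threshold = 0.35   # illustrative value only; tune it via validation
y_pred_custom = (y_pred_prob[:, 1] >= threshold).astype(int)
print((y_pred_custom == 1).sum(), "positives predicted vs", (y_pred == 1).sum(), "at the default 0.5")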
Model Evaluation:
from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)
recall=recall_score(y_test,y_pred)
print("\nrecall",recall)
precision=precision_score(y_test,y_pred)
print("\nprecision",precision)
f1score=f1_score(y_test,y_pred)
print("\nf1score",f1score)
cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)
Output:
confusion_matrix
 [[113  17]
  [ 31  31]]

recall 0.5

precision 0.6458333333333334

f1score 0.5636363636363636

classification_report
               precision    recall  f1-score   support

           0       0.78      0.87      0.82       130
           1       0.65      0.50      0.56        62

    accuracy                           0.75       192
   macro avg       0.72      0.68      0.69       192
weighted avg       0.74      0.75      0.74       192
Multiclass Classification
Problem statement: Based on features like sepal and petal length and width, predict the species of the iris flower.
#Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
df
Output:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
Note: Let’s skip data preprocessing and feature engineering and go straight to model building.
Model Creation:
X=df.iloc[:,:-1]
y=df.Name
## Training and testing data
from sklearn.model_selection import train_test_split
X_train1,X_test1,y_train,y_test=train_test_split(X,y,random_state=25)
#Building model
from sklearn.linear_model import LogisticRegression
lr_multi=LogisticRegression(multi_class='ovr')
lr_multi.fit(X_train1,y_train)
y_pred=lr_multi.predict(X_test1)
Note: “ovr” stands for “One-vs-Rest” (also known as “One-vs-All”). This is a strategy used for multi-class classification.
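For intuition, the same one-vs-rest strategy can also be written explicitly with scikit-learn's OneVsRestClassifier wrapper. This is a sketch of an equivalent formulation, not something used in the original walkthrough:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# One binary logistic regression is trained per class (3 for the iris dataset)
ovr_model = OneVsRestClassifier(LogisticRegression())
ovr_model.fit(X_train1, y_train)
print(len(ovr_model.estimators_))   # number of underlying binary classifiers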
Model Evaluation:
from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)
recall=recall_score(y_test,y_pred,average='weighted')
print("\nrecall",recall)
precision=precision_score(y_test,y_pred,average='weighted')
print("\nprecision",precision)
f1score=f1_score(y_test,y_pred,average='weighted')
print("\nf1score",f1score)
cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)