Logistic Regression:
- Logistic regression is a supervised machine learning algorithm used for classification tasks, aiming to predict the probability that an instance belongs to a specific class.
- If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class, labeled '1'); otherwise it predicts that it does not (that is, it belongs to the negative class, labeled '0').
- This makes it a binary classifier.
Logistic Regression Equation:
In linear regression, we predict a continuous outcome using:
y = β0 + β1x1 + ⋯ + βnxn
But for classification, we need to predict probabilities between 0 and 1, not arbitrary real numbers. So we apply the logistic function (also called the sigmoid function) to “squash” the linear output into a probability value constrained between 0 and 1.
The sigmoid function has the following formula:
σ(z) = 1 / (1 + e^(−z))
where z = β0 + β1x1 + ⋯ + βnxn is the linear output and σ(z) is the predicted probability ŷ.
To make a decision:
If ŷ ≥ 0.5, predict 1
If ŷ < 0.5, predict 0
The threshold (0.5) can be tuned based on context (e.g., in imbalanced datasets).
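A minimal NumPy sketch (added here, not part of the original notes) of this decision rule; the coefficient and feature values are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients and a single instance (illustrative values only)
beta = np.array([-1.0, 0.8, 0.5])   # [beta0 (intercept), beta1, beta2]
x = np.array([1.0, 2.0, 0.5])       # [1 for the intercept, x1, x2]

z = beta @ x                             # linear output: beta0 + beta1*x1 + beta2*x2
y_hat = sigmoid(z)                       # predicted probability of the positive class
prediction = 1 if y_hat >= 0.5 else 0    # apply the 0.5 threshold
print(round(y_hat, 3), prediction)       # ~0.701 -> predict 1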
Cost Function:
A cost function tells us how wrong our model is — it’s like a scorecard:
High cost = Bad predictions
Low cost = Good predictions
In logistic regression, we’re predicting probabilities (like 0.82 or 0.09), but the actual outcome is either 1 (yes) or 0 (no).
So we need a cost function that:
- Punishes bad predictions
- Rewards the correct ones
- Works well with probabilities
We use a function called Log Loss or Binary Cross-Entropy for Logistic Regression:
Cost(ŷ, y) = −[ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]
where y is the actual label (0 or 1) and ŷ is the predicted probability.
Let’s break it down:
If y = 1: the cost becomes −log(ŷ) → we want ŷ to be close to 1
If y = 0: the cost becomes −log(1 − ŷ) → we want ŷ to be close to 0
So the cost gets really big if the model predicts the wrong class confidently.
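To make the “confidently wrong is expensive” point concrete, here is a small sketch (added for illustration, not from the original walkthrough) that evaluates the log-loss formula above for a few predicted probabilities when the true label is 1:

import numpy as np

def log_loss(y, y_hat):
    # Binary cross-entropy for a single prediction
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(log_loss(1, 0.90))   # ~0.105 -> confident and correct: low cost
print(log_loss(1, 0.50))   # ~0.693 -> unsure: moderate cost
print(log_loss(1, 0.01))   # ~4.605 -> confident and wrong: very high cost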
To make the model better, we want to reduce the cost function (get better predictions).
To reduce the cost function, we will use Gradient Descent.
Gradient Descent
Imagine we are walking down a hill in fog to find the lowest point (lowest error):
Take a step in the direction of the steepest downward slope (the negative of the gradient).
Keep stepping until the slope flattens out, which means we have reached the minimum.
In logistic regression, this means updating each of the model’s parameters (β) like this:
βj := βj − α · ∂J/∂βj
where α is the learning rate and ∂J/∂βj is the gradient of the cost with respect to βj.
The computer does this automatically. We just need to set the learning rate and let it iterate.
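As a hedged illustration (the toy data and settings below are invented for this sketch, not part of the original example), the update rule can be written out in a few lines of NumPy:

import numpy as np

# Toy data: one feature, four instances (illustrative values only)
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0, 0, 1, 1])
X_b = np.c_[np.ones(len(X)), X]          # add a column of 1s for the intercept

beta = np.zeros(X_b.shape[1])            # start from all-zero parameters
learning_rate = 0.1

for _ in range(1000):
    y_hat = 1 / (1 + np.exp(-X_b @ beta))        # sigmoid of the linear output
    gradient = X_b.T @ (y_hat - y) / len(y)      # gradient of the log loss w.r.t. beta
    beta -= learning_rate * gradient             # step against the gradient

print(beta)   # learned [intercept, slope]; the cost has been driven down step by step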
How the Model Works:
Model starts with random guesses.
Cost is high at first.
Each step via gradient descent makes the predictions better.
Cost keeps going down.
Training stops when cost is low enough or doesn’t improve.
Python Implementation for Logistic Regression:
Business Case: Based on the given features, predict whether a person has diabetes or not.
- Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
- Load the dataset
data=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv')
Basic Checks:
- See first five rows
data.head()
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
- See the last five rows
data.tail()
Output:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0
data.describe()
Output:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
- In ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, certain datapoints are zero even though these parameters cannot be zero, so these are definitely corrupted values.
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Exploratory Data Analysis
sns.countplot(x='Pregnancies',data=data)
# Most patients have had 0 or 1 pregnancies.
plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1
for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.histplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Count',fontsize=8)
    plotnumber+=1
plt.tight_layout()
## Analyzing how pregnancies impact patients with diabetes.
sns.countplot(x='Pregnancies',hue='Outcome',data=data)
plt.show()
## Analyzing the relationship between diabetes and Glucose
sns.histplot(x='Glucose',hue='Outcome',data=data)
## Analyze Glucose with blood pressure
sns.relplot(x='Glucose',y='BloodPressure',hue='Outcome',data=data)
plt.show()
## Analyze Glucose with SkinThickness
sns.relplot(x='Glucose',y='SkinThickness',hue='Outcome',data=data)
plt.show()
## Analyze relationship between BloodPressure and Outcome
sns.histplot(x='BloodPressure',hue='Outcome',data=data)
## Analyze BP with SkinThickness
sns.relplot(x='BloodPressure',y='SkinThickness',hue='Outcome',data=data)
plt.show()
## Analyze BP with Insulin
sns.relplot(x='BloodPressure',y='Insulin',col='Outcome',data=data)
plt.show()
## Analyzing Insulin with target
sns.histplot(x='Insulin',hue='Outcome',data=data)
Data Preprocessing and Feature Engineering:
- Handling the missing values
data.isnull().sum()
Output: Pregnancies 0 Glucose 0 BloodPressure 0 SkinThickness 0 Insulin 0 BMI 0 DiabetesPedigreeFunction 0 Age 0 Outcome 0 dtype: int64
- Handling the corrupted data.
Our corrupted variables are ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, as they contain datapoints that are zero.
# Locate rows where the Glucose data is corrupted
data.loc[data['Glucose']==0]
Output:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
75             1        0             48             20        0  24.7                     0.140   22        0
182            1        0             74             20       23  27.7                     0.299   21        0
342            1        0             68             35        0  32.0                     0.389   22        0
349            5        0             80             32        0  41.0                     0.346   37        1
502            6        0             68             41        0  39.0                     0.727   41        1
# Replace zero values with the mean
data['Glucose']=data['Glucose'].replace(0,np.mean(data['Glucose']))
# Check whether any zero values remain
data.loc[data['Glucose']==0]
Output:
Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Index: []
# Impute the remaining corrupted variables in the same way
data['BloodPressure']=data['BloodPressure'].replace(0,np.mean(data['BloodPressure']))
data['SkinThickness']=data['SkinThickness'].replace(0,np.median(data['SkinThickness']))
data['Insulin']=data['Insulin'].replace(0,np.median(data['Insulin']))
data['BMI']=data['BMI'].replace(0,np.mean(data['BMI']))
- Checking the outliers
plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1
for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.boxplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Count',fontsize=8)
    plotnumber+=1
plt.tight_layout()
- Scaling the data
from sklearn.preprocessing import MinMaxScaler
sc=MinMaxScaler()
dl=['Pregnancies','Outcome'] # We do not need to scale these two variables as they are already of small magnitude
data1=sc.fit_transform(data.drop(dl,axis=1))
data2=pd.DataFrame(data1,columns=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age'])
Small_Magnitude_data=data[['Pregnancies','Outcome']]
final_df=pd.concat([data2,Small_Magnitude_data],axis=1)
final_df
Output:
    Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age  Pregnancies  Outcome
0  0.670968       0.489796       0.304348  0.019832  0.314928                  0.234415  0.483333            6        1
1  0.264516       0.428571       0.239130  0.019832  0.171779                  0.116567  0.166667            1        0
2  0.896774       0.408163       0.173913  0.019832  0.104294                  0.253629  0.183333            8        1
3  0.290323       0.428571       0.173913  0.096154  0.202454                  0.038002  0.000000            1        0
4  0.600000       0.163265       0.304348  0.185096  0.509202                  0.943638  0.200000            0        1
Feature Selection
## No redundant features
## We will check the correlation between features
sns.heatmap(data2.corr(),annot=True)
# No strong correlations between features, hence no features need to be dropped
Model Creation
- Creating the independent and dependent variables.
X=final_df.iloc[:,:-1]
y=final_df.Outcome
- Creating training and testing data.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=45)
- Model creation
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train) ## training
- Prediction
y_pred=clf.predict(X_test)
y_pred
Output: array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
y_pred_prob=clf.predict_proba(X_test)
y_pred_prob
Output: array([[0.74189213, 0.25810787], [0.54002809, 0.45997191], [0.81116822, 0.18883178], [0.85453471, 0.14546529], [0.54602547, 0.45397453], [0.92399497, 0.07600503], [0.36275549, 0.63724451], [0.91507154, 0.08492846], ...])
data.Outcome.value_counts()
Output:
0    500
1    268
Name: Outcome, dtype: int64
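Since the classes are imbalanced (500 non-diabetic vs 268 diabetic), the default 0.5 cut-off mentioned earlier is not sacred. As a hedged sketch (the 0.35 value below is arbitrary, not a tuned threshold), the probabilities from predict_proba can be thresholded manually to trade precision for recall on the positive class:

# Use the probability of class 1 (second column of y_pred_prob) with a custom threshold
threshold = 0.35   # illustrative value only; tune it via validation
y_pred_custom = (y_pred_prob[:, 1] >= threshold).astype(int)
print((y_pred_custom == 1).sum(), "positives predicted vs", (y_pred == 1).sum(), "at the default 0.5")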
Model Evaluation:
from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)
recall=recall_score(y_test,y_pred)
print("\nrecall",recall)
precision=precision_score(y_test,y_pred)
print("\nprecision",precision)
f1score=f1_score(y_test,y_pred)
print("\nf1score",f1score)
cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)
Output:
confusion_matrix
 [[113  17]
  [ 31  31]]

recall 0.5

precision 0.6458333333333334

f1score 0.5636363636363636

classification_report
               precision    recall  f1-score   support

           0       0.78      0.87      0.82       130
           1       0.65      0.50      0.56        62

    accuracy                           0.75       192
   macro avg       0.72      0.68      0.69       192
weighted avg       0.74      0.75      0.74       192
Multiclass Classification
Problem statement: Based on features like sepal and petal length and width, predict the species of the iris flower.
#Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
df
Output:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
Note: Let’s skip data preprocessing and feature engineering and go straight to model building.
Model Creation:
X=df.iloc[:,:-1]
y=df.Name
## Training and testing data
from sklearn.model_selection import train_test_split
X_train1,X_test1,y_train,y_test=train_test_split(X,y,random_state=25)
#Building model
from sklearn.linear_model import LogisticRegression
lr_multi=LogisticRegression(multi_class='ovr')
lr_multi.fit(X_train1,y_train)
y_pred=lr_multi.predict(X_test1)
Note: “ovr” stands for “One-vs-Rest” (also known as “One-vs-All”). This is a strategy used for multi-class classification.
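For intuition, the same one-vs-rest strategy can also be written explicitly with scikit-learn's OneVsRestClassifier wrapper. This is a sketch of an equivalent formulation, not something used in the original walkthrough:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# One binary logistic regression is trained per class (3 for the iris dataset)
ovr_model = OneVsRestClassifier(LogisticRegression())
ovr_model.fit(X_train1, y_train)
print(len(ovr_model.estimators_))   # number of underlying binary classifiers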
Model Evaluation:
from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)
recall=recall_score(y_test,y_pred,average='weighted')
print("\nrecall",recall)
precision=precision_score(y_test,y_pred,average='weighted')
print("\nprecision",precision)
f1score=f1_score(y_test,y_pred,average='weighted')
print("\nf1score",f1score)
cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)