Logistic Regression:

  • Logistic regression is a supervised machine learning algorithm used for classification tasks, aiming to predict the probability that an instance belongs to a specific class.
  • If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class, labeled '1'); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled '0').
  • This makes it a binary classifier.
  • The equation of Logistic Regression is
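In its standard binary form (quoting the textbook formula here, since the theory write-up below is still a placeholder), the model passes a linear combination of the features through the sigmoid function:

$$\hat{p} = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$

The instance is assigned to the positive class when $\hat{p} \geq 0.5$ and to the negative class otherwise.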

Theory: to be updated later…

Python Implementation for Logistic Regression

Business Case: Based on the given features, predict whether a person has diabetes or not.

Domain Analysis

1)Pregnancies:-Some women have diabetes before they get pregnant. This is called pregestational diabetes. Other women may get a type of diabetes that only happens in pregnancy. This is called gestational diabetes. Pregnancy can change how a woman’s body uses glucose. This can make diabetes worse, or lead to gestational diabetes.

If you have gestational diabetes during pregnancy, generally your blood sugar returns to its usual level soon after delivery. But if you’ve had gestational diabetes, you have a higher risk of getting type 2 diabetes. You’ll need to be tested for changes in blood sugar more often.

The risk of developing diabetes is about 28% if the patient has conceived more than 2 or 3 times.

2)Glucose:-Glucose is your body's source of fuel. Your pancreas makes insulin to move glucose from your bloodstream into muscle, fat, and liver cells, where your body turns it into energy. People with diabetes have too much blood sugar because their body cannot move glucose into fat, liver, and muscle cells to be converted into energy or stored.

3)Blood Pressure:-A person with diabetes is twice as likely to have high blood pressure as someone who does not have diabetes. When you have diabetes, high blood sugar can damage your blood vessels and the nerves that help your heart pump. Similarly, high blood pressure puts increased strain on your heart and blood vessels. When these two conditions occur together, they increase the risk of heart disease (cardiovascular disease) and stroke. According to a 2018 article, people with high blood pressure usually have insulin resistance and an increased risk of developing diabetes compared to those with typical blood pressure. Blood pressure should be below 140/80 mmHg for people with diabetes, or below 130/80 mmHg if you have kidney or eye disease or any condition that affects blood vessels and blood supply to the brain.

4)Skin Thickness:-Skin thickening is frequently observed in patients with diabetes. Affected areas of skin can appear thickened, waxy, or edematous. These patients are often asymptomatic but can have a reduction in sensation and pain. Although different parts of the body can be involved, the hands and feet are most frequently affected. Diabetes can also cause changes in the small blood vessels. These changes can cause skin problems called diabetic dermopathy. Dermopathy often looks like light brown, scaly patches. These patches may be oval or circular.

5)Insulin:-Insulin is a hormone your pancreas makes to lower blood glucose, or sugar. If you have diabetes, your pancreas either doesn't make enough insulin or your body doesn't respond well to it. Your body needs insulin to keep the blood sugar level in a healthy range. Type 1 diabetes causes damage to the beta cells in your pancreas that make insulin. As a result, your body can't produce enough of this hormone. Type 2 diabetes gradually makes it harder for your beta cells to produce enough insulin to overcome the body's resistance to it.

6)BMI:-Body mass index has a strong relationship to diabetes and insulin resistance. In obese individuals, the amounts of nonesterified fatty acids, glycerol, hormones, cytokines, proinflammatory markers, and other substances involved in the development of insulin resistance are increased. The pathogenesis of diabetes rests on the fact that the β-islet cells of the pancreas become impaired, causing a loss of control of blood glucose. The development of diabetes becomes more inevitable if the failure of β-islet cells is accompanied by insulin resistance. Weight gain and body mass are central to the formation and rising incidence of type 1 and type 2 diabetes.

8)Age:-The prevalence of both type 2 diabetes and prediabetes increases with advancing age. The aging of the human body leads to impairment of energy homeostasis and abnormalities in carbohydrate metabolism. The most important factors leading to hyperglycaemia are a deficiency of insulin secretion that develops with age, and growing insulin resistance caused by changes in body composition and sarcopaenia.

  • Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
  • Load the dataset
data=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv')

Basic Checks:

  • See first five rows
data.head()
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
  • See the last five rows
data.tail()
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0
  • Statistical summary of the numeric columns
data.describe()
Output:
      Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
  • In ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, certain data points are zero even though these measurements cannot physically be zero, so these are clearly corrupted values.
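A quick way to quantify those corrupted zeros (a small check added here, not part of the original notebook):

# Count zero entries in the columns where zero is physically impossible
zero_cols=['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
print((data[zero_cols]==0).sum())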
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Exploratory Data Analysis

Univariate Analysis
sns.countplot(x='Pregnancies',data=data)
# Most patients have conceived 0 or 1 times.

plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1

for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.histplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Count',fontsize=8)
    plotnumber+=1
plt.tight_layout()

Bivariate Analysis
## Analyzing how pregnancies impact the diabetes outcome.
sns.countplot(x='Pregnancies',hue='Outcome',data=data)
plt.show()

## Analyzing the relationship between diabetes and Glucose

sns.histplot(x='Glucose',hue='Outcome',data=data)

## Analyze Glucose with blood pressure

sns.relplot(x='Glucose',y='BloodPressure',hue='Outcome',data=data)
plt.show()

## Analyze Glucose with SkinThickness

sns.relplot(x='Glucose',y='SkinThickness',hue='Outcome',data=data)
plt.show()

## Analyze relationship between BloodPressure and Outcome

sns.histplot(x='BloodPressure',hue='Outcome',data=data)

## Analyze BP with SkinThickness

sns.relplot(x='BloodPressure',y='SkinThickness',hue='Outcome',data=data)
plt.show()

## Analyze BP with Insulin

sns.relplot(x='BloodPressure',y='Insulin',col='Outcome',data=data)
plt.show()

## Analyzing Insulin with target

sns.histplot(x='Insulin',hue='Outcome',data=data)

Data Preprocessing and Feature Engineering:

  • Handling the missing values
data.isnull().sum()
Output:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
  • Handling the corrupted data.

Our corrupted variables are ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, as they contain data points that are zero.

# Locate rows where the Glucose value is zero (corrupted)

data.loc[data['Glucose']==0]
Output:
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
75             1        0             48             20        0  24.7                     0.140   22        0
182            1        0             74             20       23  27.7                     0.299   21        0
342            1        0             68             35        0  32.0                     0.389   22        0
349            5        0             80             32        0  41.0                     0.346   37        1
502            6        0             68             41        0  39.0                     0.727   41        1
# Replace zero values with the column mean

data['Glucose']=data['Glucose'].replace(0,data['Glucose'].mean())
# Check whether any zeros remain

data.loc[data['Glucose']==0]
Output:
Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Index: []
# Impute the remaining corrupted variables the same way

data['BloodPressure']=data['BloodPressure'].replace(0,data['BloodPressure'].mean())
data['SkinThickness']=data['SkinThickness'].replace(0,data['SkinThickness'].median())
data['Insulin']=data['Insulin'].replace(0,data['Insulin'].median())
data['BMI']=data['BMI'].replace(0,data['BMI'].mean())
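One caveat with the imputation above: the means and medians are computed over all rows, including the corrupted zeros, which biases the imputed values downward. A variant worth considering (my suggestion, not what this notebook does) computes the statistics from the non-zero values only, e.g. for the mean-imputed columns:

# Alternative to the cells above: impute with the mean of the non-zero values only
for col in ['Glucose','BloodPressure','BMI']:
    nonzero_mean=data.loc[data[col]!=0,col].mean()
    data[col]=data[col].replace(0,nonzero_mean)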

  • Checking the outliers
plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1

for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.boxplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Value',fontsize=8)
    plotnumber+=1
plt.tight_layout()

  • Scaling the data
from sklearn.preprocessing import MinMaxScaler

sc=MinMaxScaler()
dl=['Pregnancies','Outcome']  # No need to scale these two: Pregnancies is already small in magnitude, and Outcome is the target label
data1=sc.fit_transform(data.drop(dl,axis=1))

data2=pd.DataFrame(data1,columns=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'])

Small_Magnitude_data=data[['Pregnancies','Outcome']]

final_df=pd.concat([data2,Small_Magnitude_data],axis=1)
final_df
Output:
    Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age  Pregnancies  Outcome
0  0.670968       0.489796       0.304348  0.019832  0.314928                  0.234415  0.483333            6        1
1  0.264516       0.428571       0.239130  0.019832  0.171779                  0.116567  0.166667            1        0
2  0.896774       0.408163       0.173913  0.019832  0.104294                  0.253629  0.183333            8        1
3  0.290323       0.428571       0.173913  0.096154  0.202454                  0.038002  0.000000            1        0
4  0.600000       0.163265       0.304348  0.185096  0.509202                  0.943638  0.200000            0        1
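A methodological side note: the scaler above is fitted on the full dataset before the train/test split, so information about the test rows leaks into the scaling. A leakage-safe sketch (my variant, using the same random_state as the split below) fits the scaler on the training split only:

# Sketch: fit the scaler on the training rows only, then reuse it on the test rows
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_raw=data.drop('Outcome',axis=1)
y_raw=data['Outcome']
X_tr,X_te,y_tr,y_te=train_test_split(X_raw,y_raw,random_state=45)
scaler=MinMaxScaler().fit(X_tr)   # learn min/max from the training rows only
X_tr_scaled=scaler.transform(X_tr)
X_te_scaled=scaler.transform(X_te)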

Feature Selection

## No redundant features
## We will check correlation

sns.heatmap(data2.corr(),annot=True)

# No strong correlations among the features, hence no features need to be dropped

Model Creation

  • Creating independent and dependent variables.
X=final_df.iloc[:,:-1]
y=final_df.Outcome
  • Creating training and testing data. With the default test_size of 0.25, 192 of the 768 rows are held out for testing (matching the support totals in the classification report below).
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=45)
  • Model creation
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train)  ## training
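Once fitted, the learned weights and intercept are the w and b of the sigmoid equation at the top of this post; inspecting them is a quick sanity check (a small addition for illustration):

# Fitted parameters: one weight per feature, plus the bias term
print(clf.coef_)
print(clf.intercept_)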
  • Prediction
y_pred=clf.predict(X_test)
y_pred
Output:
array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
y_pred_prob=clf.predict_proba(X_test)
y_pred_prob
Output:
array([[0.74189213, 0.25810787],
       [0.54002809, 0.45997191],
       [0.81116822, 0.18883178],
       [0.85453471, 0.14546529],
       [0.54602547, 0.45397453],
       [0.92399497, 0.07600503],
       [0.36275549, 0.63724451],
       [0.91507154, 0.08492846],
       ...
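Each row of predict_proba sums to 1: the first column is P(Outcome=0) and the second is P(Outcome=1). predict() is equivalent to thresholding the positive-class probability at 50%, which we can verify (a small check, not in the original):

# Recover the hard predictions from the probabilities
manual_pred=(y_pred_prob[:,1]>=0.5).astype(int)
print((manual_pred==y_pred).all())   # expect True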
data.Outcome.value_counts()
Output:
0    500
1    268
Name: Outcome, dtype: int64

Model Evaluation:

from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)

recall=recall_score(y_test,y_pred)
print("\nrecall",recall)

precision=precision_score(y_test,y_pred)
print("\nprecision",precision)

f1score=f1_score(y_test,y_pred)
print("\nf1score",f1score)

cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)
Output:
confusion_matrix
 [[113  17]
 [ 31  31]]

recall 0.5

precision 0.6458333333333334

f1score 0.5636363636363636

classification_report
               precision    recall  f1-score   support

           0       0.78      0.87      0.82       130
           1       0.65      0.50      0.56        62

    accuracy                           0.75       192
   macro avg       0.72      0.68      0.69       192
weighted avg       0.74      0.75      0.74       192
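Reading the confusion matrix: rows are true classes and columns are predicted classes, so there are 113 true negatives, 17 false positives, 31 false negatives, and 31 true positives. The reported scores follow directly: recall = 31/(31+31) = 0.5, precision = 31/(31+17) ≈ 0.6458, and accuracy = (113+31)/192 = 0.75. The weak recall on the positive class is unsurprising given the class imbalance seen above (500 non-diabetic vs 268 diabetic).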

Multiclass Classification

Problem statement:-Based on features like sepal and petal length and width, predict the species of an iris flower.

#Importing the necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
df
Output:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

Note: Let’s skip data preprocessing & feature engineering and go straight to model building.

Model Creation:

X=df.iloc[:,:-1]
y=df.Name
## Training and testing data
from sklearn.model_selection import train_test_split
X_train1,X_test1,y_train,y_test=train_test_split(X,y,random_state=25)
#Building model

from sklearn.linear_model import LogisticRegression
lr_multi=LogisticRegression(multi_class='ovr')
lr_multi.fit(X_train1,y_train)
y_pred=lr_multi.predict(X_test1)

Note: “ovr” stands for “One-vs-Rest” (also known as “One-vs-All”). This is a strategy used for multi-class classification.
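Under the hood, OvR fits one binary logistic regression per class (setosa vs rest, versicolor vs rest, virginica vs rest) and predicts the class whose classifier gives the highest score. The fitted model's shapes make this visible (a small illustrative check):

# One weight vector per class-vs-rest problem: shape (3 classes, 4 features)
print(lr_multi.classes_)
print(lr_multi.coef_.shape)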

Model Evaluation:

from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)

recall=recall_score(y_test,y_pred,average='weighted')
print("\nrecall",recall)

precision=precision_score(y_test,y_pred,average='weighted')
print("\nprecision",precision)

f1score=f1_score(y_test,y_pred,average='weighted')
print("\nf1score",f1score)

cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)
