Logistic Regression:
- Logistic regression is a supervised machine learning algorithm used for classification tasks, aiming to predict the probability that an instance belongs to a specific class.
- If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class, labeled '1'); otherwise it predicts that it does not (that is, it belongs to the negative class, labeled '0').
- This makes it a binary classifier.
- The equation of Logistic Regression is
Theory: to be updated later …
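For reference while the theory section is pending, the standard textbook form is given below, where x is the feature vector, w the learned weight vector, b the intercept, and σ the sigmoid function:
p̂ = σ(w·x + b) = 1 / (1 + e^-(w·x + b))
The model predicts class 1 when p̂ ≥ 0.5 (equivalently, when w·x + b ≥ 0) and class 0 otherwise, which matches the 50% rule stated above.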
Python Implementation for Logistic Regression
Business Case: Based on the given features, predict whether a person has diabetes or not.
Domain Analysis
1)Pregnancies:-Some women have diabetes before they get pregnant. This is called pregestational diabetes. Other women may get a type of diabetes that only happens in pregnancy. This is called gestational diabetes. Pregnancy can change how a woman’s body uses glucose. This can make diabetes worse, or lead to gestational diabetes.
If you have gestational diabetes during pregnancy, generally your blood sugar returns to its usual level soon after delivery. But if you’ve had gestational diabetes, you have a higher risk of getting type 2 diabetes. You’ll need to be tested for changes in blood sugar more often.
The risk of getting diabetes is 28% if the patient has conceived more than 2 or 3 times.
2)Glucose:-Glucose is your body’s source of fuel. Your pancreas makes insulin to move glucose from your bloodstream into muscle, fat, and liver cells, where your body turns it into energy. People with diabetes have too much blood sugar because their body cannot move glucose into fat, liver, and muscle cells to be converted into energy or stored.
3)Blood Pressure:-A person with diabetes is twice as likely to have high blood pressure as someone who does not have diabetes. When you have diabetes, high blood sugar can damage your blood vessels and the nerves that help your heart pump. Similarly, high blood pressure puts increased strain on your heart and blood vessels. When these two conditions occur together, they increase the risk of heart disease (cardiovascular disease) and stroke. According to a 2018 article, people with high blood pressure usually have insulin resistance and an increased risk of developing diabetes compared to those with typical blood pressure. Blood pressure should be below 140/80 mmHg for people with diabetes, or below 130/80 mmHg for those with kidney or eye disease or any condition that affects blood vessels and blood supply to the brain.
4)Skin Thickness:-Skin thickening is frequently observed in patients with diabetes. Affected areas of skin can appear thickened, waxy, or edematous. These patients are often asymptomatic but can have a reduction in sensation and pain. Although different parts of the body can be involved, the hands and feet are most frequently involved. Diabetes can cause changes in the small blood vessels. These changes can cause skin problems called diabetic dermopathy. Dermopathy often looks like light brown, scaly patches. These patches may be oval or circular.
5)Insulin:-Insulin is a hormone your pancreas makes to lower blood glucose, or sugar. If you have diabetes, your pancreas either doesn’t make enough insulin or your body doesn’t respond well to it. Your body needs insulin to keep the blood sugar level in a healthy range. Type 1 diabetes causes damage to the beta cells in your pancreas that make insulin, so your body can’t produce enough of this hormone. Type 2 diabetes gradually makes it harder for your body to respond to insulin (insulin resistance).
6)BMI:-Body mass index has a strong relationship to diabetes and insulin resistance. In obese individuals, the amount of nonesterified fatty acids, glycerol, hormones, cytokines, proinflammatory markers, and other substances that are involved in the development of insulin resistance, is increased. The pathogenesis in the development of diabetes is based on the fact that the β-islet cells of the pancreas are impaired, causing a lack of control of blood glucose. The development of diabetes becomes more inevitable if the failure of β-islet cells of the pancreas is accompanied by insulin resistance. Weight gain and body mass are central to the formation and rising incidence of type 1 and type 2 diabetes.
7)DiabetesPedigreeFunction:-A score that summarizes the history of diabetes in the patient’s relatives, weighted by their genetic closeness to the patient; higher values indicate a stronger hereditary risk of diabetes.
8)Age:-The prevalence of both type 2 diabetes and prediabetes increases with advancing age. Aging of the human body leads to impairment of energy homeostasis and abnormalities in carbohydrate metabolism. The most important factors leading to hyperglycaemia are thought to be a deficiency of insulin secretion that develops with age, and growing insulin resistance caused by changes in body composition and sarcopaenia.
- Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
- Load the datasets
data=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv')
Basic Checks:
- See first five rows
data.head()
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
- See the last five rows
data.tail()
Output:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0
- Get a statistical summary of the data
data.describe()
Output:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
- In ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, certain datapoints are zero even though these values cannot physiologically be zero, so these entries are clearly corrupted.
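A quick way to count these zeros per column (a minimal sketch; the suspect_cols list is just a convenience name, not from the original notebook):
# Count zero entries in columns where zero is physiologically impossible
suspect_cols=['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
print((data[suspect_cols]==0).sum())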
- Check column datatypes and non-null counts
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Exploratory Data Analysis
sns.countplot(x='Pregnancies',data=data)
# Most patients have conceived 0 or 1 times.
plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1
for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.histplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Count',fontsize=8)
    plotnumber+=1
plt.tight_layout()
## Analyzing how pregnancies impact the patient with diabetes.
sns.countplot(x='Pregnancies',hue='Outcome',data=data)
plt.show()
## Analyzing the relationship between diabetes and Glucose
sns.histplot(x='Glucose',hue='Outcome',data=data)
## Analyze Glucose with blood pressure
sns.relplot(x='Glucose',y='BloodPressure',hue='Outcome',data=data)
plt.show()
## Analyze Glucose with SkinThickness
sns.relplot(x='Glucose',y='SkinThickness',hue='Outcome',data=data)
plt.show()
## Analyze relationship between BloodPressure and Outcome
sns.histplot(x='BloodPressure',hue='Outcome',data=data)
## Analyze BP with SkinThickness
sns.relplot(x='BloodPressure',y='SkinThickness',hue='Outcome',data=data)
plt.show()
## Analyze BP with Insulin
sns.relplot(x='BloodPressure',y='Insulin',col='Outcome',data=data)
plt.show()
## Analyzing Insulin with target
sns.histplot(x='Insulin',hue='Outcome',data=data)
Data Preprocessing and Feature Engineering:
- Handling the missing values
data.isnull().sum()
Output:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
- Handling the corrupted data.
Our corrupted variables are ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, and ‘BMI’, since they contain zero values that are physiologically impossible.
#Locate datasets where Glucose data corrupted
data.loc[data['Glucose']==0]
Output:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
75             1        0             48             20        0  24.7                     0.140   22        0
182            1        0             74             20       23  27.7                     0.299   21        0
342            1        0             68             35        0  32.0                     0.389   22        0
349            5        0             80             32        0  41.0                     0.346   37        1
502            6        0             68             41        0  39.0                     0.727   41        1
# Replace zero values with the column mean (assignment form avoids the chained inplace pattern, which newer pandas versions reject)
data['Glucose']=data['Glucose'].replace(0,data['Glucose'].mean())
#Check if zero is still available
data.loc[data['Glucose']==0]
Output:
Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Index: []
# Impute the remaining corrupted variables in the same way
data['BloodPressure']=data['BloodPressure'].replace(0,data['BloodPressure'].mean())
data['SkinThickness']=data['SkinThickness'].replace(0,data['SkinThickness'].median())
data['Insulin']=data['Insulin'].replace(0,data['Insulin'].median())
data['BMI']=data['BMI'].replace(0,data['BMI'].mean())
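The five replacements above can also be written as a single loop, which keeps the mean/median choices in one place (a sketch equivalent to the cells above; impute_plan is just an illustrative name):
# Replace corrupted zeros column by column, using the same statistics as above
impute_plan={'Glucose':'mean','BloodPressure':'mean','SkinThickness':'median','Insulin':'median','BMI':'mean'}
for col,stat in impute_plan.items():
    fill=data[col].mean() if stat=='mean' else data[col].median()
    data[col]=data[col].replace(0,fill)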
- Checking the outliers
plt.figure(figsize=(8,7),facecolor='white')
plotnumber=1
for column in data:
    if plotnumber<=9:
        ax=plt.subplot(3,3,plotnumber)
        sns.boxplot(data[column])
        plt.xlabel(column,fontsize=8)
        plt.ylabel('Count',fontsize=8)
    plotnumber+=1
plt.tight_layout()
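The boxplots show outliers in several columns, notably Insulin. The notebook leaves them in place, but one common way to quantify them is the 1.5×IQR rule (a sketch, with Insulin as the example column):
# Count points outside the 1.5*IQR whiskers for one column
q1,q3=data['Insulin'].quantile([0.25,0.75])
iqr=q3-q1
outliers=data[(data['Insulin']<q1-1.5*iqr)|(data['Insulin']>q3+1.5*iqr)]
print(len(outliers),"outlying Insulin values")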
- Scaling the data
from sklearn.preprocessing import MinMaxScaler
sc=MinMaxScaler()
dl=['Pregnancies','Outcome'] # No need to scale these two variables, as they are already of small magnitude
data1=sc.fit_transform(data.drop(dl,axis=1))
data2=pd.DataFrame(data1,columns=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age'])
Small_Magnitude_data=data[['Pregnancies','Outcome']]
final_df=pd.concat([data2,Small_Magnitude_data],axis=1)
final_df
Output:
    Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age  Pregnancies  Outcome
0  0.670968       0.489796       0.304348  0.019832  0.314928                  0.234415  0.483333            6        1
1  0.264516       0.428571       0.239130  0.019832  0.171779                  0.116567  0.166667            1        0
2  0.896774       0.408163       0.173913  0.019832  0.104294                  0.253629  0.183333            8        1
3  0.290323       0.428571       0.173913  0.096154  0.202454                  0.038002  0.000000            1        0
4  0.600000       0.163265       0.304348  0.185096  0.509202                  0.943638  0.200000            0        1
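Note: fitting the scaler on the full dataset lets information from the future test rows leak into preprocessing. A stricter pattern is to split first and fit the scaler on the training portion only (a minimal sketch, not what this notebook does):
# Leak-free scaling: fit MinMaxScaler on the training split, reuse it on the test split
from sklearn.model_selection import train_test_split
X_raw=data.drop('Outcome',axis=1)
y_raw=data['Outcome']
X_tr,X_te,y_tr,y_te=train_test_split(X_raw,y_raw,random_state=45)
scaler=MinMaxScaler()
X_tr_scaled=scaler.fit_transform(X_tr)  # min/max learned from training data only
X_te_scaled=scaler.transform(X_te)      # same min/max applied to test data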
Feature Selection
## No redundant features
## We will check correlation
sns.heatmap(data2.corr(),annot=True)
# No strong correlation among the features, hence no features need to be dropped
Model Creation
- Creating independent and dependent variable.
X=final_df.iloc[:,:-1]
y=final_df.Outcome
- Creating training and testing data.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=45)
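train_test_split defaults to test_size=0.25, so 192 of the 768 rows land in the test set, which matches the support totals in the evaluation report below (a quick check):
# Confirm the split sizes (default test_size=0.25 -> 576 train / 192 test rows)
print(X_train.shape,X_test.shape)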
- Model creation
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train) ## training
- Prediction
y_pred=clf.predict(X_test)
y_pred
Output: array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
y_pred_prob=clf.predict_proba(X_test)
y_pred_prob
Output:
array([[0.74189213, 0.25810787],
       [0.54002809, 0.45997191],
       [0.81116822, 0.18883178],
       [0.85453471, 0.14546529],
       [0.54602547, 0.45397453],
       [0.92399497, 0.07600503],
       [0.36275549, 0.63724451],
       [0.91507154, 0.08492846],
       ...
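As described at the top, predict() simply thresholds the positive-class probability at 0.5; the same labels can be reproduced manually (a sketch):
# Column 1 of predict_proba is P(Outcome=1); thresholding at 0.5 reproduces predict()
manual_pred=(y_pred_prob[:,1]>=0.5).astype(int)
print(np.array_equal(manual_pred,y_pred))  # expected: True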
data.Outcome.value_counts()
Output:
0    500
1    268
Name: Outcome, dtype: int64
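The target is imbalanced (500 non-diabetic vs 268 diabetic), which is worth keeping in mind when reading the class-1 recall below. One common mitigation is class weighting (a sketch, not what this notebook does):
# Penalize mistakes on the minority class more heavily
clf_balanced=LogisticRegression(class_weight='balanced')
clf_balanced.fit(X_train,y_train)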
Model Evaluation:
from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)
recall=recall_score(y_test,y_pred)
print("\nrecall",recall)
precision=precision_score(y_test,y_pred)
print("\nprecision",precision)
f1score=f1_score(y_test,y_pred)
print("\nf1score",f1score)
cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)
Output:
confusion_matrix
 [[113  17]
 [ 31  31]]

recall 0.5
precision 0.6458333333333334
f1score 0.5636363636363636

classification_report
               precision    recall  f1-score   support

           0       0.78      0.87      0.82       130
           1       0.65      0.50      0.56        62

    accuracy                           0.75       192
   macro avg       0.72      0.68      0.69       192
weighted avg       0.74      0.75      0.74       192
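accuracy_score was imported above but never called; for completeness, it can be computed directly and should match the 0.75 accuracy in the report:
# Overall accuracy on the test split
accuracy=accuracy_score(y_test,y_pred)
print("\naccuracy",accuracy)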
Multiclass Classification
Problem statement:-Based on features like sepal and petal length and width, predict the species of an iris flower.
#Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df=pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris.csv')
df
Output:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
Note: Let’s skip data preprocessing & feature engineering and go straight to model building.
Model Creation:
X=df.iloc[:,:-1]
y=df.Name
## Training and testing data
from sklearn.model_selection import train_test_split
X_train1,X_test1,y_train,y_test=train_test_split(X,y,random_state=25)
#Building model
from sklearn.linear_model import LogisticRegression
lr_multi=LogisticRegression(multi_class='ovr')
lr_multi.fit(X_train1,y_train)
y_pred=lr_multi.predict(X_test1)
Note: “ovr” stands for “One-vs-Rest” (also known as “One-vs-All”). This is a strategy used for multi-class classification.
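With OvR, one binary classifier is trained per class and the class with the highest score wins. This is visible in the probabilities: predict_proba returns one column per class (a sketch):
# One probability column per class; the argmax of each row gives the predicted species
proba=lr_multi.predict_proba(X_test1)
print(lr_multi.classes_)  # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
print(proba[:3])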
Model Evaluation:
from sklearn.metrics import confusion_matrix,accuracy_score,recall_score, precision_score,classification_report,f1_score
cm=confusion_matrix(y_test,y_pred)
print("\nconfusion_matrix\n",cm)
recall=recall_score(y_test,y_pred,average='weighted')
print("\nrecall",recall)
precision=precision_score(y_test,y_pred,average='weighted')
print("\nprecision",precision)
f1score=f1_score(y_test,y_pred,average='weighted')
print("\nf1score",f1score)
cr=classification_report(y_test,y_pred)
print("\nclassification_report\n",cr)
Output:
confusion_matrix
 [[113  17]
 [ 31  31]]

recall 0.75
precision 0.7398726851851851
f1score 0.7404777704047777

classification_report
               precision    recall  f1-score   support

           0       0.78      0.87      0.82       130
           1       0.65      0.50      0.56        62

    accuracy                           0.75       192
   macro avg       0.72      0.68      0.69       192
weighted avg       0.74      0.75      0.74       192