Data Preprocessing:
Data preprocessing involves cleaning and transforming raw data to make it suitable for analysis.
This can include tasks such as handling missing values, scaling numerical features, encoding categorical variables, and so on.
The goal of data preprocessing is to prepare the data for modeling by ensuring it is in the correct format, free of errors and inconsistencies, and ready for further analysis.
Basic Techniques:
- Removing unwanted columns
- Removing duplicated values
- Imputing missing values
- Encoding categorical variables
- Removing outliers
- Data normalization/scaling
- Transformation
- Balancing the data
When to do Preprocessing:
It is generally recommended to split the data into training and testing sets before performing data preprocessing. Here is the recommended order of operations:
1. Split the original dataset into training and testing sets. This should be done before any preprocessing steps.
2. Perform data preprocessing steps, such as handling missing values, encoding categorical variables, feature scaling, or any other necessary transformations, on the training set only. Remember to keep track of the preprocessing steps applied.
3. Apply the same preprocessing steps that were performed on the training set to the testing set. This ensures that the testing set is processed in the same way as the training set, allowing for a fair evaluation of the model’s performance.
The main reason for this order is to avoid data leakage from the testing set into the training process. By fitting preprocessing steps on the training set only and merely applying them to the testing set, you ensure that the model is trained and evaluated on independent, unbiased data.
It’s important to note that some preprocessing steps, such as calculating statistics for imputation or feature scaling, may require information from the entire dataset. In such cases, it is still recommended to calculate those statistics using only the training set and then apply them to both the training and testing sets.
Overall, the correct order is to split the data first, then perform preprocessing on the training set, and finally apply the same preprocessing steps to the testing set.
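Below is a minimal sketch of this workflow, assuming a pandas DataFrame named df with numeric feature columns and a target column named 'target' (both names are hypothetical and used only for illustration):
# Minimal sketch: split first, then fit preprocessing on the training set only
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
X = df.drop(columns=['target'])   # hypothetical feature/target names
y = df['target']
# 1. Split before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Fit preprocessing objects on the training set only
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))  # statistics learned from train
# 3. Apply the same fitted objects to the testing set (transform only, no fit)
X_test_prep = scaler.transform(imputer.transform(X_test))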
Removing Unwanted Column:
- Sometimes we need to remove columns that carry no predictive information, such as an ID or serial-number column, using .drop()
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Gender': ['Female', 'Male', 'Male', 'Male'],
'Salary': [70000, 80000, 90000, 100000],
'Unwanted_Column': [1, 2, 3, 4] # This is the column we want to remove
}
df = pd.DataFrame(data)
# Remove a single column
df_cleaned = df.drop(columns=['Unwanted_Column'])
# Alternatively, remove multiple columns (e.g., 'Gender' and 'Unwanted_Column')
# df_cleaned = df.drop(columns=['Gender', 'Unwanted_Column'])
Removing Duplicated Value:
Use the .duplicated() method to flag duplicated rows in your dataset, and .duplicated().sum() to count them.
Once you have identified the duplicates, remove them using the .drop_duplicates() method.
This will keep only the first occurrence of each unique value and eliminate subsequent duplicates.
Identify Duplicate Rows:
#Creating dummy dataframe
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'David'],
'Age': [25, 30, 35, 25, 30, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Houston']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
#Identify Duplicate Rows:
duplicates = df.duplicated()
print("\nDuplicate Rows (Boolean Series):")
print(duplicates)
duplicate_count = df.duplicated().sum()
print("\nNumber of duplicate rows:")
print(duplicate_count)
Output:
Original DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    Alice   25     New York
4      Bob   30  Los Angeles
5    David   40      Houston

Duplicate Rows (Boolean Series):
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

Number of duplicate rows:
2
Display Only Duplicate Rows:
duplicate_rows = df[df.duplicated()]
print("\nDuplicate Rows (DataFrame):")
print(duplicate_rows)
Output:
Duplicate Rows (DataFrame):
    Name  Age         City
3  Alice   25     New York
4    Bob   30  Los Angeles
Remove Duplicate Rows:
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
Output:
DataFrame after removing duplicates:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
5    David   40      Houston
Imputing Missing Values:
- A null value or missing value in the context of data analysis refers to an absence of data in a dataset.
- This means that a specific entry or observation for a certain variable (column) is not available or hasn’t been recorded.
Checking the missing value:
- Import data set
import numpy as np
import pandas as pd
df_dm = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /Test_loan_approved.csv')
- Check for missing (null) values in the DataFrame and get a summary of them.
# Checking the missing value
summary = df_dm.isnull().sum()
print(summary)
# Checking the missing value of single column
missing_value_single = df_dm.Gender.isnull().sum()
print("\nTotal number of missing value in gender: ",missing_value_single)
Output:
Loan_ID                    0
Gender                    13
Married                    3
Education                  0
Self_Employed             32
LoanAmount                22
Loan_Amount_Term          14
Credit_History            50
Loan_Status (Approved)     0
dtype: int64

Total number of missing value in gender:  13
- Getting the indexes of rows where values are missing in a specific column using np.where
missing_index1 = np.where(df_dm.Gender.isnull()) #or
print(missing_index1,)
missing_index2 =np.where(df_dm.Gender.isnull()==True)
print("\n",missing_index2)
Output: (array([ 23, 126, 171, 188, 314, 334, 460, 467, 477, 507, 576, 588, 592]),) (array([ 23, 126, 171, 188, 314, 334, 460, 467, 477, 507, 576, 588, 592]),)
- Getting the actual rows from those indexes
df_dm.loc[missing_index2]
Output:
     Loan_ID Gender Married     Education Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History Loan_Status (Approved)
23  LP001050    NaN     Yes  Not Graduate            No       112.0             360.0             0.0                      N
126 LP001448    NaN     Yes      Graduate            No       370.0             360.0             1.0                      Y
171 LP001585    NaN     Yes      Graduate            No       700.0             300.0             1.0                      Y
188 LP001644    NaN     Yes      Graduate           Yes       168.0             360.0             1.0                      Y
314 LP002024    NaN     Yes      Graduate            No       159.0             360.0             1.0                      N
334 LP002103    NaN     Yes      Graduate           Yes       182.0             180.0             1.0                      Y
460 LP002478    NaN     Yes      Graduate           Yes       160.0             360.0             NaN                      Y
467 LP002501    NaN     Yes      Graduate            No       110.0             360.0             1.0                      Y
477 LP002530    NaN     Yes      Graduate            No       132.0             360.0             0.0                      N
507 LP002625    NaN      No      Graduate            No        96.0             360.0             1.0                      N
576 LP002872    NaN     Yes      Graduate            No       136.0             360.0             0.0                      N
588 LP002925    NaN      No      Graduate            No        94.0             360.0             1.0                      Y
592 LP002933    NaN      No      Graduate           Yes       292.0             360.0             1.0                      Y
- Getting the rows with missing values in a single column using the .loc[] function
df_dm.loc[df_dm['Gender'].isnull()]
Output:
     Loan_ID Gender Married     Education Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History Loan_Status (Approved)
23  LP001050    NaN     Yes  Not Graduate            No       112.0             360.0             0.0                      N
126 LP001448    NaN     Yes      Graduate            No       370.0             360.0             1.0                      Y
171 LP001585    NaN     Yes      Graduate            No       700.0             300.0             1.0                      Y
188 LP001644    NaN     Yes      Graduate           Yes       168.0             360.0             1.0                      Y
314 LP002024    NaN     Yes      Graduate            No       159.0             360.0             1.0                      N
334 LP002103    NaN     Yes      Graduate           Yes       182.0             180.0             1.0                      Y
460 LP002478    NaN     Yes      Graduate           Yes       160.0             360.0             NaN                      Y
467 LP002501    NaN     Yes      Graduate            No       110.0             360.0             1.0                      Y
477 LP002530    NaN     Yes      Graduate            No       132.0             360.0             0.0                      N
507 LP002625    NaN      No      Graduate            No        96.0             360.0             1.0                      N
576 LP002872    NaN     Yes      Graduate            No       136.0             360.0             0.0                      N
588 LP002925    NaN      No      Graduate            No        94.0             360.0             1.0                      Y
592 LP002933    NaN      No      Graduate           Yes       292.0             360.0             1.0                      Y
Handling Missing Value:
- Imputation
- Dropping
Imputation:
- Use of fillna() with specific value
# Option 01: use inplace to modify the original DataFrame
df_dm['Gender'].fillna(value='Male',inplace=True)
# Option 02: assign back to the same column instead of using inplace
df_dm['Gender']=df_dm['Gender'].fillna(value='Male')
# Option 03: the value can be passed positionally without value=
df_dm['Gender'].fillna('Male',inplace=True)
Note: all missing values in the “Gender” column will be filled with “Male”. In recent pandas versions, calling fillna with inplace=True on a single column (Options 01 and 03) may raise a FutureWarning about chained assignment, so assigning back to the column (Option 02) is the safer pattern.
- Find a suitable value to fill in the missing places (NaN):
- For numerical data with a normal distribution, the mean can be used via .mean()
- For numerical data that is skewed (not normally distributed), the median can be used via .median()
- For categorical data, the mode can be used via .mode()
- The .ffill() method, also known as “forward fill,” fills missing values by propagating the last valid (non-missing) observation forward.
- The .bfill() method, also known as “backward fill,” fills missing values by propagating the next valid (non-missing) observation backward.
- If the dataset is large, .dropna() can be used to simply drop the few rows containing missing values
- The .loc[] function can be used to impute missing values in a specific column
# Fill missing value using mean
# Mean = the average value (the sum of all values divided by number of values).
#Option01 & best practice
x = df_dm['LoanAmount'].mean()
df_dm['LoanAmount'].fillna(x,inplace=True)
#Option02
#df_dm['LoanAmount'].fillna(df_dm['LoanAmount'].mean(),inplace=True)
#Option03
df_dm['LoanAmount']=df_dm['LoanAmount'].fillna(df_dm['LoanAmount'].mean())
# Fill missing value using median
# Median = the value in the middle, after you have sorted all values ascending.
#Option01 & Best Practice
x = df_dm['CoapplicantIncome'].median()
df_dm['CoapplicantIncome'].fillna(x,inplace=True)
#Option02
#df_dm['CoapplicantIncome'].fillna(df_dm['CoapplicantIncome'].median(),inplace=True)
#Option03
df_dm['CoapplicantIncome']=df_dm['CoapplicantIncome'].fillna(df_dm['CoapplicantIncome'].median())
# Fill missing value using mode
# Mode = The value that appears most frequently
#Option01 & Best Practice
x = df_dm['Gender'].mode()[0]
df_dm['Gender'].fillna(x,inplace=True)
#Option02
# In case of mode, there might be more than one mode value; [0] selects the first one
df_dm['Gender'].fillna(df_dm['Gender'].mode()[0],inplace=True)
#Option03
df_dm['Gender']=df_dm['Gender'].fillna(df_dm['Gender'].mode()[0])
import pandas as pd
# Sample DataFrame with missing values
data = {
'Date': ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04', '2024-08-05'],
'Temperature': [None, 25, None, 30, None],
'Sales': [100, None, 150, None, 200]
}
df = pd.DataFrame(data)
# Apply forward fill (ffill) first
df_filled = df.ffill()
# Apply backward fill (bfill) next to handle any remaining NaNs
df_filled = df_filled.bfill()
# Use of dropna() to remove rows containing null values
#Option01
df_dm.dropna(inplace=True)
#Option02, Creating new DataFrame removing null values
df_dm_new = df_dm.dropna()
# Using loc function to impute missing on specific column
# .isnull() selects only the rows where 'Credit_History' is null; the column label restricts the assignment to the Credit_History column
df_dm.loc[df_dm['Credit_History'].isnull(),'Credit_History']=1
Encoding Categorical Variables
Often in machine learning, we want to convert categorical variables into some type of numeric format that can be readily used by algorithms.
There are two common ways to convert categorical variables into numeric variables:
- Label Encoding: Assign an integer value to each categorical value based on alphabetical order.
For example, suppose we have a dataset with a categorical Team variable whose values are A, B, and C, and we would like to convert it into a numeric variable.
Using label encoding, we would convert each unique value in the Team column into an integer value based on alphabetical order:
In this example, we can see:
Each “A” value has been converted to 0.
Each “B” value has been converted to 1.
Each “C” value has been converted to 2.
We have successfully converted the Team column from a categorical variable into a numeric variable.
- One Hot Encoding: Create new variables that take on values 0 and 1 to represent the original categorical values.
Using one hot encoding, we would convert the Team column into new variables that contain only 0 and 1 values.
When using this approach, we create one new column for each unique value in the original categorical variable.
For example, the categorical variable Team had three unique values so we created three new columns in the dataset that all contain 0 or 1 values.
Here’s how to interpret the values in the new columns:
The value in the new Team_A column is 1 if the original value in the Team column was A. Otherwise, the value is 0.
The value in the new Team_B column is 1 if the original value in the Team column was B. Otherwise, the value is 0.
The value in the new Team_C column is 1 if the original value in the Team column was C. Otherwise, the value is 0.
We have successfully converted the Team column from a categorical variable into three numeric variables – sometimes referred to as “dummy” variables.
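As a small, self-contained illustration of both encodings applied to a hypothetical Team column like the one described above (the column name and values exist only for this example):
import pandas as pd
df_team = pd.DataFrame({'Team': ['A', 'B', 'C', 'A', 'B']})
# Label encoding: each category becomes an integer based on alphabetical order (A->0, B->1, C->2)
df_team['Team_label'] = df_team['Team'].astype('category').cat.codes
# One hot encoding: one new 0/1 column per unique value (Team_A, Team_B, Team_C)
team_dummies = pd.get_dummies(df_team['Team'], prefix='Team')
print(pd.concat([df_team, team_dummies], axis=1))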
How to choose technique:
In most scenarios, one hot encoding is the preferred way to convert a categorical variable into a numeric variable because label encoding makes it seem that there is a ranking between values.
The label-encoded data makes it seem like team C is somehow greater or larger than teams B and A since it has a higher numeric value.
This isn’t an issue if the original categorical variable actually is an ordinal variable with a natural ordering or ranking, but in many scenarios, this isn’t the case.
However, one drawback of one hot encoding is that it requires you to make as many new variables as there are unique values in the original categorical variable.
This means that if your categorical variable has 100 unique values, you’ll have to create 100 new variables when using one hot encoding.
Depending on the size of your dataset and the type of variables you’re working with, you may prefer one hot encoding or label encoding.
Python implementation for label encoding
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/04. Data Preprocessing/data.csv')
data.drop('Unnamed: 0',axis=1,inplace=True)
datacopy = data.copy()
datacopy1 = data.copy()
print(data)
print('-------------------------')
print(datacopy)
print('-------------------------')
print(datacopy1)
Output:
     Gender Married
0      Male      No
1      Male     Yes
2      Male     Yes
3      Male     Yes
4      Male      No
..      ...     ...
609  Female      No
610    Male     Yes
611    Male     Yes
612    Male     Yes
613  Female      No

[614 rows x 2 columns]
-------------------------
(the same 614 x 2 output is printed again for datacopy)
-------------------------
(the same 614 x 2 output is printed again for datacopy1)
from sklearn.preprocessing import LabelEncoder
lc=LabelEncoder()
data.Married=lc.fit_transform(data.Married)
print(data.Married)
Output:
0      0
1      1
2      1
3      1
4      0
      ..
609    0
610    1
611    1
612    1
613    0
Name: Married, Length: 614, dtype: int64
Python implementation for one hot encoding:
Approach 1: Using pd.get_dummies() from pandas
- This approach utilizes pandas.get_dummies() function to one-hot encode the categorical variable.
- It directly operates on the DataFrame column and returns a DataFrame with the encoded columns.
- In this case, you are dropping the original column and concatenating the encoded columns to the original DataFrame.
df1=pd.get_dummies(datacopy['Married'],prefix='Married',drop_first=True)
data=pd.concat([datacopy,df1],axis=1).drop(['Married'],axis=1)
print(data)
Output:
     Gender  Married_Yes
0      Male        False
1      Male         True
2      Male         True
3      Male         True
4      Male        False
..      ...          ...
609  Female        False
610    Male         True
611    Male         True
612    Male         True
613  Female        False

[614 rows x 2 columns]
# With one line code
datacopy1 = pd.get_dummies(datacopy1,columns=['Gender','Married'],drop_first=True)
datacopy1
Output:
     Gender_Male  Married_Yes
0           True        False
1           True         True
2           True         True
3           True         True
4           True        False
..           ...          ...
609        False        False
610         True         True
611         True         True
612         True         True
613        False        False

[614 rows x 2 columns]
Approach 2: Using OneHotEncoder from scikit-learn
This approach utilizes scikit-learn’s OneHotEncoder to encode the categorical variable.
It requires reshaping the input array to a 2D structure before applying fit_transform().
The resulting encoded data will be a numpy array.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
ohe = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2, use sparse=False instead
# Reshape the input to a 2D array-like structure
datacopy1_reshaped = np.array(datacopy1.Gender).reshape(-1, 1)
datacopy1_encoded = ohe.fit_transform(datacopy1_reshaped)
datacopy1_encoded
Output:
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       ...,
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
How to choose approaches
The choice between the two approaches depends on factors such as personal preference, ease of use, and compatibility with the rest of your code.
If we are working with pandas DataFrames and prefer a simpler and more concise solution, pd.get_dummies() can be a good option.
However, if you want more control over the encoding process or need to integrate it with other scikit-learn functionality, using OneHotEncoder may be more suitable.
Removing the Outlier:
- An outlier is a value that lies at an abnormal distance from the rest of the data points
Python implementation of finding & imputing outliers
# Importing library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
# Define dataset
student_age = [22,25,30,33,24,22,21,22,23,24,26,28,26,29,29,30,31,20,45,15]
Find outliers using z-score
# Defining function
outliers = []
def detect_outliers(data):
    threshold = 3  # 3rd standard deviation from the empirical rule
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
#Finding outlier using created function
detect_outliers(student_age)
Output: [45]
Find outliers using IQR
# sort data
student_age = sorted(student_age)
print("student_age :",student_age)
# calculating q1 & q3
q1,q3 = np.percentile(student_age,[25,75])
print("q1 :",q1,"q3 :",q3)
# calculating iqr
iqr = q3 - q1
print("iqr :",iqr)
# Finding lower bound(min value) and upper bound(max value)
lower_bound = q1 - (1.5*iqr)
upper_bound = q3 + (1.5*iqr)
print("lower_bound :",lower_bound,"upper_bound :",upper_bound)
# Finding outlier
outliers = []
for i in student_age:
if i<lower_bound or i>upper_bound:
outliers.append(i)
print("outliers :",outliers)
Output:
student_age : [15, 20, 21, 22, 22, 22, 23, 24, 24, 25, 26, 26, 28, 29, 29, 30, 30, 31, 33, 45]
q1 : 22.0 q3 : 29.25
iqr : 7.25
lower_bound : 11.125 upper_bound : 40.125
outliers : [45]
Imputing outlier
student_age1=pd.Series(student_age)
student_age1.loc[student_age1>upper_bound] = np.mean(student_age1)
print(student_age1)
Output:
0     15.00
1     20.00
2     21.00
3     22.00
4     22.00
5     22.00
6     23.00
7     24.00
8     24.00
9     25.00
10    26.00
11    26.00
12    28.00
13    29.00
14    29.00
15    30.00
16    30.00
17    31.00
18    33.00
19    26.25
dtype: float64
Find & see outlier visually using boxplot
import seaborn as sns
#Before outlier removal
plt.subplot(2,2,1)
sns.boxplot(student_age,orient="h")
plt.title('Before outlier removal')
#After outlier removal
plt.subplot(2,2,2)
sns.boxplot(student_age1,orient="h")
plt.title('After outlier removal')
Feature Scaling
It is a technique to standardize the independent features in data in a fixed range or scale. Thus the name Feature Scaling.
Feature Scaling is one of the last steps in the whole life cycle of Feature Engineering.
Once we are done with all the other steps of feature engineering, like encoding variables, handling missing values, etc., we scale all the variables.
Depending on the technique, the values are rescaled into a small, comparable range (for example, roughly -1 to +1 or 0 to 1).
Why Feature Scaling?
- Real-life datasets have many features with widely different ranges of values; consider, for example, a house price prediction dataset.
- It will have features such as the number of bedrooms, the square-foot area of the house, etc.
- As you can guess, the number of bedrooms will vary between 1 and 5, but the square-foot area will range from 500 to 2000.
- This is a huge difference in the range of both features.
- Without scaling, features with larger units or numerical ranges might dominate the model’s learning process, leading to biased predictions.
- Some machine learning algorithms, especially those that rely on distance calculations or gradients, are sensitive to the scale of the features.
Which machine learning algorithm needs scaling?
- Gradient descent-based and distance-based algorithms require feature scaling, while tree-based algorithms do not.
Types of Feature Scaling:
Standardization:
- Standard Scaler
Normalization:
- Min Max Scaling
- Mean Normalization
- Max Absolute Scaling
- Robust Scaling etc.
01. Standardization:
Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation.
This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Formula of Standardization: z = (x – μ )/σ, where x = values ,μ = mean ,σ = Standard Deviation
Scaling technique: StandardScaler
fit_transform() should be applied to the training set and transform() to the test set to avoid data leakage (see the sketch after the code below).
Python Implementation for StandardScaler:
# importing sklearn StandardScaler class which is for Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() # creating an instance of the class object
X_new = sc.fit_transform(X)
# plotting the scatterplot of before and after Standardization
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Standardization", fontsize=18)
sns.scatterplot(data = X, color="blue")
#sns.histplot(data=X ,color="red",kde=True)
plt.subplot(1,2,2)
plt.title("Scatterplot After Standardization", fontsize=18)
sns.scatterplot(data = X_new, color="blue")
#sns.histplot(data=X_new ,color="red",kde=True)
plt.tight_layout()
plt.show()
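The snippet above scales the whole feature matrix X for visualization. For model training, a minimal sketch of the leakage-safe pattern mentioned in the note above would look like this (assuming X_train and X_test already exist from an earlier train/test split):
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # mean and standard deviation learned from the training set only
X_test_scaled = sc.transform(X_test)        # the training statistics are reused on the test set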
02. Normalization
- Normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.
Min Max Scaling
Min-max normalization is one of the most common ways to normalize data.
For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
Min-max normalization performs best when the minimum and maximum values are distinct and known.
Formula of Min Max Scaling: X_sc = (X − X_min) / (X_max − X_min)
Python Implementation for MinMaxScaler
# importing sklearn MinMaxScaler class, which is used for min-max normalization
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler() # creating an instance of the class object
X_new = mm.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Min Max Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Min Max Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Min Max Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
Max Absolute Scaling
Scale each feature by its maximum absolute value.
This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0.
It does not shift/center the data, and thus does not destroy any sparsity.
This scaler can also be applied to sparse CSR or CSC matrices.
Max Absolute scaling will perform a lot better in sparse data or when most of the values are 0.
Formula of Max Absolute Scaling: X_sc = X / |X_max|
Python Implementation for MaxAbsScaler
# importing sklearn MaxAbsScaler class, which is used for max absolute scaling
from sklearn.preprocessing import MaxAbsScaler
ma = MaxAbsScaler() # creating an instance of the class object
X_new = ma.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Max Absolute Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
Robust Scaling
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range).
The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
Robust Scaling is best for data that has outliers
Formula of Robust Scaling: X_new = (X − X_median) / IQR
Python Implementation for RobustScaler
# importing sklearn RobustScaler class, which is used for robust scaling
from sklearn.preprocessing import RobustScaler
rs = RobustScaler() # creating an instance of the class object
X_new = rs.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Robust Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Robust Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Robust Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
Mean Normalization
It is very similar to min-max scaling, except that the mean is subtracted from the data before dividing by the range (max minus min).
Scikit-learn does not have a dedicated class for mean normalization, but it can easily be done with numpy (see the sketch after the code below).
Formula of Mean Normalization: X' = (X − μ) / (max(X) − min(X))
Python Implementation for Normalizing
# sklearn's normalize() function scales each sample to unit norm; note that this is not the same as the mean-normalization formula above
from sklearn.preprocessing import normalize
X_new = normalize(X, axis=1)  # normalize is a function, not a class
# plotting the scatterplot of before and after Normalize Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Normalize Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Normalize Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
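Because sklearn's normalize() rescales each sample to unit norm rather than applying the mean-normalization formula given above, here is a minimal numpy sketch of the formula itself (assuming X is a 2-D numeric array whose columns all have a non-zero range):
import numpy as np
X_arr = np.asarray(X, dtype=float)
# Mean normalization per column: X' = (X - mean(X)) / (max(X) - min(X))
X_mean_norm = (X_arr - X_arr.mean(axis=0)) / (X_arr.max(axis=0) - X_arr.min(axis=0))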
How to choose scaling technique
- If you are not sure which scaler to use, apply several and check their effect on the models.
- If you do not know much about the data, use StandardScaler; it works well most of the time.
- If the minimum and maximum values of the feature are known (as with pixel values in CNNs), use MinMaxScaler.
- If most of the values in the feature column are 0 (a sparse matrix), use MaxAbsScaler.
- If the data has outliers, use RobustScaler.
Transformations
Some Machine Learning models, like Linear and Logistic regression, assume that the variables follow a normal distribution. More likely, variables in real datasets will follow a skewed distribution.
By applying transformations to these skewed variables, we can map the skewed distribution closer to a normal one, which can improve the performance of our models.
Sklearn supports three families of transformations:
Function Transformation
- Log Transformation
- Reciprocal Transformation
- Square Transformation
- Square Root Transformation
Power Transformation
- Box-Cox Transformation
- Yeo-Johnson Transformation
Quantile transformation
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
from scipy.stats import boxcox
from scipy.stats import yeojohnson
# Loading Data Sets
df= pd.read_csv("/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/04. Data Preprocessing/heart_disease_uci.csv")
Log Transformation & Python implementation:
Generally, this transformation brings the data closer to a normal distribution but does not make it exactly normal.
It cannot be applied to features that contain zero or negative values.
It is mostly applied to right-skewed data.
It converts multiplicative relationships in the data into additive (linear) ones.
# Creating a log-normal distribution since we don't have real data
rightskewed =np.random.lognormal(4,1,10000)
plt.subplot(2,2,1)
sns.histplot(rightskewed)
plt.title("Right skewed distribution")
# Log Transformation
Normal_dis_data = np.log(rightskewed)
plt.subplot(2,2,2)
sns.histplot(Normal_dis_data)
plt.title("Normal distribution")
Square Root Transformation & Python Implementation:
This transformation is defined only for positive numbers.
This transformation is weaker than Log Transformation.
This can be used for reducing the skewness of right-skewed data.
import numpy as np
import matplotlib.pyplot as plt
#make this example reproducible
np.random.seed(0)
#create beta distributed random variable with 300 values
data = np.random.beta(a=1, b=5, size=300)
#create sqrt-transformed data
data_sqrt = np.sqrt(data)
#define grid of plots
fig, axs = plt.subplots(nrows=1, ncols=2)
#create histograms
axs[0].hist(data, edgecolor='black')
axs[1].hist(data_sqrt, edgecolor='black')
#add title to each histogram
axs[0].set_title('Original Data')
axs[1].set_title('Square Root Transformed Data')
Reciprocal transformation & Python Implementation:
- In this transformation, x is replaced by its reciprocal (1/x).
- The reciprocal transformation often has only a modest effect on the shape of the distribution.
- This transformation can be only used for non-zero values.
plt.figure(figsize=(9,3))
plt.subplot(1,4,1)
sns.histplot(df["trestbps"],kde=True)
plt.title("DISTRIBUTION BEFORE",)
#Transformation
plt.subplot(1,4,3)
reciprocal_data = 1/df.trestbps
sns.histplot(reciprocal_data,kde =True)
plt.title("DISTRIBUTION AFTER ",)
Box-cox transformation & Python Implementation:
Box-Cox requires the input data to be strictly positive (not even zero is acceptable).
Mathematical formula for the Box-Cox transformation: y(λ) = (x^λ − 1)/λ when λ ≠ 0, and y(λ) = log(x) when λ = 0.
Values of lambda from -5 to 5 are considered, and the best value for the data is selected.
The “best” value is the one that gives the distribution closest to normal (the least skewness).
When lambda is zero, the Box-Cox transformation reduces to the log transformation.
plt.figure(figsize=(15,4))
plt.subplot(1,4,1)
sns.histplot(df["trestbps"],kde=True)
plt.title("DISTRIBUTION BEFORE",)
# Transformation
plt.subplot(1,4,3)
bcx_target, lamda = boxcox(df["trestbps"])
sns.histplot(bcx_target,kde=True)
plt.title("DISTRIBUTION AFTER BOX-COX ",)
YEO-JOHNSON Transformation & Python Implementation:
This is a more recent transformation technique, very similar to the Box-Cox transformation, but it does not require the values to be strictly positive.
This transformation also has the ability to make the distribution more symmetric.
For features that have zero or negative values, Yeo-Johnson can be used in place of Box-Cox.
plt.figure(figsize=(15,4))
plt.subplot(1,4,1)
sns.distplot(df["chol"])
plt.title("DISTRIBUTION BEFORE",)
# Transformation
plt.subplot(1,4,3)
yf_target, lam = yeojohnson(df["chol"])
sns.histplot(yf_target, kde=True)
plt.title("DISTRIBUTION AFTER YEO JHONSON ",)
Balancing the Data:
Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you’ll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).
Some Machine Learning algorithms are more sensitive toward imbalanced data, such as Logistic Regression and Support Vector Machine. However, some algorithms tackle this issue themselves, such as Random Forest and XGBoost.
Two main causes of an imbalanced dataset:
- Data sampling (biased sampling, measurement errors, etc.)
- Properties of the domain
Sampling techniques are used to deal with imbalanced data.
Sampling: The idea behind sampling is to create new samples or choose some records from the whole data set.
There are two sampling techniques available to handle the imbalanced data:
- Under Sampling
- Over Sampling
Python Implementation for balancing the data with different technique:
# import necessary libraries
import pandas as pd
import numpy as np
import imblearn
import matplotlib.pyplot as plt
import seaborn as sns
# Creating imbalanced data set
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /05_KNN/KNN Class/diabetes.csv')
# Creating Independent & dependent variable
X = data.iloc[:,0:8]
y = data.iloc[:,-1]
y.value_counts() # To see imbalanced data set
print((y.value_counts()/y.value_counts().sum())*100)
Output:
Outcome
0    65.8
1    34.2
Name: count, dtype: float64
# Installing library
!pip install imblearn
Output: Collecting imblearn Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes) Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.12.3) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.26.4) Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.13.1) Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.3.2) Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.4.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.5.0) Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB) Installing collected packages: imblearn Successfully installed imblearn-0.0
Undersampling:
- It balances a skewed class distribution by removing examples from the majority class.
- The basic undersampling technique removes examples at random from the majority class; in imblearn this is the RandomUnderSampler.
- Cons: there is a risk of losing useful or important information that could define the decision boundary between the classes.
# Use RandomUnderSampler to balance the data
from imblearn.under_sampling import RandomUnderSampler
model = RandomUnderSampler() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 684, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (1368, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    684
1    684
Name: count, dtype: int64
Near Miss Undersampling:
This technique selects the data points based on the distance between majority and minority class examples.
It has three versions, each of which considers different neighbors from the majority class.
Version 1 keeps examples with a minimum average distance to the nearest records of the minority class.
Version 2 selects rows with a minimum average distance to the furthest records of the minority class.
Version 3 keeps a given number of majority class examples for each of the closest records in the minority class.
Among these, version 3 is often preferred, since it keeps the majority class examples that lie on the decision boundary.
# Use Near Miss Undersampling to balance the data
from imblearn.under_sampling import NearMiss
model = NearMiss() ## object creation; pass version=1, 2, or 3 to choose the variant (default is 1)
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 684, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (1368, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    684
1    684
Name: count, dtype: int64
Condensed Nearest Neighbor (CNN) Undersampling:
This technique seeks a minimal subset of the samples that is still sufficient to classify the remaining data correctly.
The selected examples are kept in a “store” that consists of all minority class examples plus the majority class examples that are misclassified by the current store.
# Use Condensed Nearest Neighbor (CNN) to balance the data
from imblearn.under_sampling import CondensedNearestNeighbour
model = CondensedNearestNeighbour() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 684, 0: 232})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (916, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    684
0    232
Name: count, dtype: int64
Tomek Links Undersampling:
- This technique is a modification of CNN in which, rather than selecting redundant examples at random, pairs of nearest neighbors from opposite classes (Tomek links) are identified and the majority class example of each pair is removed.
- Because only these borderline examples are removed, it barely changes the class balance.
# Use Tomek Links to balance the data
from imblearn.under_sampling import TomekLinks
model = TomekLinks() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 1316, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2000, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Neighborhood Cleaning Undersampling:
- This approach is a combination of CNN and ENN techniques.
- Initially, it selects all the minority class examples.
- Then ENN identifies the ambiguous samples to remove from the majority class.
- Then, as in CNN, majority class examples that are misclassified against the store are deleted, provided the majority class has more than half as many examples as the minority class.
# Use Neighborhood Cleaning rule to balance the data
from imblearn.under_sampling import NeighbourhoodCleaningRule
model = NeighbourhoodCleaningRule() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 1219, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (1903, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    1219
1     684
Name: count, dtype: int64
Oversampling:
Oversampling focuses on increasing minority class samples.
The simplest approach duplicates existing minority class examples (random oversampling).
Although it balances the data, it does not provide additional information to the classification model.
Therefore synthesizing new examples using an appropriate technique is necessary. Here SMOTE comes into the picture.
SMOTE:
Synthetic Minority Over-sampling Technique (SMOTE) is a technique that generates new observations by interpolating between observations in the original dataset.
Interpolation is done with the help of the KNN algorithm.
SMOTE picks an instance randomly from the minority class. Then it finds its k nearest neighbors from the minority class itself. One of those neighbors is chosen at random, a line is drawn between the two instances, and a new synthetic example is generated as a convex combination of them.
Do not apply SMOTE to the testing data: when the model is in production, the incoming data may or may not be balanced, so the model should be evaluated on the original distribution.
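As a rough, purely illustrative numpy sketch of the interpolation step described above (this is not the actual imblearn implementation, and the sample values are made up); the real SMOTE usage with imblearn follows right after:
import numpy as np
x_i = np.array([2.0, 3.0])          # a minority class instance
x_neighbor = np.array([4.0, 5.0])   # one of its k nearest minority class neighbors
lam = np.random.rand()              # random weight in [0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)  # convex combination of the two instances
print(x_synthetic)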
# Use SMOTE to balance the data
from imblearn.over_sampling import SMOTE
model = SMOTE() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1316, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2632, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1316
0    1316
Name: count, dtype: int64
Borderline-SMOTE:
- This SMOTE extension oversamples only the minority class instances that are misclassified by a k-nearest neighbors (KNN) classifier.
- The rationale is that borderline examples are more likely to be misclassified, so synthesizing new samples near them is most useful.
# Use BorderlineSMOTE to balance the data
from imblearn.over_sampling import BorderlineSMOTE
model = BorderlineSMOTE() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1316, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2632, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1316
0    1316
Name: count, dtype: int64
Borderline-SMOTE SVM:
- This method uses a Support Vector Machine (SVM) instead of KNN to identify the misclassified (borderline) instances.
# Use Borderline-SMOTE SVM to balance the data
from imblearn.over_sampling import SVMSMOTE
model = SVMSMOTE() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1316, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2632, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1316
0    1316
Name: count, dtype: int64
Adaptive Synthetic Sampling (ADASYN):
This approach works according to the density of the minority class instances: the number of new samples generated in a region is inversely proportional to the density of minority class samples there.
It generates more samples in regions of the feature space where the density of minority class examples is low or zero, and fewer samples in high-density regions.
# Use ADASYN to balance the data
from imblearn.over_sampling import ADASYN
model = ADASYN() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1344, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2660, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1344
0    1316
Name: count, dtype: int64
What is difference between fit , transform & fit_transform in machine learning?
In machine learning, fit, transform, and fit_transform are methods used in the context of feature scaling, data preprocessing, and model training. Here’s what each of them does:
fit:
- In the context of preprocessing, the fit method is used to compute the necessary parameters needed to perform some transformation on the data.
- For instance, if you’re using a feature scaling technique like standardization (subtracting the mean and dividing by the standard deviation), you use the fit method to calculate the mean and standard deviation from your data.
- These parameters are then used for transforming the data.
#Example:
X_train = [[23,456,12,23,4,56,78,987,45,12,22,32,12]]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # Computes mean and standard deviation from X_train
Output: StandardScaler()
transform:
Once you have computed the parameters using the fit method, you can use the transform method to apply the transformation to a particular dataset.
This is useful when you have a separate dataset (e.g., test data) that you want to transform using the same parameters computed from the training data.
#Example:
X_test = [[12,23,45,65,76,87,23,23,23,4,56,45,12]]
X_test_scaled = scaler.transform(X_test) # Applies the same transformation to X_test using mean and standard deviation computed from X_train
X_test_scaled
Output: array([[ -11., -433., 33., 42., 72., 31., -55., -964., -22., -8., 34., 13., 0.]])
fit_transform:
- This is a shorthand method that combines the fit and transform steps.
- It computes the parameters and applies the transformation in one step.
- It is often more efficient than calling fit and then transform separately, especially for large datasets, as it can sometimes optimize the computation.
#Example:
X_train_scaled = scaler.fit_transform(X_train) # Computes mean and standard deviation from X_train and applies the transformation in one step
In summary:
- Use fit to compute parameters from the training data.
- Use transform to apply the computed parameters to any dataset.
- Use fit_transform to perform both steps in one go, typically on the training data for efficiency.