Data Preprocessing:
Data preprocessing involves cleaning and transforming raw data to make it suitable for analysis.
This can include tasks such as handling missing values, scaling numerical features, encoding categorical variables, etc.
The goal of data preprocessing is to prepare the data for modeling by ensuring it is in the correct format, free of errors and inconsistencies, and ready for further analysis.
Basic Techniques:
- Removing unwanted columns
- Removing duplicate values
- Imputing missing values
- Encoding categorical variables
- Removing outliers
- Data normalization/scaling
- Transformation
- Balancing the data
When to do Preprocessing:
It is generally recommended to perform data preprocessing after splitting the data into training and testing sets. Here is the recommended order of operations:
1. Split the original dataset into training and testing sets. This should be done before any preprocessing steps.
2. Perform data preprocessing steps, such as handling missing values, encoding categorical variables, and feature scaling, on the training set only. Keep track of the preprocessing steps (and any fitted parameters) applied.
3. Apply the same preprocessing steps, with the parameters learned from the training set, to the testing set. This ensures that the testing set is processed in the same way as the training set, allowing a fair evaluation of the model’s performance.
The main reason for this order is to avoid any data leakage from the testing set into the training process. By fitting the preprocessing steps on the training set only, and merely applying them to the testing set, you ensure that the model is trained and evaluated on independent, unbiased data.
Note that some preprocessing steps, such as calculating statistics for imputation or feature scaling, might seem to need information from the entire dataset. Even in such cases, those statistics should be calculated using only the training set and then applied to both the training and testing sets.
Overall, the correct order is to split the data first, then perform preprocessing on the training set, and finally apply the same preprocessing steps to the testing set.
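As a minimal sketch of this workflow (using scikit-learn and a small, made-up numeric dataset purely for illustration):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Hypothetical numeric dataset, for illustration only
X = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50],
                  'Salary': [70000, 80000, 90000, 100000, 110000, 120000]})
y = pd.Series([0, 1, 0, 1, 0, 1])
# Step 1: split first, before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Step 2: fit the preprocessing step (here, feature scaling) on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Step 3: apply the same fitted transformation to the testing set (no refitting)
X_test_scaled = scaler.transform(X_test)
The same pattern applies to other fitted preprocessing steps, such as imputers and encoders: fit on the training set, then transform both sets.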
Removing Unwanted Column:
- Sometimes we need to remove columns that carry no useful information, such as an id or serial-number column, using .drop()
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Gender': ['Female', 'Male', 'Male', 'Male'],
'Salary': [70000, 80000, 90000, 100000],
'Unwanted_Column': [1, 2, 3, 4] # This is the column we want to remove
}
df = pd.DataFrame(data)
# Remove a single column
df_cleaned = df.drop(columns=['Unwanted_Column'])
# Alternatively, remove multiple columns (e.g., 'Gender' and 'Unwanted_Column')
# df_cleaned = df.drop(columns=['Gender', 'Unwanted_Column'])
Removing Duplicate Values:
Use the .duplicated() method to flag duplicated rows in your dataset and .duplicated().sum() to count them.
Once you have identified the duplicates, remove them using the .drop_duplicates() method.
This keeps only the first occurrence of each unique row and eliminates subsequent duplicates.
Identify Duplicate Rows:
#Creating dummy dataframe
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'David'],
'Age': [25, 30, 35, 25, 30, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Houston']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
#Identify Duplicate Rows:
duplicates = df.duplicated()
print("\nDuplicate Rows (Boolean Series):")
print(duplicates)
duplicate_count = df.duplicated().sum()
print("\nNumber of duplicate rows:")
print(duplicate_count)
Output:
Original DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    Alice   25     New York
4      Bob   30  Los Angeles
5    David   40      Houston

Duplicate Rows (Boolean Series):
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

Number of duplicate rows:
2
Display Only Duplicate Rows:
duplicate_rows = df[df.duplicated()]
print("\nDuplicate Rows (DataFrame):")
print(duplicate_rows)
Output:
Duplicate Rows (DataFrame):
    Name  Age         City
3  Alice   25     New York
4    Bob   30  Los Angeles
Remove Duplicate Rows:
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
Output:
DataFrame after removing duplicates:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
5    David   40      Houston
Imputing Missing Values:
- A null value or missing value in the context of data analysis refers to an absence of data in a dataset.
- This means that a specific entry or observation for a certain variable (column) is not available or hasn’t been recorded.
Checking for missing values:
- Import the dataset
import numpy as np
import pandas as pd
df_dm = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /Test_loan_approved.csv')
- Check for missing (null) values in the DataFrame and get a summary of them.
# Checking the missing value
summary = df_dm.isnull().sum()
print(summary)
# Checking the missing value of single column
missing_value_single = df_dm.Gender.isnull().sum()
print("\nTotal number of missing value in gender: ",missing_value_single)
Output:
Loan_ID                    0
Gender                    13
Married                    3
Education                  0
Self_Employed             32
LoanAmount                22
Loan_Amount_Term          14
Credit_History            50
Loan_Status (Approved)     0
dtype: int64

Total number of missing value in gender:  13
- Getting the indexes of rows where values are missing in a specific column, using np.where
missing_index1 = np.where(df_dm.Gender.isnull()) # or
print(missing_index1)
missing_index2 = np.where(df_dm.Gender.isnull()==True)
print("\n",missing_index2)
Output: (array([ 23, 126, 171, 188, 314, 334, 460, 467, 477, 507, 576, 588, 592]),) (array([ 23, 126, 171, 188, 314, 334, 460, 467, 477, 507, 576, 588, 592]),)
- Getting the actual rows from those indexes
df_dm.loc[missing_index2]
Output:
     Loan_ID   Gender Married     Education Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History Loan_Status (Approved)
23   LP001050     NaN     Yes  Not Graduate            No       112.0             360.0             0.0                      N
126  LP001448     NaN     Yes      Graduate            No       370.0             360.0             1.0                      Y
171  LP001585     NaN     Yes      Graduate            No       700.0             300.0             1.0                      Y
188  LP001644     NaN     Yes      Graduate           Yes       168.0             360.0             1.0                      Y
314  LP002024     NaN     Yes      Graduate            No       159.0             360.0             1.0                      N
334  LP002103     NaN     Yes      Graduate           Yes       182.0             180.0             1.0                      Y
460  LP002478     NaN     Yes      Graduate           Yes       160.0             360.0             NaN                      Y
467  LP002501     NaN     Yes      Graduate            No       110.0             360.0             1.0                      Y
477  LP002530     NaN     Yes      Graduate            No       132.0             360.0             0.0                      N
507  LP002625     NaN      No      Graduate            No        96.0             360.0             1.0                      N
576  LP002872     NaN     Yes      Graduate            No       136.0             360.0             0.0                      N
588  LP002925     NaN      No      Graduate            No        94.0             360.0             1.0                      Y
592  LP002933     NaN      No      Graduate           Yes       292.0             360.0             1.0                      Y
- Getting the rows with missing values using the .loc[] function on a single column
df_dm.loc[df_dm['Gender'].isnull()]
Output: the same 13 rows as above (all rows where Gender is NaN).
Handling Missing Values:
- Imputation
- Dropping
Imputation:
- Use of fillna() with a specific value
# Option 01, use inplace to change the original DataFrame
# (note: in recent pandas versions, calling fillna with inplace=True on a selected column may raise a chained-assignment warning, so Option 02 is usually preferred)
df_dm['Gender'].fillna(value='Male',inplace=True)
# Option 02, assign back to the same column instead of using inplace
df_dm['Gender']=df_dm['Gender'].fillna(value='Male')
# Option 03, the fill value can be passed positionally without the value= keyword
df_dm['Gender'].fillna('Male',inplace=True)
Note: all missing values in the “Gender” column will be filled with “Male”
- Find a specific value to fill in the missing places (NaN)
- For numerical data that follows a normal distribution, the mean can be used via .mean()
- For numerical data that is skewed (not normally distributed), the median can be used via .median()
- For categorical data, the mode can be used via .mode()
- The .ffill() method, also known as “forward fill,” is used to fill missing values in a dataset by propagating the last valid (non-missing) observation forward to fill the gaps.
- The .bfill() method, also known as “backward fill,” is used to fill missing values in a dataset by propagating the next valid (non-missing) observation backward to fill the gaps.
- If the dataset is large, .dropna() can be used to simply remove the few rows containing missing values
- Using the .loc[] function to impute missing values in a specific column
# Fill missing value using mean
# Mean = the average value (the sum of all values divided by number of values).
#Option01 & best practice
x = df_dm['LoanAmount'].mean()
df_dm['LoanAmount'].fillna(x,inplace=True)
#Option02
#df_dm['LoanAmount'].fillna(df_dm['LoanAmount'].mean(),inplace=True)
#Option03
df_dm['LoanAmount']=df_dm['LoanAmount'].fillna(df_dm['LoanAmount'].mean())
# Fill missing value using median
# Median = the value in the middle, after you have sorted all values ascending.
#Option01 & Best Practice
x = df_dm['Loan_Amount_Term'].median()
df_dm['Loan_Amount_Term'].fillna(x,inplace=True)
#Option02
#df_dm['Loan_Amount_Term'].fillna(df_dm['Loan_Amount_Term'].median(),inplace=True)
#Option03
df_dm['Loan_Amount_Term']=df_dm['Loan_Amount_Term'].fillna(df_dm['Loan_Amount_Term'].median())
# Fill missing value using mode
# Mode = The value that appears most frequently
#Option01 & Best Practice
x = df_dm['Gender'].mode()[0]
df_dm['Gender'].fillna(x,inplace=True)
#Option02
# In case of mode, there might be more than one mode value; [0] selects the first one
df_dm['Gender'].fillna(df_dm['Gender'].mode()[0],inplace=True)
#Option03
df_dm['Gender']=df_dm['Gender'].fillna(df_dm['Gender'].mode()[0])
import pandas as pd
# Sample DataFrame with missing values
data = {
'Date': ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04', '2024-08-05'],
'Temperature': [None, 25, None, 30, None],
'Sales': [100, None, 150, None, 200]
}
df = pd.DataFrame(data)
# Apply forward fill (ffill) first
df_filled = df.ffill()
# Apply backward fill (bfill) next to handle any remaining NaNs
df_filled = df_filled.bfill()
# Use of dropna() to remove rows containing null values
#Option01
df_dm.dropna(inplace=True)
#Option02, Creating new DataFrame removing null values
df_dm_new = df_dm.dropna()
# Using the loc function to impute missing values in a specific column
# df_dm['Credit_History'].isnull() selects only the rows where Credit_History is null;
# the column label 'Credit_History' restricts the assignment to that column
df_dm.loc[df_dm['Credit_History'].isnull(),'Credit_History']=1
Encoding Categorical Variables
Often in machine learning, we want to convert categorical variables into some type of numeric format that can be readily used by algorithms.
There are two common ways to convert categorical variables into numeric variables:
- Label Encoding: Assign an integer value to each categorical value based on alphabetical order.
For example, suppose we have a dataset with two variables, where the Team column takes the categorical values A, B, and C, and we would like to convert the Team variable from a categorical variable into a numeric one.
Using label encoding, we would convert each unique value in the Team column into an integer value based on alphabetical order (a concrete code sketch of both encodings appears after the one hot encoding description below):
In this example, we can see:
Each “A” value has been converted to 0.
Each “B” value has been converted to 1.
Each “C” value has been converted to 2.
We have successfully converted the Team column from a categorical variable into a numeric variable.
- One Hot Encoding: Create new variables that take on values 0 and 1 to represent the original categorical values.
Using one hot encoding, we would convert the Team column into new variables that contain only 0 and 1 values.
When using this approach, we create one new column for each unique value in the original categorical variable.
For example, the categorical variable Team had three unique values so we created three new columns in the dataset that all contain 0 or 1 values.
Here’s how to interpret the values in the new columns:
The value in the new Team_A column is 1 if the original value in the Team column was A. Otherwise, the value is 0.
The value in the new Team_B column is 1 if the original value in the Team column was B. Otherwise, the value is 0.
The value in the new Team_C column is 1 if the original value in the Team column was C. Otherwise, the value is 0.
We have successfully converted the Team column from a categorical variable into three numeric variables – sometimes referred to as “dummy” variables.
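As a minimal, self-contained sketch of both techniques applied to the Team example described above (the Points column and the exact values are made up purely for illustration):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Illustrative data: Team is categorical with the values A, B and C
df = pd.DataFrame({'Team': ['A', 'B', 'B', 'C', 'A', 'C'],
                   'Points': [12, 15, 14, 20, 13, 19]})
# Label encoding: A -> 0, B -> 1, C -> 2 (based on alphabetical order)
le = LabelEncoder()
df['Team_label'] = le.fit_transform(df['Team'])
# One hot encoding: one 0/1 (dummy) column per unique Team value (Team_A, Team_B, Team_C)
df_onehot = pd.get_dummies(df[['Team', 'Points']], columns=['Team'])
print(df[['Team', 'Team_label']])
print(df_onehot)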
How to choose technique:
In most scenarios, one hot encoding is the preferred way to convert a categorical variable into a numeric variable because label encoding makes it seem that there is a ranking between values.
The label-encoded data makes it seem like team C is somehow greater or larger than teams B and A since it has a higher numeric value.
This isn’t an issue if the original categorical variable actually is an ordinal variable with a natural ordering or ranking, but in many scenarios, this isn’t the case.
However, one drawback of one hot encoding is that it requires you to make as many new variables as there are unique values in the original categorical variable.
This means that if your categorical variable has 100 unique values, you’ll have to create 100 new variables when using one hot encoding.
Depending on the size of your dataset and the type of variables you’re working with, you may prefer one hot encoding or label encoding.
Python implementation for label encoding
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/04. Data Preprocessing/data.csv')
data.drop('Unnamed: 0',axis=1,inplace=True)
datacopy = data.copy()
datacopy1 = data.copy()
print(data)
print('-------------------------')
print(datacopy)
print('-------------------------')
print(datacopy1)
Output:
     Gender Married
0      Male      No
1      Male     Yes
2      Male     Yes
3      Male     Yes
4      Male      No
..      ...     ...
609  Female      No
610    Male     Yes
611    Male     Yes
612    Male     Yes
613  Female      No

[614 rows x 2 columns]
-------------------------
(datacopy prints the same 614 rows x 2 columns DataFrame)
-------------------------
(datacopy1 prints the same 614 rows x 2 columns DataFrame)
from sklearn.preprocessing import LabelEncoder
lc=LabelEncoder()
data.Married=lc.fit_transform(data.Married)
print(data.Married)
Output:
0      0
1      1
2      1
3      1
4      0
      ..
609    0
610    1
611    1
612    1
613    0
Name: Married, Length: 614, dtype: int64
Python implementation for one hot encoding:
Approach 1: Using pd.get_dummies() from pandas
- This approach utilizes pandas.get_dummies() function to one-hot encode the categorical variable.
- It directly operates on the DataFrame column and returns a DataFrame with the encoded columns.
- In this case, you are dropping the original column and concatenating the encoded columns to the original DataFrame.
df1=pd.get_dummies(datacopy['Married'],prefix='Married',drop_first=True)
data=pd.concat([datacopy,df1],axis=1).drop(['Married'],axis=1)
print(data)
Output:
     Gender  Married_Yes
0      Male        False
1      Male         True
2      Male         True
3      Male         True
4      Male        False
..      ...          ...
609  Female        False
610    Male         True
611    Male         True
612    Male         True
613  Female        False

[614 rows x 2 columns]
# With one line code
datacopy1 = pd.get_dummies(datacopy1,columns=['Gender','Married'],drop_first=True)
datacopy1
Output:
     Gender_Male  Married_Yes
0           True        False
1           True         True
2           True         True
3           True         True
4           True        False
..           ...          ...
609        False        False
610         True         True
611         True         True
612         True         True
613        False        False

[614 rows x 2 columns]
Approach 2: Using OneHotEncoder from scikit-learn
This approach utilizes scikit-learn’s OneHotEncoder to encode the categorical variable.
It requires reshaping the input array to a 2D structure before applying fit_transform().
The resulting encoded data will be a numpy array.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Note: in scikit-learn 1.2+ the argument is sparse_output; older versions use sparse=False instead
ohe = OneHotEncoder(sparse_output=False)
# Reshape the input to a 2D array-like structure
# (datacopy still contains the original Gender column; datacopy1 was already dummy-encoded above)
datacopy_reshaped = np.array(datacopy.Gender).reshape(-1, 1)
datacopy_encoded = ohe.fit_transform(datacopy_reshaped)
datacopy_encoded
Output:
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       ...,
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
How to choose approaches
The choice between the two approaches depends on factors such as personal preference, ease of use, and compatibility with the rest of your code.
If we are working with pandas DataFrames and prefer a simpler and more concise solution, pd.get_dummies() can be a good option.
However, if you want more control over the encoding process or need to integrate it with other scikit-learn functionality, using OneHotEncoder may be more suitable.
Removing Outliers:
- An outlier is a value that lies abnormally far from the rest of the data points
Python implementation of finding & imputing outliers
# Importing library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
# Define dataset
student_age = [22,25,30,33,24,22,21,22,23,24,26,28,26,29,29,30,31,20,45,15]
Find outliers using z-score
# Defining function
def detect_outliers(data):
    outliers = []
    threshold = 3  # 3rd standard deviation, from the empirical rule
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
#Finding outlier using created function
detect_outliers(student_age)
Output: [45]
Find outliers using IQR
# sort data
student_age = sorted(student_age)
print("student_age :",student_age)
# calculating q1 & q3
q1,q3 = np.percentile(student_age,[25,75])
print("q1 :",q1,"q3 :",q3)
# calculating iqr
iqr = q3 - q1
print("iqr :",iqr)
# Finding lower bound(min value) and upper bound(max value)
lower_bound = q1 - (1.5*iqr)
upper_bound = q3 + (1.5*iqr)
print("lower_bound :",lower_bound,"upper_bound :",upper_bound)
# Finding outliers
outliers = []
for i in student_age:
    if i < lower_bound or i > upper_bound:
        outliers.append(i)
print("outliers :", outliers)
Output:
student_age : [15, 20, 21, 22, 22, 22, 23, 24, 24, 25, 26, 26, 28, 29, 29, 30, 30, 31, 33, 45]
q1 : 22.0 q3 : 29.25
iqr : 7.25
lower_bound : 11.125 upper_bound : 40.125
outliers : [45]
Imputing outliers
student_age1 = pd.Series(student_age)
# Replace values above the upper bound with the mean (note: this mean is computed while the outlier is still present)
student_age1.loc[student_age1 > upper_bound] = np.mean(student_age1)
print(student_age1)
Output:
0     15.00
1     20.00
2     21.00
3     22.00
4     22.00
5     22.00
6     23.00
7     24.00
8     24.00
9     25.00
10    26.00
11    26.00
12    28.00
13    29.00
14    29.00
15    30.00
16    30.00
17    31.00
18    33.00
19    26.25
dtype: float64
Find & see outliers visually using a boxplot
import seaborn as sns
#Before outlier removal
plt.subplot(2,2,1)
sns.boxplot(student_age,orient="h")
plt.title('Before outlier removal')
#After outlier removal
plt.subplot(2,2,2)
sns.boxplot(student_age1,orient="h")
plt.title('After outlier removal')
plt.tight_layout()
plt.show()
Feature Scaling
It is a technique to bring the independent features of the data onto a fixed range or scale, hence the name Feature Scaling.
Feature Scaling is one of the last steps in the whole life cycle of Feature Engineering.
Once we are done with all the other steps of feature engineering, like encoding variables and handling missing values, we scale all the variables.
Depending on the technique, the values are typically squeezed into a small range such as 0 to 1, or centered around zero with unit standard deviation.
Why Feature Scaling?
- Real-life datasets have many features with a wide range of values. For example, consider a house price prediction dataset.
- It will have many features like the number of bedrooms, the square-foot area of the house, etc.
- As you can guess, the number of bedrooms might vary between 1 and 5, while the square-foot area might range from 500 to 2000.
- This is a huge difference in the range of the two features.
- Without scaling, features with larger units or numerical ranges might dominate the model’s learning process, leading to biased predictions.
- Some machine learning algorithms, especially those that rely on distance calculations or gradients, are sensitive to the scale of the features.
Which machine learning algorithm needs scaling?
- Gradient-descent-based and distance-based algorithms require feature scaling, while tree-based algorithms do not.
Types of Feature Scaling:
Standardization:
- Standard Scaler
Normalization:
- Min Max Scaling
- Mean Normalization
- Max Absolute Scaling
- Robust Scaling etc.
01. Standardization:
Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation.
This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Formula of Standardization: z = (x − μ) / σ, where x = value, μ = mean, σ = standard deviation
Scaling technique: StandardScaler
fit_transform() should be applied to the training set and only transform() to the test set, to avoid data leakage.
Python Implementation for StandardScaler:
# importing sklearn StandardScaler class which is for Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() # creating an instance of the class object
X_new = sc.fit_transform(X) # X is assumed to be a DataFrame/array of numeric features defined earlier
# plotting the scatterplot of before and after Standardization
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Standardization", fontsize=18)
sns.scatterplot(data = X, color="blue")
#sns.histplot(data=X ,color="red",kde=True)
plt.subplot(1,2,2)
plt.title("Scatterplot After Standardization", fontsize=18)
sns.scatterplot(data = X_new, color="blue")
#sns.histplot(data=X_new ,color="red",kde=True)
plt.tight_layout()
plt.show()
02. Normalization
- Normalization changes the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information.
Min Max Scaling
Min-max normalization is one of the most common ways to normalize data.
For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
Min-max normalization performs best when the minimum and maximum values are distinct and known.
Formula of Min Max Scaling: X_sc = (X − X_min) / (X_max − X_min)
Python Implementation for MinMaxScaler
# importing sklearn MinMaxScaler class which is for normalization (Min Max Scaling)
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler() # creating an instance of the class object
X_new = mm.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Min Max Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Min Max Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Min Max Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
Max Absolute Scaling
Scale each feature by its maximum absolute value.
This estimator scales each feature individually such that the maximal absolute value of each feature in the training set will be 1.0.
It does not shift/center the data, and thus does not destroy any sparsity.
This scaler can also be applied to sparse CSR or CSC matrices.
Max Absolute scaling will perform a lot better in sparse data or when most of the values are 0.
Formula of Max Absolute Scaling: X_sc = X / max(|X|)
Python Implementation for MaxAbsScaler
# importing sklearn MaxAbsScaler class which is for Max Absolute Scaling
from sklearn.preprocessing import MaxAbsScaler
ma = MaxAbsScaler() # creating an instance of the class object
X_new = ma.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Max Absolute Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()