Data Preprocessing:
Data preprocessing involves cleaning and transforming raw data to make it suitable for analysis.
This can include tasks such as handling missing values, scaling numerical features, encoding categorical variables, and so on.
The goal of data preprocessing is to prepare the data for modeling by ensuring it is in the correct format, free of errors and inconsistencies, and ready for further analysis.
Basic Techniques:
- Removing unwanted columns
- Removing duplicated values
- Imputing missing values
- Encoding categorical variables
- Removing outliers
- Data normalization/scaling
- Transformation
- Balancing the data
When to do Preprocessing:
It is generally recommended to split the data into training and testing sets before performing data preprocessing. Here is the recommended order of operations:
1. Split the original dataset into training and testing sets. This should be done before any preprocessing steps.
2. Perform data preprocessing steps, such as handling missing values, encoding categorical variables, feature scaling, or any other necessary transformations, on the training set only. Remember to keep track of the preprocessing steps applied.
3. Apply the same preprocessing steps that were performed on the training set to the testing set. This ensures that the testing set is processed in the same way as the training set, allowing for a fair evaluation of the model’s performance.
The main reason for this order is to avoid data leakage from the testing set into the training process. By fitting preprocessing steps on the training set only and merely applying them to the testing set, you ensure that the model is trained and evaluated on independent, unbiased data.
It’s important to note that some preprocessing steps, such as calculating statistics for imputation or feature scaling, may require information from the entire dataset. In such cases, it is still recommended to calculate those statistics using only the training set and then apply them to both the training and testing sets.
Overall, the correct order is to split the data first, then perform preprocessing on the training set, and finally apply the same preprocessing steps to the testing set.
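Below is a minimal sketch of this workflow, assuming a pandas DataFrame named df with numeric feature columns and a target column named 'target' (both names are hypothetical and used only for illustration):
# Minimal sketch: split first, then fit preprocessing on the training set only
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
X = df.drop(columns=['target'])   # hypothetical feature/target names
y = df['target']
# 1. Split before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Fit preprocessing objects on the training set only
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))  # statistics learned from train
# 3. Apply the same fitted objects to the testing set (transform only, no fit)
X_test_prep = scaler.transform(imputer.transform(X_test))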
Removing Unwanted Column:
- Sometimes we need to remove columns that carry no predictive information, such as an ID or serial-number column, using .drop()
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Gender': ['Female', 'Male', 'Male', 'Male'],
'Salary': [70000, 80000, 90000, 100000],
'Unwanted_Column': [1, 2, 3, 4] # This is the column we want to remove
}
df = pd.DataFrame(data)
# Remove a single column
df_cleaned = df.drop(columns=['Unwanted_Column'])
# Alternatively, remove multiple columns (e.g., 'Gender' and 'Unwanted_Column')
# df_cleaned = df.drop(columns=['Gender', 'Unwanted_Column'])
Removing Duplicated Value:
Use the .duplicated() method to flag duplicated rows in your dataset, and .duplicated().sum() to count them.
Once you have identified the duplicates, remove them using the .drop_duplicates() method.
This will keep only the first occurrence of each unique value and eliminate subsequent duplicates.
Identify Duplicate Rows:
#Creating dummy dataframe
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'David'],
'Age': [25, 30, 35, 25, 30, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Houston']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
#Identify Duplicate Rows:
duplicates = df.duplicated()
print("\nDuplicate Rows (Boolean Series):")
print(duplicates)
duplicate_count = df.duplicated().sum()
print("\nNumber of duplicate rows:")
print(duplicate_count)
Output:
Original DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    Alice   25     New York
4      Bob   30  Los Angeles
5    David   40      Houston

Duplicate Rows (Boolean Series):
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

Number of duplicate rows:
2
Display Only Duplicate Rows:
duplicate_rows = df[df.duplicated()]
print("\nDuplicate Rows (DataFrame):")
print(duplicate_rows)
Output:
Duplicate Rows (DataFrame):
    Name  Age         City
3  Alice   25     New York
4    Bob   30  Los Angeles
Remove Duplicate Rows:
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
Output:
DataFrame after removing duplicates:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
5    David   40      Houston
Imputing Missing Values:
- A null value or missing value in the context of data analysis refers to an absence of data in a dataset.
- This means that a specific entry or observation for a certain variable (column) is not available or hasn’t been recorded.
Checking the missing value:
- Import data set
import numpy as np
import pandas as pd
df_dm = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /Test_loan_approved.csv')
- Check for missing (null) values in the DataFrame and get a summary of them.
# Checking the missing value
summary = df_dm.isnull().sum()
print(summary)
# Checking the missing value of single column
missing_value_single = df_dm.Gender.isnull().sum()
print("\nTotal number of missing value in gender: ",missing_value_single)
Output:
Loan_ID                    0
Gender                    13
Married                    3
Education                  0
Self_Employed             32
LoanAmount                22
Loan_Amount_Term          14
Credit_History            50
Loan_Status (Approved)     0
dtype: int64

Total number of missing value in gender:  13
- Getting the indexes of rows where values are missing in a specific column using np.where
missing_index1 = np.where(df_dm.Gender.isnull()) #or
print(missing_index1,)
missing_index2 =np.where(df_dm.Gender.isnull()==True)
print("\n",missing_index2)
Output: (array([ 23, 126, 171, 188, 314, 334, 460, 467, 477, 507, 576, 588, 592]),) (array([ 23, 126, 171, 188, 314, 334, 460, 467, 477, 507, 576, 588, 592]),)
- Getting the actual rows from those indexes
df_dm.loc[missing_index2]
Output:
     Loan_ID Gender Married     Education Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History Loan_Status (Approved)
23  LP001050    NaN     Yes  Not Graduate            No       112.0             360.0             0.0                      N
126 LP001448    NaN     Yes      Graduate            No       370.0             360.0             1.0                      Y
171 LP001585    NaN     Yes      Graduate            No       700.0             300.0             1.0                      Y
188 LP001644    NaN     Yes      Graduate           Yes       168.0             360.0             1.0                      Y
314 LP002024    NaN     Yes      Graduate            No       159.0             360.0             1.0                      N
334 LP002103    NaN     Yes      Graduate           Yes       182.0             180.0             1.0                      Y
460 LP002478    NaN     Yes      Graduate           Yes       160.0             360.0             NaN                      Y
467 LP002501    NaN     Yes      Graduate            No       110.0             360.0             1.0                      Y
477 LP002530    NaN     Yes      Graduate            No       132.0             360.0             0.0                      N
507 LP002625    NaN      No      Graduate            No        96.0             360.0             1.0                      N
576 LP002872    NaN     Yes      Graduate            No       136.0             360.0             0.0                      N
588 LP002925    NaN      No      Graduate            No        94.0             360.0             1.0                      Y
592 LP002933    NaN      No      Graduate           Yes       292.0             360.0             1.0                      Y
- Getting the rows with missing values in a single column using the .loc[] function
df_dm.loc[df_dm['Gender'].isnull()]
Output:
     Loan_ID Gender Married     Education Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History Loan_Status (Approved)
23  LP001050    NaN     Yes  Not Graduate            No       112.0             360.0             0.0                      N
126 LP001448    NaN     Yes      Graduate            No       370.0             360.0             1.0                      Y
171 LP001585    NaN     Yes      Graduate            No       700.0             300.0             1.0                      Y
188 LP001644    NaN     Yes      Graduate           Yes       168.0             360.0             1.0                      Y
314 LP002024    NaN     Yes      Graduate            No       159.0             360.0             1.0                      N
334 LP002103    NaN     Yes      Graduate           Yes       182.0             180.0             1.0                      Y
460 LP002478    NaN     Yes      Graduate           Yes       160.0             360.0             NaN                      Y
467 LP002501    NaN     Yes      Graduate            No       110.0             360.0             1.0                      Y
477 LP002530    NaN     Yes      Graduate            No       132.0             360.0             0.0                      N
507 LP002625    NaN      No      Graduate            No        96.0             360.0             1.0                      N
576 LP002872    NaN     Yes      Graduate            No       136.0             360.0             0.0                      N
588 LP002925    NaN      No      Graduate            No        94.0             360.0             1.0                      Y
592 LP002933    NaN      No      Graduate           Yes       292.0             360.0             1.0                      Y
Handling Missing Value:
- Imputation
- Dropping
Imputation:
- Use of fillna() with specific value
# Option 01: use inplace to modify the original DataFrame
df_dm['Gender'].fillna(value='Male',inplace=True)
# Option 02: assign back to the same column instead of using inplace
df_dm['Gender']=df_dm['Gender'].fillna(value='Male')
# Option 03: the value can be passed positionally without value=
df_dm['Gender'].fillna('Male',inplace=True)
Note: all missing values in the “Gender” column will be filled with “Male”. In recent pandas versions, calling fillna with inplace=True on a single column (Options 01 and 03) may raise a FutureWarning about chained assignment, so assigning back to the column (Option 02) is the safer pattern.
- Find a suitable value to fill in the missing places (NaN):
- For numerical data with a normal distribution, the mean can be used via .mean()
- For numerical data that is skewed (not normally distributed), the median can be used via .median()
- For categorical data, the mode can be used via .mode()
- The .ffill() method, also known as “forward fill,” fills missing values by propagating the last valid (non-missing) observation forward.
- The .bfill() method, also known as “backward fill,” fills missing values by propagating the next valid (non-missing) observation backward.
- If the dataset is large, .dropna() can be used to simply drop the few rows containing missing values
- The .loc[] function can be used to impute missing values in a specific column
# Fill missing value using mean
# Mean = the average value (the sum of all values divided by number of values).
#Option01 & best practice
x = df_dm['LoanAmount'].mean()
df_dm['LoanAmount'].fillna(x,inplace=True)
#Option02
#df_dm['LoanAmount'].fillna(df_dm['LoanAmount'].mean(),inplace=True)
#Option03
df_dm['LoanAmount']=df_dm['LoanAmount'].fillna(df_dm['LoanAmount'].mean())
# Fill missing value using median
# Median = the value in the middle, after you have sorted all values ascending.
#Option01 & Best Practice
x = df_dm['CoapplicantIncome'].median()
df_dm['CoapplicantIncome'].fillna(x,inplace=True)
#Option02
#df_dm['CoapplicantIncome'].fillna(df_dm['CoapplicantIncome'].median(),inplace=True)
#Option03
df_dm['CoapplicantIncome']=df_dm['CoapplicantIncome'].fillna(df_dm['CoapplicantIncome'].median())
# Fill missing value using mode
# Mode = The value that appears most frequently
#Option01 & Best Practice
x = df_dm['Gender'].mode()[0]
df_dm['Gender'].fillna(x,inplace=True)
#Option02
# In case of mode, there might be more than one mode value; [0] selects the first one
df_dm['Gender'].fillna(df_dm['Gender'].mode()[0],inplace=True)
#Option03
df_dm['Gender']=df_dm['Gender'].fillna(df_dm['Gender'].mode()[0])
import pandas as pd
# Sample DataFrame with missing values
data = {
'Date': ['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04', '2024-08-05'],
'Temperature': [None, 25, None, 30, None],
'Sales': [100, None, 150, None, 200]
}
df = pd.DataFrame(data)
# Apply forward fill (ffill) first
df_filled = df.ffill()
# Apply backward fill (bfill) next to handle any remaining NaNs
df_filled = df_filled.bfill()
# Use of dropna() to remove rows containing null values
#Option01
df_dm.dropna(inplace=True)
#Option02, Creating new DataFrame removing null values
df_dm_new = df_dm.dropna()
# Using loc function to impute missing on specific column
# .isnull() selects only the rows where 'Credit_History' is null; the column label restricts the assignment to the Credit_History column
df_dm.loc[df_dm['Credit_History'].isnull(),'Credit_History']=1
Encoding Categorical Variables
Often in machine learning, we want to convert categorical variables into some type of numeric format that can be readily used by algorithms.
There are two common ways to convert categorical variables into numeric variables:
- Label Encoding: Assign an integer value to each categorical value based on alphabetical order.
For example, suppose we have a dataset with a categorical Team variable whose values are A, B, and C, and we would like to convert it into a numeric variable.
Using label encoding, we would convert each unique value in the Team column into an integer value based on alphabetical order:
In this example, we can see:
Each “A” value has been converted to 0.
Each “B” value has been converted to 1.
Each “C” value has been converted to 2.
We have successfully converted the Team column from a categorical variable into a numeric variable.
- One Hot Encoding: Create new variables that take on values 0 and 1 to represent the original categorical values.
Using one hot encoding, we would convert the Team column into new variables that contain only 0 and 1 values.
When using this approach, we create one new column for each unique value in the original categorical variable.
For example, the categorical variable Team had three unique values so we created three new columns in the dataset that all contain 0 or 1 values.
Here’s how to interpret the values in the new columns:
The value in the new Team_A column is 1 if the original value in the Team column was A. Otherwise, the value is 0.
The value in the new Team_B column is 1 if the original value in the Team column was B. Otherwise, the value is 0.
The value in the new Team_C column is 1 if the original value in the Team column was C. Otherwise, the value is 0.
We have successfully converted the Team column from a categorical variable into three numeric variables – sometimes referred to as “dummy” variables.
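As a small, self-contained illustration of both encodings applied to a hypothetical Team column like the one described above (the column name and values exist only for this example):
import pandas as pd
df_team = pd.DataFrame({'Team': ['A', 'B', 'C', 'A', 'B']})
# Label encoding: each category becomes an integer based on alphabetical order (A->0, B->1, C->2)
df_team['Team_label'] = df_team['Team'].astype('category').cat.codes
# One hot encoding: one new 0/1 column per unique value (Team_A, Team_B, Team_C)
team_dummies = pd.get_dummies(df_team['Team'], prefix='Team')
print(pd.concat([df_team, team_dummies], axis=1))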
How to choose technique:
In most scenarios, one hot encoding is the preferred way to convert a categorical variable into a numeric variable because label encoding makes it seem that there is a ranking between values.
The label-encoded data makes it seem like team C is somehow greater or larger than teams B and A since it has a higher numeric value.
This isn’t an issue if the original categorical variable actually is an ordinal variable with a natural ordering or ranking, but in many scenarios, this isn’t the case.
However, one drawback of one hot encoding is that it requires you to make as many new variables as there are unique values in the original categorical variable.
This means that if your categorical variable has 100 unique values, you’ll have to create 100 new variables when using one hot encoding.
Depending on the size of your dataset and the type of variables you’re working with, you may prefer one hot encoding or label encoding.
Python implementation for label encoding
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/04. Data Preprocessing/data.csv')
data.drop('Unnamed: 0',axis=1,inplace=True)
datacopy = data.copy()
datacopy1 = data.copy()
print(data)
print('-------------------------')
print(datacopy)
print('-------------------------')
print(datacopy1)
Output:
     Gender Married
0      Male      No
1      Male     Yes
2      Male     Yes
3      Male     Yes
4      Male      No
..      ...     ...
609  Female      No
610    Male     Yes
611    Male     Yes
612    Male     Yes
613  Female      No

[614 rows x 2 columns]
-------------------------
(the same 614 x 2 output is printed again for datacopy)
-------------------------
(the same 614 x 2 output is printed again for datacopy1)
from sklearn.preprocessing import LabelEncoder
lc=LabelEncoder()
data.Married=lc.fit_transform(data.Married)
print(data.Married)
Output:
0      0
1      1
2      1
3      1
4      0
      ..
609    0
610    1
611    1
612    1
613    0
Name: Married, Length: 614, dtype: int64
Python implementation for one hot encoding:
Approach 1: Using pd.get_dummies() from pandas
- This approach utilizes pandas.get_dummies() function to one-hot encode the categorical variable.
- It directly operates on the DataFrame column and returns a DataFrame with the encoded columns.
- In this case, you are dropping the original column and concatenating the encoded columns to the original DataFrame.
df1=pd.get_dummies(datacopy['Married'],prefix='Married',drop_first=True)
data=pd.concat([datacopy,df1],axis=1).drop(['Married'],axis=1)
print(data)
Output:
     Gender  Married_Yes
0      Male        False
1      Male         True
2      Male         True
3      Male         True
4      Male        False
..      ...          ...
609  Female        False
610    Male         True
611    Male         True
612    Male         True
613  Female        False

[614 rows x 2 columns]
# With one line code
datacopy1 = pd.get_dummies(datacopy1,columns=['Gender','Married'],drop_first=True)
datacopy1
Output:
     Gender_Male  Married_Yes
0           True        False
1           True         True
2           True         True
3           True         True
4           True        False
..           ...          ...
609        False        False
610         True         True
611         True         True
612         True         True
613        False        False

[614 rows x 2 columns]
Approach 2: Using OneHotEncoder from scikit-learn
This approach utilizes scikit-learn’s OneHotEncoder to encode the categorical variable.
It requires reshaping the input array to a 2D structure before applying fit_transform().
The resulting encoded data will be a numpy array.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
ohe = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2, use sparse=False instead
# Reshape the input to a 2D array-like structure
datacopy1_reshaped = np.array(datacopy1.Gender).reshape(-1, 1)
datacopy1_encoded = ohe.fit_transform(datacopy1_reshaped)
datacopy1_encoded
Output:
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       ...,
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
How to choose approaches
The choice between the two approaches depends on factors such as personal preference, ease of use, and compatibility with the rest of your code.
If we are working with pandas DataFrames and prefer a simpler and more concise solution, pd.get_dummies() can be a good option.
However, if you want more control over the encoding process or need to integrate it with other scikit-learn functionality, using OneHotEncoder may be more suitable.
Removing the Outlier:
- An outlier is a value that lies at an abnormal distance from the rest of the data points
Python implementation of finding & imputing outliers
# Importing library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
# Define dataset
student_age = [22,25,30,33,24,22,21,22,23,24,26,28,26,29,29,30,31,20,45,15]
Find outliers using z-score
# Defining function
outliers = []
def detect_outliers(data):
    threshold = 3  # 3rd standard deviation from the empirical rule
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
#Finding outlier using created function
detect_outliers(student_age)
Output: [45]
Find outliers using IQR
# sort data
student_age = sorted(student_age)
print("student_age :",student_age)
# calculating q1 & q3
q1,q3 = np.percentile(student_age,[25,75])
print("q1 :",q1,"q3 :",q3)
# calculating iqr
iqr = q3 - q1
print("iqr :",iqr)
# Finding lower bound(min value) and upper bound(max value)
lower_bound = q1 - (1.5*iqr)
upper_bound = q3 + (1.5*iqr)
print("lower_bound :",lower_bound,"upper_bound :",upper_bound)
# Finding outlier
outliers = []
for i in student_age:
if i<lower_bound or i>upper_bound:
outliers.append(i)
print("outliers :",outliers)
Output:
student_age : [15, 20, 21, 22, 22, 22, 23, 24, 24, 25, 26, 26, 28, 29, 29, 30, 30, 31, 33, 45]
q1 : 22.0 q3 : 29.25
iqr : 7.25
lower_bound : 11.125 upper_bound : 40.125
outliers : [45]
Imputing outlier
student_age1=pd.Series(student_age)
student_age1.loc[student_age1>upper_bound] = np.mean(student_age1)
print(student_age1)
Output:
0     15.00
1     20.00
2     21.00
3     22.00
4     22.00
5     22.00
6     23.00
7     24.00
8     24.00
9     25.00
10    26.00
11    26.00
12    28.00
13    29.00
14    29.00
15    30.00
16    30.00
17    31.00
18    33.00
19    26.25
dtype: float64
Find & see outlier visually using boxplot
import seaborn as sns
#Before outlier removal
plt.subplot(2,2,1)
sns.boxplot(student_age,orient="h")
plt.title('Before outlier removal')
#After outlier removal
plt.subplot(2,2,2)
sns.boxplot(student_age1,orient="h")
plt.title('After outlier removal')
Feature Scaling
It is a technique to standardize the independent features in data in a fixed range or scale. Thus the name Feature Scaling.
Feature Scaling is one of the last steps in the whole life cycle of Feature Engineering.
Once we are done with all the other steps of feature engineering, like encoding variables, handling missing values, etc., we scale all the variables.
Depending on the technique, the values are rescaled into a small, comparable range (for example, roughly -1 to +1 or 0 to 1).
Why Feature Scaling?
- Real-life datasets have many features with widely different ranges of values; consider, for example, a house price prediction dataset.
- It will have features such as the number of bedrooms, the square-foot area of the house, etc.
- As you can guess, the number of bedrooms will vary between 1 and 5, but the square-foot area will range from 500 to 2000.
- This is a huge difference in the range of both features.
- Without scaling, features with larger units or numerical ranges might dominate the model’s learning process, leading to biased predictions.
- Some machine learning algorithms, especially those that rely on distance calculations or gradients, are sensitive to the scale of the features.
Which machine learning algorithm needs scaling?
- Gradient descent-based and distance-based algorithms require feature scaling, while tree-based algorithms do not.
Types of Feature Scaling:
Standardization:
- Standard Scaler
Normalization:
- Min Max Scaling
- Mean Normalization
- Max Absolute Scaling
- Robust Scaling etc.
01. Standardization:
Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation.
This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Formula of Standardization: z = (x – μ )/σ, where x = values ,μ = mean ,σ = Standard Deviation
Scaling technique: StandardScaler
fit_transform() should be applied to the training set and transform() to the test set to avoid data leakage (see the sketch after the code below).
Python Implementation for StandardScaler:
# importing sklearn StandardScaler class which is for Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() # creating an instance of the class object
X_new = sc.fit_transform(X)
# plotting the scatterplot of before and after Standardization
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Standardization", fontsize=18)
sns.scatterplot(data = X, color="blue")
#sns.histplot(data=X ,color="red",kde=True)
plt.subplot(1,2,2)
plt.title("Scatterplot After Standardization", fontsize=18)
sns.scatterplot(data = X_new, color="blue")
#sns.histplot(data=X_new ,color="red",kde=True)
plt.tight_layout()
plt.show()
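The snippet above scales the whole feature matrix X for visualization. For model training, a minimal sketch of the leakage-safe pattern mentioned in the note above would look like this (assuming X_train and X_test already exist from an earlier train/test split):
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # mean and standard deviation learned from the training set only
X_test_scaled = sc.transform(X_test)        # the training statistics are reused on the test set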
02. Normalization
- Normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.
Min Max Scaling
Min-max normalization is one of the most common ways to normalize data.
For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
Min-max normalization performs best when the minimum and maximum values are distinct and known.
Formula of Min Max Scaling: X_sc = (X − X_min) / (X_max − X_min)
Python Implementation for MinMaxScaler
# importing sklearn MinMaxScaler class, which is used for min-max normalization
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler() # creating an instance of the class object
X_new = mm.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Min Max Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Min Max Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Min Max Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
Max Absolute Scaling
Scale each feature by its maximum absolute value.
This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0.
It does not shift/center the data, and thus does not destroy any sparsity.
This scaler can also be applied to sparse CSR or CSC matrices.
Max Absolute scaling will perform a lot better in sparse data or when most of the values are 0.
Formula of Max Absolute Scaling: X_sc = X / |X_max|
Python Implementation for MaxAbsScaler
# importing sklearn MaxAbsScaler class, which is used for max absolute scaling
from sklearn.preprocessing import MaxAbsScaler
ma = MaxAbsScaler() # creating an instance of the class object
X_new = ma.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Max Absolute Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
Robust Scaling
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range).
The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
Robust Scaling is best for data that has outliers
Formula of Robust Scaling: X_new = (X − X_median) / IQR
Python Implementation for RobustScaler
# importing sklearn RobustScaler class, which is used for robust scaling
from sklearn.preprocessing import RobustScaler
rs = RobustScaler() # creating an instance of the class object
X_new = rs.fit_transform(X) #fit and transforming
# plotting the scatterplot of before and after Robust Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Robust Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Robust Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
Mean Normalization
It is very similar to min-max scaling, except that the mean is subtracted from the data before dividing by the range (max minus min).
Scikit-learn does not have a dedicated class for mean normalization, but it can easily be done with numpy (see the sketch after the code below).
Formula of Mean Normalization: X' = (X − μ) / (max(X) − min(X))
Python Implementation for Normalizing
# sklearn's normalize() function scales each sample to unit norm; note that this is not the same as the mean-normalization formula above
from sklearn.preprocessing import normalize
X_new = normalize(X, axis=1)  # normalize is a function, not a class
# plotting the scatterplot of before and after Normalize Scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Normalize Scaling", fontsize=18)
sns.scatterplot(data = X, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Normalize Scaling", fontsize=18)
sns.scatterplot(data = X_new, color="red")
plt.tight_layout()
plt.show()
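Because sklearn's normalize() rescales each sample to unit norm rather than applying the mean-normalization formula given above, here is a minimal numpy sketch of the formula itself (assuming X is a 2-D numeric array whose columns all have a non-zero range):
import numpy as np
X_arr = np.asarray(X, dtype=float)
# Mean normalization per column: X' = (X - mean(X)) / (max(X) - min(X))
X_mean_norm = (X_arr - X_arr.mean(axis=0)) / (X_arr.max(axis=0) - X_arr.min(axis=0))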
How to choose scaling technique
- If you are not sure which scaler to use, apply several and check their effect on the models.
- If you do not know much about the data, use StandardScaler; it works well most of the time.
- If the minimum and maximum values of the feature are known (as with pixel values in CNNs), use MinMaxScaler.
- If most of the values in the feature column are 0 (a sparse matrix), use MaxAbsScaler.
- If the data has outliers, use RobustScaler.
Transformations
Some Machine Learning models, like Linear and Logistic regression, assume that the variables follow a normal distribution. More likely, variables in real datasets will follow a skewed distribution.
By applying transformations to these skewed variables, we can map the skewed distribution closer to a normal one, which can improve the performance of our models.
Sklearn supports three families of transformations:
Function Transformation
- Log Transformation
- Reciprocal Transformation
- Square Transformation
- Square Root Transformation
Power Transformation
- Box-Cox Transformation
- Yeo-Johnson Transformation
Quantile transformation
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
from scipy.stats import boxcox
from scipy.stats import yeojohnson
# Loading Data Sets
df= pd.read_csv("/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/04. Data Preprocessing/heart_disease_uci.csv")
Log Transformation & Python implementation:
Generally, this transformation brings the data closer to a normal distribution but does not make it exactly normal.
It cannot be applied to features that contain zero or negative values.
It is mostly applied to right-skewed data.
It converts multiplicative relationships in the data into additive (linear) ones.
# Creating a log-normal distribution since we don't have real data
rightskewed =np.random.lognormal(4,1,10000)
plt.subplot(2,2,1)
sns.histplot(rightskewed)
plt.title("Right skewed distribution")
# Log Transformation
Normal_dis_data = np.log(rightskewed)
plt.subplot(2,2,2)
sns.histplot(Normal_dis_data)
plt.title("Normal distribution")
Square Root Transformation & Python Implementation:
This transformation is defined only for positive numbers.
This transformation is weaker than Log Transformation.
This can be used for reducing the skewness of right-skewed data.
import numpy as np
import matplotlib.pyplot as plt
#make this example reproducible
np.random.seed(0)
#create beta distributed random variable with 300 values
data = np.random.beta(a=1, b=5, size=300)
#create sqrt-transformed data
data_sqrt = np.sqrt(data)
#define grid of plots
fig, axs = plt.subplots(nrows=1, ncols=2)
#create histograms
axs[0].hist(data, edgecolor='black')
axs[1].hist(data_sqrt, edgecolor='black')
#add title to each histogram
axs[0].set_title('Original Data')
axs[1].set_title('Square Root Transformed Data')
Reciprocal transformation & Python Implementation:
- In this transformation, x is replaced by its reciprocal (1/x).
- The reciprocal transformation often has only a modest effect on the shape of the distribution.
- This transformation can be only used for non-zero values.
plt.figure(figsize=(9,3))
plt.subplot(1,4,1)
sns.histplot(df["trestbps"],kde=True)
plt.title("DISTRIBUTION BEFORE",)
#Transformation
plt.subplot(1,4,3)
reciprocal_data = 1/df.trestbps
sns.histplot(reciprocal_data,kde =True)
plt.title("DISTRIBUTION AFTER ",)
Box-cox transformation & Python Implementation:
Box-Cox requires the input data to be strictly positive (not even zero is acceptable).
Mathematical formula for the Box-Cox transformation: y(λ) = (x^λ − 1)/λ when λ ≠ 0, and y(λ) = log(x) when λ = 0.
Values of lambda from -5 to 5 are considered, and the best value for the data is selected.
The “best” value is the one that gives the distribution closest to normal (the least skewness).
When lambda is zero, the Box-Cox transformation reduces to the log transformation.
plt.figure(figsize=(15,4))
plt.subplot(1,4,1)
sns.histplot(df["trestbps"],kde=True)
plt.title("DISTRIBUTION BEFORE",)
# Transformation
plt.subplot(1,4,3)
bcx_target, lamda = boxcox(df["trestbps"])
sns.histplot(bcx_target,kde=True)
plt.title("DISTRIBUTION AFTER BOX-COX ",)
YEO-JOHNSON Transformation & Python Implementation:
This is a more recent transformation technique, very similar to the Box-Cox transformation, but it does not require the values to be strictly positive.
This transformation also has the ability to make the distribution more symmetric.
For features that have zero or negative values, Yeo-Johnson can be used in place of Box-Cox.
plt.figure(figsize=(15,4))
plt.subplot(1,4,1)
sns.distplot(df["chol"])
plt.title("DISTRIBUTION BEFORE",)
# Transformation
plt.subplot(1,4,3)
yf_target, lam = yeojohnson(df["chol"])
sns.histplot(yf_target, kde=True)
plt.title("DISTRIBUTION AFTER YEO JHONSON ",)
Balancing the Data:
Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you’ll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).
Some Machine Learning algorithms are more sensitive toward imbalanced data, such as Logistic Regression and Support Vector Machine. However, some algorithms tackle this issue themselves, such as Random Forest and XGBoost.
Two main causes of an imbalanced dataset:
- Data sampling (biased sampling, measurement errors, etc.)
- Properties of the domain
Sampling techniques are used to deal with imbalanced data.
Sampling: The idea behind sampling is to create new samples or choose some records from the whole data set.
There are two sampling techniques available to handle the imbalanced data:
- Under Sampling
- Over Sampling
Python Implementation for balancing the data with different technique:
# import necessary libraries
import pandas as pd
import numpy as np
import imblearn
import matplotlib.pyplot as plt
import seaborn as sns
# Creating imbalanced data set
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /05_KNN/KNN Class/diabetes.csv')
# Creating Independent & dependent variable
X = data.iloc[:,0:8]
y = data.iloc[:,-1]
y.value_counts() # To see imbalanced data set
print((y.value_counts()/y.value_counts().sum())*100)
Output:
Outcome
0    65.8
1    34.2
Name: count, dtype: float64
# Installing library
!pip install imblearn
Output: Collecting imblearn Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes) Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.12.3) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.26.4) Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.13.1) Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.3.2) Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.4.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.5.0) Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB) Installing collected packages: imblearn Successfully installed imblearn-0.0
Undersampling:
- It balances a skewed class distribution by removing examples from the majority class.
- The basic undersampling technique removes examples at random from the majority class; in imblearn this is the RandomUnderSampler.
- Cons: there is a risk of losing useful or important information that could define the decision boundary between the classes.
# Use RandomUnderSampler to balance the data
from imblearn.under_sampling import RandomUnderSampler
model = RandomUnderSampler() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 684, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (1368, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    684
1    684
Name: count, dtype: int64
Near Miss Undersampling:
This technique selects the data points based on the distance between majority and minority class examples.
It has three versions, each of which considers different neighbors from the majority class.
Version 1 keeps examples with a minimum average distance to the nearest records of the minority class.
Version 2 selects rows with a minimum average distance to the furthest records of the minority class.
Version 3 keeps a given number of majority class examples for each of the closest records in the minority class.
Among these, version 3 is often preferred, since it keeps the majority class examples that lie on the decision boundary.
# Use Near Miss Undersampling to balance the data
from imblearn.under_sampling import NearMiss
model = NearMiss() ## object creation; pass version=1, 2, or 3 to choose the variant (default is 1)
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 684, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (1368, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    684
1    684
Name: count, dtype: int64
Condensed Nearest Neighbor (CNN) Undersampling:
This technique seeks a minimal subset of the samples that is still sufficient to classify the remaining data correctly.
The selected examples are kept in a “store” that consists of all minority class examples plus the majority class examples that are misclassified by the current store.
# Use Condensed Nearest Neighbor (CNN) to balance the data
from imblearn.under_sampling import CondensedNearestNeighbour
model = CondensedNearestNeighbour() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 684, 0: 232})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (916, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    684
0    232
Name: count, dtype: int64
Tomek Links Undersampling:
- This technique is a modification of CNN in which, rather than selecting redundant examples at random, pairs of nearest neighbors from opposite classes (Tomek links) are identified and the majority class example of each pair is removed.
- Because only these borderline examples are removed, it barely changes the class balance.
# Use Tomek Links to balance the data
from imblearn.under_sampling import TomekLinks
model = TomekLinks() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 1316, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2000, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Neighborhood Cleaning Undersampling:
- This approach is a combination of CNN and ENN techniques.
- Initially, it selects all the minority class examples.
- Then ENN identifies the ambiguous samples to remove from the majority class.
- Then, as in CNN, majority class examples that are misclassified against the store are deleted, provided the majority class has more than half as many examples as the minority class.
# Use Neighborhood Cleaning rule to balance the data
from imblearn.under_sampling import NeighbourhoodCleaningRule
model = NeighbourhoodCleaningRule() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({0: 1219, 1: 684})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (1903, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
0    1219
1     684
Name: count, dtype: int64
Oversampling:
Oversampling focuses on increasing minority class samples.
The simplest approach duplicates existing minority class examples (random oversampling).
Although it balances the data, it does not provide additional information to the classification model.
Therefore synthesizing new examples using an appropriate technique is necessary. Here SMOTE comes into the picture.
SMOTE:
Synthetic Minority Over-sampling Technique (SMOTE) is a technique that generates new observations by interpolating between observations in the original dataset.
Interpolation is done with the help of the KNN algorithm.
SMOTE picks an instance randomly from the minority class. Then it finds its k nearest neighbors from the minority class itself. One of those neighbors is chosen at random, a line is drawn between the two instances, and a new synthetic example is generated as a convex combination of them.
Do not apply SMOTE to the testing data: when the model is in production, the incoming data may or may not be balanced, so the model should be evaluated on the original distribution.
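As a rough, purely illustrative numpy sketch of the interpolation step described above (this is not the actual imblearn implementation, and the sample values are made up); the real SMOTE usage with imblearn follows right after:
import numpy as np
x_i = np.array([2.0, 3.0])          # a minority class instance
x_neighbor = np.array([4.0, 5.0])   # one of its k nearest minority class neighbors
lam = np.random.rand()              # random weight in [0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)  # convex combination of the two instances
print(x_synthetic)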
# Use SMOTE to balance the data
from imblearn.over_sampling import SMOTE
model = SMOTE() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1316, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2632, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1316
0    1316
Name: count, dtype: int64
Borderline-SMOTE:
- This SMOTE extension oversamples only the minority class instances that are misclassified by a k-nearest neighbors (KNN) classifier.
- The rationale is that borderline examples are more likely to be misclassified, so synthesizing new samples near them is most useful.
# Use BorderlineSMOTE to balance the data
from imblearn.over_sampling import BorderlineSMOTE
model = BorderlineSMOTE() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1316, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2632, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1316
0    1316
Name: count, dtype: int64
Borderline-SMOTE SVM:
- This method uses a Support Vector Machine (SVM) instead of KNN to identify the misclassified (borderline) instances.
# Use Borderline-SMOTE SVM to balance the data
from imblearn.over_sampling import SVMSMOTE
model = SVMSMOTE() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1316, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2632, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1316
0    1316
Name: count, dtype: int64
Adaptive Synthetic Sampling (ADASYN):
This approach works according to the density of the minority class instances: the number of new samples generated in a region is inversely proportional to the density of minority class samples there.
It generates more samples in regions of the feature space where the density of minority class examples is low or zero, and fewer samples in high-density regions.
# Use ADASYN to balance the data
from imblearn.over_sampling import ADASYN
model = ADASYN() ## object creation
X_bal,y_bal = model.fit_resample(X,y)
# See comparison between original data & balanced data
from collections import Counter
print("Actual Classes",Counter(y))
print("Balanced Classes",Counter(y_bal))
print('-----------------------------------------------------')
print("Shape of actual independent variable :",X.shape)
print("Shape of balanced independent variable :",X_bal.shape)
print('-----------------------------------------------------')
print('Count of class of y variable :',y.value_counts())
print('Count of class of y_bal variable :',y_bal.value_counts())
Output:
Actual Classes Counter({0: 1316, 1: 684})
Balanced Classes Counter({1: 1344, 0: 1316})
-----------------------------------------------------
Shape of actual independent variable : (2000, 8)
Shape of balanced independent variable : (2660, 8)
-----------------------------------------------------
Count of class of y variable : Outcome
0    1316
1     684
Name: count, dtype: int64
Count of class of y_bal variable : Outcome
1    1344
0    1316
Name: count, dtype: int64
What is difference between fit , transform & fit_transform in machine learning?
In machine learning, fit, transform, and fit_transform are methods used in the context of feature scaling, data preprocessing, and model training. Here’s what each of them does:
fit:
- In the context of preprocessing, the fit method is used to compute the necessary parameters needed to perform some transformation on the data.
- For instance, if you’re using a feature scaling technique like standardization (subtracting the mean and dividing by the standard deviation), you use the fit method to calculate the mean and standard deviation from your data.
- These parameters are then used for transforming the data.
#Example:
X_train = [[23,456,12,23,4,56,78,987,45,12,22,32,12]]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # Computes mean and standard deviation from X_train
Output: StandardScaler()
transform:
Once you have computed the parameters using the fit method, you can use the transform method to apply the transformation to a particular dataset.
This is useful when you have a separate dataset (e.g., test data) that you want to transform using the same parameters computed from the training data.
#Example:
X_test = [[12,23,45,65,76,87,23,23,23,4,56,45,12]]
X_test_scaled = scaler.transform(X_test) # Applies the same transformation to X_test using mean and standard deviation computed from X_train
X_test_scaled
Output: array([[ -11., -433., 33., 42., 72., 31., -55., -964., -22., -8., 34., 13., 0.]])
fit_transform:
- This is a shorthand method that combines the fit and transform steps.
- It computes the parameters and applies the transformation in one step.
- It is often more efficient than calling fit and then transform separately, especially for large datasets, as it can sometimes optimize the computation.
#Example:
X_train_scaled = scaler.fit_transform(X_train) # Computes mean and standard deviation from X_train and applies the transformation in one step
In summary:
- Use fit to compute parameters from the training data.
- Use transform to apply the computed parameters to any dataset.
- Use fit_transform to perform both steps in one go, typically on the training data for efficiency.