The Curse of Dimensionality:

Often, data scientists work with datasets that have thousands of features.

These create two kinds of problems:

  • Increase in computation time:

    The majority of machine learning algorithms rely on distance calculations for model building, and as the number of dimensions increases it becomes more and more computation-intensive to build a model.

    Another point to consider is that as the number of dimensions increases, points move farther and farther away from each other, so the data becomes sparse (see the short sketch after this list).

  • Hard (or almost impossible) to visualise the relationship between features:

    Humans are bound by their perception of a maximum of three dimensions.

    We can’t comprehend shapes/graphs beyond three dimensions.

    So, if we have an n-dimensional dataset, the only solution left to us is to create either a 2-D or 3-D graph out of it.
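
To make the distance point above concrete, here is a minimal sketch (assuming NumPy and SciPy are available; illustrative only) showing how the average distance between random points grows as the number of dimensions grows:

# Minimal sketch: average pairwise distance between random points grows with dimension
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((200, d))   # 200 random points in the d-dimensional unit cube
    print(f"d={d:4d}  mean pairwise distance ~ {pdist(points).mean():.2f}")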

Disadvantages of having more dimensions:

  • Training time increases
  • Data Visualization becomes difficult
  • Computational resources requirement increases
  • Chances of overfitting are high
  • Exploring the data becomes difficult

Two ways to address the curse of dimensionality

  1. Feature Selection: drop the less important features.

  2. Dimensionality Reduction: derive new features from the existing set of features, which is called feature extraction.

    Among the many algorithms, we will discuss PCA here.
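
To illustrate the difference between the two approaches, here is a tiny sketch (with a hypothetical DataFrame df and column names f1–f3; the PCA workflow itself is shown in detail later in this article):

# Sketch: feature selection vs. feature extraction (hypothetical df and columns)
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({'f1': [1, 2, 3, 4], 'f2': [2, 1, 4, 3], 'f3': [5, 5, 6, 7]})

selected = df.drop(columns=['f3'])                    # feature selection: drop a less important feature
extracted = PCA(n_components=2).fit_transform(df)     # feature extraction: derive new features
print(selected.shape, extracted.shape)                # (4, 2) (4, 2)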

Principal Component Analysis (PCA)

  • Principal component analysis is an unsupervised machine learning algorithm used for dimensionality reduction through feature extraction.

  • As the name suggests, it finds out the principal components from the data.

  • PCA transforms the data from a higher-dimensional space into a new, lower-dimensional subspace.

  • This results in an entirely new coordinate system for the points, where the first axis corresponds to the first principal component, which explains the most variance in the data.

  • The PCA algorithm is based on some mathematical concepts such as:

    • Variance and Covariance
    • Eigenvalues and Eigenvectors
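
To make these building blocks concrete, here is a small NumPy sketch (with made-up numbers, not the dataset used later) that computes a covariance matrix and its eigenvalues and eigenvectors:

# Sketch: covariance matrix of a tiny 2-feature dataset and its eigen decomposition
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

X_centered = X - X.mean(axis=0)                    # centre each feature
cov = np.cov(X_centered, rowvar=False)             # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)    # eigh: for symmetric matrices

print(cov)            # covariance between each pair of features
print(eigenvalues)    # variance along each principal direction
print(eigenvectors)   # columns are the principal directions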

What are the principal components?

  • Principal components are the derived features which explain the maximum variance in the data.

  • The first principal component explains the most variance, the second a bit less, and so on.

  • Each of the new dimensions found using PCA is a linear combination of the original features.
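
As a quick illustration with scikit-learn (hypothetical random data), each row of pca.components_ holds the weights of the linear combination of original features that forms one principal component:

# Sketch: rows of pca.components_ are the linear-combination weights
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((50, 4))          # 50 samples, 4 original features (made-up data)

pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_.shape)     # (2, 4): one weight vector per principal component
print(pca.components_[0])        # weights defining the first principal component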

Some common terms used in PCA algorithm:

  • Dimensionality: It is the number of features or variables present in the given dataset. More simply, it is the number of columns in the dataset.

  • Correlation: It signifies how strongly two variables are related to each other; if one changes, the other also changes. The correlation value ranges from -1 to +1, where -1 indicates that the variables are inversely related and +1 indicates that they are directly related.

  • Orthogonal: It means that the variables are uncorrelated with each other, and hence the correlation between the pair of variables is zero.

  • Eigenvectors: Given a square matrix A and a non-zero vector v, v is an eigenvector of A if Av is a scalar multiple of v, i.e., Av = λv (illustrated in the sketch after this list).

  • Covariance Matrix: A matrix containing the covariance between each pair of variables is called the covariance matrix.
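
The eigenvector definition above can be verified directly; a minimal sketch with an illustrative matrix:

# Sketch: check that A @ v equals lambda * v for an eigenpair of A
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                    # an illustrative symmetric square matrix
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]                        # first eigenvector
lam = eigenvalues[0]                          # its eigenvalue
print(np.allclose(A @ v, lam * v))            # True: A·v is just v scaled by lambda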

Explained Variance Ratio

  • It represents the amount of variance each principal component is able to explain.

  • The total variance is the sum of variances of all individual principal components.

  • The fraction of variance explained by a principal component is the ratio between the variance of that principal component and the total variance.

For example,

  • Variance of PC1 is 50 and

  • Variance of PC2 is 5.

    So the total variance is 55.

EVR of PC1 = Variance of PC1 / Total variance = 50/55 = 0.91

EVR of PC2 = Variance of PC2 / Total variance = 5/55 = 0.09

Thus PC1 explains 91% of the variance in the data, whereas PC2 explains only 9%. Hence we can use only PC1 as the input for our model, as it explains the majority of the variance.
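
The same arithmetic as a tiny code sketch (using the example variances above, not real data):

# Sketch: explained variance ratio from the example variances above
variances = [50, 5]                  # variance of PC1 and PC2
total = sum(variances)
evr = [v / total for v in variances]
print(evr)                           # [0.909..., 0.0909...] -> PC1 ~ 91%, PC2 ~ 9%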

In a real-life scenario, this decision about how many components to keep is made using a scree plot.

Steps involved in PCA:

  1. Scale the data: PCA looks for the directions of maximum variance, and variance is inflated for features with large magnitudes, so we need to scale the data first.

  2. Calculate the covariance matrix: to understand which variables are highly correlated.

  3. Calculate the eigenvectors and eigenvalues (they are computed from the covariance matrix).

    • Eigenvectors determine the directions of the new feature space.

    • Eigenvalues determine their magnitude, i.e., the scaling factor of the respective eigenvector, which corresponds to the amount of variance along that direction.

    • For example:

      If you have a 2-dimensional dataset, there will be 2 eigenvectors with their respective eigenvalues.

      The reason for computing the eigenvectors of the covariance matrix is to find where in the data the variance is concentrated.

      The covariance matrix summarises the variance among all the variables in the data.

      More variance means more information about the data.

      So the eigenvectors tell us along which directions the data has the most variance.

  4. Compute the Principal Components:

    • After identifying the eigenvectors and eigenvalues, sort them in descending order of eigenvalue; the highest eigenvalue corresponds to the most significant component.
    • The PCs are the new features obtained, and they possess most of the useful information that was scattered among the initial variables.
    • These PCs are orthogonal to each other, i.e., the correlation between any two of them is zero.
  5. Reduce the dimensions of the data:

    • Eliminate the PCs with the smallest eigenvalues, as they carry the least information (see the sketch after these steps).
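
Putting the five steps together, here is a compact NumPy sketch of PCA written out by hand (illustrative only; the worked example below uses scikit-learn instead):

# Sketch: the five PCA steps above, written out with NumPy (illustrative only)
import numpy as np

def pca_from_scratch(X, n_components):
    # 1. Scale the data (centre and standardise each feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvalues and eigenvectors of the covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort components by eigenvalue, largest first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Keep the top components and project the data onto them
    return X_std @ eigenvectors[:, :n_components], eigenvalues

rng = np.random.default_rng(0)
X = rng.random((100, 5))                       # made-up data: 100 samples, 5 features
scores, eigenvalues = pca_from_scratch(X, n_components=2)
print(scores.shape)                            # (100, 2)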

Scree Plots:

  • Scree plots are graphs that convey how much variance is explained by the corresponding principal components.
  • The Scree Plot helps in deciding how many components to retain by identifying the “elbow” in the plot.

Example:

  • In a plot where the explained variance drops significantly from Component 1 to Component 2, and then the drop becomes smaller and more gradual from Component 3 onwards, the elbow point would likely be at Component 2.
  • This means that the first two principal components explain most of the variance, and adding more components has diminishing returns.

Python Implementation of PCA:

# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Load Dataset

data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /10_Dimensionality-Reduction/PCA Class/glass.data')
data.head()
Output:
  index       RI     Na    Mg    Al     Si     K    Ca   Ba   Fe  Class
0      1  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.0  0.0      1
1      2  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.0  0.0      1
2      3  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.0  0.0      1
3      4  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.0  0.0      1
4      5  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.0  0.0      1
data.Class.value_counts()
Output:
Class
2    76
1    70
7    29
3    17
5    13
6     9
Name: count, dtype: int64

EDA – skipped here for brevity

Data Preprocessing

data.isnull().sum()
Output:
index    0
RI       0
Na       0
Mg       0
Al       0
Si       0
K        0
Ca       0
Ba       0
Fe       0
Class    0
dtype: int64
# Creating x and y

x = data.drop(columns=['index','Class'],axis=1)
y = data.Class
# Splitting training & testing data

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=73)

# Creating model

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(x_train,y_train)

y_predict = lr.predict(x_test)
from sklearn.metrics import accuracy_score

score1 = accuracy_score(y_test,y_predict)
score1
Output:
0.6976744186046512

Perform PCA

# Scaling down the data

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

scaled_data = sc.fit_transform(x)
scaled_data
Output:
array([[ 0.87286765,  0.28495326,  1.25463857, ..., -0.14576634,
        -0.35287683, -0.5864509 ],
       [-0.24933347,  0.59181718,  0.63616803, ..., -0.79373376,
        -0.35287683, -0.5864509 ],
       [-0.72131806,  0.14993314,  0.60142249, ..., -0.82894938,
        -0.35287683, -0.5864509 ],
       ...,
       [ 0.75404635,  1.16872135, -1.86551055, ..., -0.36410319,
         2.95320036, -0.5864509 ],
       [-0.61239854,  1.19327046, -1.86551055, ..., -0.33593069,
         2.81208731, -0.5864509 ],
       [-0.41436305,  1.00915211, -1.86551055, ..., -0.23732695,
         3.01367739, -0.5864509 ]])
# Creating new dataframe

new_data = pd.DataFrame(data=scaled_data,columns= x.columns)
new_data.head()
Output:
        RI        Na        Mg        Al        Si         K        Ca        Ba        Fe
0  0.872868  0.284953  1.254639 -0.692442 -1.127082 -0.671705 -0.145766 -0.352877 -0.586451
1 -0.249333  0.591817  0.636168 -0.170460  0.102319 -0.026213 -0.793734 -0.352877 -0.586451
2 -0.721318  0.149933  0.601422  0.190912  0.438787 -0.164533 -0.828949 -0.352877 -0.586451
3 -0.232831 -0.242853  0.698710 -0.310994 -0.052974  0.112107 -0.519052 -0.352877 -0.586451
4 -0.312045 -0.169205  0.650066 -0.411375  0.555256  0.081369 -0.624699 -0.352877 -0.586451
# Getting the optimal number of PCA
from sklearn.decomposition import PCA

pca = PCA()

pca.fit_transform(new_data)

pca.explained_variance_ratio_

Output:
array([2.79018192e-01, 2.27785798e-01, 1.56093777e-01, 1.28651383e-01,
       1.01555805e-01, 5.86261325e-02, 4.09953826e-02, 7.09477197e-03,
       1.78757536e-04])
# Scree plot

plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance')

From the plot, it can be seen that 5 principal components explain almost 90% of the variance in the data.

So instead of giving all the features as inputs, we'd feed only these 5 principal components of the data to the machine learning algorithm, and we'd obtain a similar result.

pca = PCA(n_components=5)
final_data = pca.fit_transform(new_data)

df = pd.DataFrame(data=final_data,
                  columns=['pca1','pca2','pca3','pca4','pca5'])

x1 = df
# Splitting training & testing data

from sklearn.model_selection import train_test_split

x1_train,x1_test,y_train,y_test = train_test_split(x1,y,test_size=0.2,random_state=73)

# Creating model

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(x1_train,y_train)

y1_predict = lr.predict(x1_test)
from sklearn.metrics import accuracy_score

score2 = accuracy_score(y_test,y1_predict)
score2
Output:
0.627906976744186
