Pearson Correlation Coefficient: The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables and is denoted by r.

  • Pearson correlation only works on continuous numerical data, not on categorical data.
  • The value of this correlation coefficient ranges from -1 to 1,
    • 1 signifies positive correlation Fig(1),
    • 0 represents no correlation Fig(2) and
    • -1 represents negative correlation between the two variables Fig(3).

[Scatter-plot illustrations: Fig(1) positive correlation, Fig(2) no correlation, Fig(3) negative correlation]
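
For reference, r is defined as the covariance of the two variables divided by the product of their standard deviations. Below is a minimal sketch (the data and variable names are illustrative) that computes r from this definition and cross-checks it against NumPy's np.corrcoef:

import numpy as np

# Illustrative data: y is roughly a noisy linear function of x
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Pearson's r from its definition: covariance / (std_x * std_y)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # the two values should match closely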

Python Implementation to find the correlation between continuous variables:

Using SciPy:

  • It is preferable to use stats.pearsonr() when you only need the correlation between a few pairs of variables, since it also returns the p-value.
# Import libraries
import scipy.stats as stats
import numpy as np

# Create two random data sets of 20 values each (shape (1, 20))
data_1 = np.random.randn(1, 20)
data_2 = np.random.randn(1, 20)

# Find the correlation coefficient and p-value
stats.pearsonr(data_1[0], data_2[0])
Output:
PearsonRResult(statistic=0.11598285514859002, pvalue=0.6262995823042874)

Interpretation: Based on this result, there appears to be little to no linear relationship between the two variables. The correlation coefficient is close to 0, and the p-value is large, indicating that the observed correlation is likely due to random chance rather than a meaningful relationship.

In summary, a small p-value (typically < 0.05) would suggest a significant relationship, while a large p-value suggests that the observed correlation is not statistically significant.
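
Continuing the example above, the returned coefficient and p-value can be unpacked and compared against a chosen significance level; a small sketch, assuming the conventional 0.05 threshold:

# Unpack the correlation coefficient and p-value
r, p_value = stats.pearsonr(data_1[0], data_2[0])

# Interpret at an (assumed) significance level of 0.05
alpha = 0.05
if p_value < alpha:
    print(f"r = {r:.3f}: statistically significant linear relationship")
else:
    print(f"r = {r:.3f}: no statistically significant linear relationship")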

Using a pandas DataFrame:

  • pandas has a built-in method, .corr(), that computes the Pearson correlation between every pair of numeric columns in a DataFrame, so we will use it here.
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Load the iris data set bundled with seaborn
data = sns.load_dataset("iris")

# Find the Pearson correlation between the numeric columns
# (numeric_only=True excludes the categorical "species" column)
data.corr(method="pearson", numeric_only=True)
Output:
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000
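
If only a single pair of columns is of interest, the coefficient can also be computed directly on two Series with pandas; a small sketch using the iris columns loaded above:

# Pearson correlation between two specific columns
r = data["petal_length"].corr(data["petal_width"], method="pearson")
print(r)  # about 0.96, matching the matrix above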

Using Pairplot:

  • A pairplot draws a scatter plot for every pair of numeric columns, giving a quick visual impression of the correlations computed above.

# Pairwise scatter plots of the iris columns
sns.pairplot(data)
plt.show()

Using Heatmaps:

  • Every box of this heat-map will be a representation of the correlation coefficient between the corresponding columns in the grid.

  • Brighter (warmer) colors represent higher correlation values, while darker colors represent lower or negative values.

# Plot a correlation heat map using the built-in heatmap function from seaborn
sns.heatmap(data.corr(numeric_only=True), annot=True)
plt.title("Correlation Heatmap")
plt.show()

QQ Plot (quantile-quantile plot):

  • The quantile-quantile plot is a graphical method for determining whether two samples of data came from the same population or not.

  • A Q-Q plot is a plot of the quantiles of the first data set against the quantiles of the second data set (a hand-built example is sketched after this list).

  • Usage:

    The Quantile-Quantile plot is used for the following purpose:

    • Determine whether two samples come from the same population.
    • Determine whether two samples have similar tail behavior.
    • Determine whether two samples have the same distribution shape.
    • Determine whether two samples have common location and scale.
  • Types of Q-Q plots:

    • For a left-tailed distribution
    • For the uniform distribution
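
As a hand-built illustration of the definition above, a two-sample Q-Q plot can be drawn by computing the same set of quantiles for both samples and plotting one against the other; a minimal sketch with illustrative sample sizes and distributions:

import numpy as np
import matplotlib.pyplot as plt

# Two illustrative samples (sizes do not need to match)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0, scale=1, size=300)
sample_b = rng.normal(loc=0, scale=1, size=500)

# Compute the same set of quantiles for both samples
probs = np.linspace(0.01, 0.99, 99)
quantiles_a = np.quantile(sample_a, probs)
quantiles_b = np.quantile(sample_b, probs)

# Plot quantiles of one sample against quantiles of the other
plt.scatter(quantiles_a, quantiles_b, s=10)
plt.plot(quantiles_a, quantiles_a, color="red")  # y = x reference line
plt.xlabel("Quantiles of sample A")
plt.ylabel("Quantiles of sample B")
plt.title("Two-sample Q-Q plot")
plt.show()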

How to decide whether a given Q-Q plot corresponds to a normal distribution or not?

In a Q-Q plot, if the variable follows a normal distribution, its quantiles should fall along a 45-degree line (y = x) when plotted against the theoretical quantiles; that is, the points should roughly follow a straight diagonal line. Any deviations from this line indicate departures from normality.

  • Interpretation:
    • Points along the diagonal line suggest a good fit to a normal distribution.
    • Points deviating from the line suggest departures from normality:
      • A curve that bends upward (convex pattern) indicates right-skewed data.
      • A curve that bends downward (concave pattern) indicates left-skewed data.
      • An S-shaped pattern indicates tails that are heavier or lighter than those of a normal distribution (a heavy-tailed example is sketched at the end of this section).

Python implementation for QQ plot:

QQ plot which follows Normal Distribution:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate example data from a normal distribution
np.random.seed(0)
data_normal = np.random.normal(0, 1, 1000)

# Create QQ plot for normal distribution
stats.probplot(data_normal, dist="norm", plot=plt)
plt.title('Q-Q Plot of Normal Distribution')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Ordered Values')
plt.grid(True)
plt.show()

QQ plot which does not follow Normal Distribution:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate data from a chi-squared distribution (not normal)
np.random.seed(0)
data_not_normal = np.random.chisquare(df=3, size=1000)  # Chi-squared with 3 degrees of freedom

# Create QQ plot for the non-normal data
stats.probplot(data_not_normal, dist="norm", plot=plt)
plt.title('Q-Q Plot of Non-Normal Distribution (Chi-Squared)')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Ordered Values')
plt.grid(True)
plt.show()
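
The S-shaped pattern described in the interpretation notes can be reproduced with a heavy-tailed sample; a sketch using a Student's t distribution with 3 degrees of freedom (an assumed choice):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate heavy-tailed data from a Student's t distribution
np.random.seed(0)
data_heavy_tails = np.random.standard_t(df=3, size=1000)

# Create QQ plot against the normal distribution
stats.probplot(data_heavy_tails, dist="norm", plot=plt)
plt.title('Q-Q Plot of a Heavy-Tailed Distribution (Student t)')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Ordered Values')
plt.grid(True)
plt.show()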
