Pearson Correlation Coefficient: The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It is denoted by r.
- Pearson correlation applies only to continuous numerical data, not to categorical data.
- The value of the correlation coefficient ranges from -1 to 1:
- 1 signifies a perfect positive linear correlation, Fig(1),
- 0 represents no linear correlation, Fig(2), and
- -1 represents a perfect negative linear correlation between the two variables, Fig(3).
Fig(1): positive correlation          Fig(2): no correlation          Fig(3): negative correlation
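To make the three cases concrete, here is a minimal sketch that computes r directly from its definition (the covariance of the two variables divided by the product of their standard deviations). The helper name pearson_r and the toy arrays are illustrative assumptions, not part of any library.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_pos = pearson_r(x, 2 * x + 1)       # points on an increasing line
r_neg = pearson_r(x, -3 * x + 7)      # points on a decreasing line
r_zero = pearson_r(x, (x - 3) ** 2)   # symmetric parabola: no linear trend
print(r_pos, r_neg, r_zero)
```

The three values come out at (up to floating-point rounding) 1, -1 and 0. The parabola case is a reminder that r = 0 only rules out a linear relationship, not every relationship.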
Python Implementation to find the correlation between continuous variables:
Using SciPy:
- stats.pearsonr() is convenient when you want the correlation for a single pair of columns (or just a few pairs). It returns both the correlation coefficient and a p-value.
# importing libraries
import scipy.stats as stats
import numpy as np
# Creating data sets
data_1 = np.random.randn(1,20)
data_2 = np.random.randn(1,20)
# Find out correlation coefficient & p-value
stats.pearsonr(data_1[0],data_2[0])
Output: PearsonRResult(statistic=0.11598285514859002, pvalue=0.6262995823042874)
Interpretation: This result shows no significant linear relationship between the two variables. The correlation coefficient (about 0.12) is close to 0, and the p-value (about 0.63) is far above the usual 0.05 threshold, so a correlation this weak could easily arise by chance in unrelated data.
In summary, a small p-value (typically < 0.05) suggests a statistically significant linear relationship, while a large p-value means that the observed correlation is not statistically significant.
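For contrast, the following sketch compares a pair of variables that are related by construction with a pair that are independent. The seed, sample size, and noise level are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary fixed seed so the run is repeatable
x = rng.normal(size=200)
y_related = 2.0 * x + rng.normal(scale=0.5, size=200)  # built from x plus noise
y_unrelated = rng.normal(size=200)                       # drawn independently of x

# pearsonr returns the coefficient and the p-value; tuple unpacking works
# with both the older tuple result and the newer PearsonRResult object
r_related, p_related = stats.pearsonr(x, y_related)
r_unrelated, p_unrelated = stats.pearsonr(x, y_unrelated)

print("related:  ", r_related, p_related)
print("unrelated:", r_unrelated, p_unrelated)
```

The related pair should give a coefficient close to 1 with a very small p-value, while the independent pair should give a coefficient near 0. Its p-value will usually, though not always, be above 0.05.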
Using Pandas Dataframe:
- Pandas DataFrames have a built-in .corr() method that computes the pairwise correlation of all columns at once; Pearson is its default method.
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Load the iris example data set that ships with seaborn
data = sns.load_dataset("iris")
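# Find the Pearson correlation between the numeric columns
# (numeric_only=True skips the categorical "species" column; requires pandas >= 1.5)
data.corr(method="pearson", numeric_only=True)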
Output:
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000
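Because Pearson correlation is defined only for numerical data, categorical columns have to be left out before calling .corr(). The sketch below shows one way to do that on a small hand-made DataFrame, and how to get the coefficient for a single pair of columns. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one categorical column
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [52.0, 58.0, 69.0, 77.0, 90.0],
    "group": ["a", "b", "a", "b", "a"],
})

# Keep only the numeric columns, then compute the full correlation matrix
corr_matrix = df.select_dtypes(include="number").corr(method="pearson")
print(corr_matrix)

# Correlation for one specific pair of columns (Series.corr defaults to Pearson)
r_pair = df["height_cm"].corr(df["weight_kg"])
print(r_pair)
```

select_dtypes(include="number") is used here instead of numeric_only=True only because it also works on older pandas versions; both approaches drop the categorical column.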
Using Pairplot:
sns.pairplot(data)
Using Heatmaps:
Every cell of the heat map represents the correlation coefficient between the corresponding row and column variables.
Color encodes the magnitude of each coefficient. Which colors mark high and low values depends on the colormap; a color bar next to the plot shows the mapping.
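# Draw the correlation heat map using seaborn's built-in heatmap function
# (numeric_only=True skips the categorical "species" column; requires pandas >= 1.5)
sns.heatmap(data.corr(numeric_only=True), annot=True)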
plt.title("Correlation Heatmap")
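To show what the heat map encodes, here is a rough sketch that draws one with plain matplotlib instead of sns.heatmap: each cell is colored by its coefficient, and the value is written inside the cell. The toy columns a, b, c and the choice of the "coolwarm" colormap are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data; any numeric DataFrame would do
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=100),   # positively related to a
    "c": -a + rng.normal(scale=0.5, size=100),  # negatively related to a
})
corr = df.corr()

fig, ax = plt.subplots()
# Fix the color scale to [-1, 1] so the same color always means the same r
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(image, ax=ax, label="Pearson r")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
# Write each coefficient inside its cell, similar to annot=True in seaborn
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
ax.set_title("Correlation Heatmap")
```

Call plt.show() afterwards to display the figure. Pinning the scale to the full range of r and using a diverging colormap makes positive and negative correlations easy to tell apart.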
Q-Q Plot (quantile-quantile plot):
The quantile-quantile plot is a graphical method for determining whether two samples of data came from the same population or not.
A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set.
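A minimal sketch of that construction, assuming three synthetic samples (the seeds, sizes, and distributions are arbitrary choices): the quantiles of each sample are evaluated at the same probability levels, so samples of different sizes can still be paired point by point.

```python
import numpy as np

rng = np.random.default_rng(1)
sample_a = rng.normal(loc=0.0, scale=1.0, size=500)
sample_b = rng.normal(loc=0.0, scale=1.0, size=300)  # same population, different size
sample_c = rng.exponential(scale=1.0, size=300)      # different population

# Evaluate every sample at the same probability levels (1% to 99%)
probs = np.linspace(0.01, 0.99, 99)
q_a = np.quantile(sample_a, probs)
q_b = np.quantile(sample_b, probs)
q_c = np.quantile(sample_c, probs)

# The Q-Q points are the pairs (q_a[i], q_b[i]); here we measure their
# mean absolute distance from the line y = x as a rough summary
gap_same = np.mean(np.abs(q_a - q_b))
gap_diff = np.mean(np.abs(q_a - q_c))
print(gap_same, gap_diff)
```

Plotting q_a against q_b (for example with plt.scatter(q_a, q_b)) gives the Q-Q plot itself. Points from the same population stay close to y = x, so gap_same should be noticeably smaller than gap_diff.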
Usage:
The quantile-quantile plot is used to determine whether two samples:
- come from the same population,
- have the same tail behavior,
- have the same distribution shape,
- have common location behavior.
Types of Q-Q plots:
- For a left-tailed distribution:
- For the uniform distribution:
How to decide whether a given Q-Q plot corresponds to a normal distribution?
In a Q-Q plot against theoretical normal quantiles, the values of a normally distributed variable fall approximately on a straight line. If the data are standardized (mean 0, standard deviation 1), that line is the 45-degree line y = x; otherwise its slope and intercept reflect the standard deviation and mean of the data. Any systematic deviation from the line indicates a departure from normality.
- Interpretation:
- Points along the diagonal line suggest a good fit to a normal distribution.
- Points deviating from the line suggest departures from normality:
- Points curving upward (a convex, bowl-shaped pattern) indicate a right-skewed distribution.
- Points curving downward (a concave, arch-shaped pattern) indicate a left-skewed distribution.
- S-shaped patterns indicate tails that are heavier or lighter than those of a normal distribution.
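The visual check above can be backed by numbers. When called without plot=..., stats.probplot returns the Q-Q points together with a least-squares line fit, including the correlation r between the points and the line. The sketch below compares that r, and the sample skewness, for a normal and a right-skewed sample; the seed and sample sizes are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed_sample = rng.chisquare(df=3, size=1000)  # right-skewed by construction

# probplot returns ((theoretical, ordered), (slope, intercept, r))
(_, _), (slope_n, intercept_n, r_n) = stats.probplot(normal_sample, dist="norm")
(_, _), (slope_s, intercept_s, r_s) = stats.probplot(skewed_sample, dist="norm")

print("normal: r =", r_n, "skewness =", stats.skew(normal_sample))
print("skewed: r =", r_s, "skewness =", stats.skew(skewed_sample))
```

An r very close to 1 means the points hug the straight line; the skewed sample should give a visibly lower r and a clearly positive skewness. This is a rough diagnostic, not a formal normality test.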
Python implementation for Q-Q plot:
QQ plot which follows Normal Distribution:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Generate example data from a normal distribution
np.random.seed(0)
data_normal = np.random.normal(0, 1, 1000)
# Create QQ plot for normal distribution
stats.probplot(data_normal, dist="norm", plot=plt)
plt.title('Q-Q Plot of Normal Distribution')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Ordered Values')
plt.grid(True)
plt.show()
QQ plot which does not follow Normal Distribution:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Generate data from a chi-squared distribution (not normal)
np.random.seed(0)
data_not_normal = np.random.chisquare(df=3, size=1000) # Chi-squared with 3 degrees of freedom
# Create QQ plot for the non-normal data
stats.probplot(data_not_normal, dist="norm", plot=plt)
plt.title('Q-Q Plot of Non-Normal Distribution (Chi-Squared)')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Ordered Values')
plt.grid(True)
plt.show()