Measure of Central Tendency:
A measure of central tendency is a single value that describes a set of data by identifying the central position within that set of data.
The three common measures of central tendency are:
- Mean:
  - The mean (average) is the sum of all the values in the data set divided by the number of values.
  - Disadvantage: it is influenced by outliers, if any are present in the data set.
  - Commonly used for imputing missing values in normally distributed data.
- Median:
  - The median is the middle value after arranging the numbers from smallest to largest; with an even number of values, it is the average of the two middle values.
  - The median is largely unaffected by outliers.
  - It is used for imputing missing values in skewed data.
- Mode:
  - The mode is the value that occurs most often in a data set.
  - It is used for categorical data.
  - Disadvantage: a data set may have several modes.
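To make these trade-offs concrete, here is a minimal sketch that uses only Python's standard `statistics` module. The ages are the `age_list` values used in this article; the extra value 120 is a made-up outlier, added purely for illustration.

```python
import statistics

# Ages taken from the article's age_list
ages = [32, 35, 32, 37, 32, 38, 39, 40, 35, 35, 36]
# Hypothetical outlier (120) appended for illustration only
ages_with_outlier = ages + [120]

mean_before = statistics.mean(ages)
mean_after = statistics.mean(ages_with_outlier)
median_before = statistics.median(ages)
median_after = statistics.median(ages_with_outlier)

print("mean   :", round(mean_before, 2), "->", round(mean_after, 2))
print("median :", median_before, "->", median_after)

# multimode() returns every mode, in order of first appearance
modes = statistics.multimode(ages)
print("modes  :", modes)
```

In this sketch the single outlier shifts the mean by about seven years while the median moves by only half a year, and `multimode` shows that the data set has two modes (32 and 35).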
Python implementation for Mean, Median & Mode:
# 01. Using the pandas library
import pandas as pd
# Creating a data set as a plain Python list
age_list = [32, 35, 32, 37, 32, 38, 39, 40, 35, 35, 36]
# Creating a data set as a 1D array (pandas Series)
age_series = pd.Series([39, 35, 33, 37, 38, 38, 39, 40, 35, 35, 39])
print("Using 1D array or series")
print("mean :", age_series.mean())
print("median :", age_series.median())
print("mode :", age_series.mode()[0])  # mode() returns all modes in sorted order; [0] picks the first one
print("-----------------------------")
print("Using list")
# A plain list has no mean(), median() or mode() method, so the first call already fails
try:
    print("mean :", age_list.mean())
    print("median :", age_list.median())
    print("mode :", age_list.mode())
except AttributeError as err:
    print("AttributeError:", err)
Output:
Using 1D array or series
mean : 37.09090909090909
median : 38.0
mode : 35
-----------------------------
Using list
AttributeError: 'list' object has no attribute 'mean'
# 02. Using Numpy library
import numpy as np
# Creating data set using list
age_list =[32,35,32,37,32,38,39,40,35,35,36]
# Creating data set using 1D array(series)
age_series =pd.Series([39,35,33,37,38,38,39,40,35,35,39])
print("Using 1D array or series")
print("mean :",np.mean(age_series))
print("median :",np.median(age_series))
#print("mode :",np.mode(age_series)) #Not executed as 'numpy' has no attribute 'mode'
print("-----------------------------")
print("Using list")
print("mean :",np.mean(age_list))
print("median :",np.median(age_list))
#print("mode :",np.mode(age_list)) #Not executed as 'numpy' has no attribute 'mode'
Output:
Using 1D array or series
mean : 37.09090909090909
median : 38.0
-----------------------------
Using list
mean : 35.54545454545455
median : 35.0
# 03. Using Statistics library
import statistics as stats
# Creating data set using list
age_list =[32,35,32,37,32,38,39,40,35,35,36]
# Creating data set using 1D array(series)
age_series =pd.Series([39,35,33,37,38,38,39,40,35,35,39])
print("Using 1D array or series")
print("mean :",stats.mean(age_series))
print("median :",stats.median(age_series))
print("mode :",stats.mode(age_series))
print("-----------------------------")
print("Using list")
print("mean :",stats.mean(age_list))
print("median :",stats.median(age_list))
print("mode :",stats.mode(age_list))
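# The statistics module can return mean, median and mode from both a list and a Series
# statistics.mode returns the first mode it encounters (39 for the Series), whereas pandas mode()[0] returns the smallest (35)
Output:
Using 1D array or series
mean : 37.09090909090909
median : 38
mode : 39
-----------------------------
Using list
mean : 35.54545454545455
median : 35
mode : 32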
# 04. Using scipy library
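from scipy import stats # Note: this rebinds the name stats, replacing the statistics module imported in the previous example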
# Creating data set using list
age_list =[32,35,32,37,32,38,39,40,35,35,36]
# Creating data set using 1D array(series)
age_series =pd.Series([39,35,33,37,38,38,39,40,35,35,39])
print("Using 1D array or series")
#print("mean :",stats.mean(age_series)) #Not executed as 'stats' has no attribute 'mean'
#print("median :",stats.median(age_series)) #Not executed as 'stats' has no attribute 'median'
print("mode :",stats.mode(age_series))
print("-----------------------------")
print("Using list")
#print("mean :",stats.mean(age_list)) #Not executed as 'stats' has no attribute 'mean'
#print("median :",stats.median(age_list)) #Not executed as 'stats' has no attribute 'median'
print("mode :",stats.mode(age_list))
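Output:
Using 1D array or series
mode : ModeResult(mode=array([35]), count=array([3]))
-----------------------------
Using list
mode : ModeResult(mode=array([32]), count=array([3]))
Note: newer SciPy releases return scalars rather than one-element arrays here, so the printed result may look slightly different.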
Data Variability or Measure of Dispersion:
Spread describes how widely the data is dispersed around its centre.
Commonly used measures of dispersion:
- Range: maximum value – minimum value.
- Interquartile range (IQR): the distance between the third and the first quartile, $Q_3 - Q_1$. It measures the spread of the middle 50% of a data set.
- Percentiles: a percentile is the value below which a given percentage of the observations fall.
  Location: $L_P = (n + 1) \cdot P / 100$, where $n$ is the number of observations and $P$ is the desired percentile.
  For example, the 25th percentile (also known as the first quartile) is the value below which 25% of the data points fall.
- Quartiles: a quartile is a type of quantile that splits the ordered data into four equal parts.
  The first quartile (Q1) is the middle number between the smallest value and the median of the data set; its location is $(n + 1)/4$.
  The second quartile (Q2) is the median of the data set; its location is $(n + 1)/2$.
  The third quartile (Q3) is the middle number between the median and the largest value of the data set; its location is $3(n + 1)/4$.
- Variance: a measure of the width of a distribution, that is, how far the values are spread around the mean.
  $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, where $X$ is a data point, $\mu$ is the mean and $N$ is the number of values. This is the population variance; the sample variance divides by $N - 1$ instead.
- Standard deviation: the square root of the variance, used to quantify the amount of variation or dispersion in the same units as the data.
  $\sigma = \sqrt{\sigma^2}$
- Z-score: indicates how many standard deviations a value is from the mean.
  $z = (X - \mu) / \sigma$
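As a worked example of the location formula, the sketch below finds the quartiles of a 20-value data set (the same values used in the Python implementation in this article). The helper name `percentile_by_location` is hypothetical, and linear interpolation between neighbouring values is an assumption for fractional locations. The result is cross-checked against `statistics.quantiles`, whose default "exclusive" method uses the same $(n + 1)$ convention.

```python
import math
import statistics

def percentile_by_location(values, p):
    """Percentile via the 1-based location L = (n + 1) * p / 100 (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    # Clamp locations that fall outside the data
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]
    lower = math.floor(location)      # 1-based position of the lower neighbour
    fraction = location - lower       # how far to move towards the upper neighbour
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [346, 47, 56, 2, 36, 39, 75, 79, 79, 88, 89, 91, 92, 93, 96, 97, 101, 105, 112, 115]

q1 = percentile_by_location(data, 25)   # location 5.25
q2 = percentile_by_location(data, 50)   # location 10.5
q3 = percentile_by_location(data, 75)   # location 15.75
print("q1:", q1, "q2:", q2, "q3:", q3)

# Cross-check with the standard library (default method="exclusive")
print(statistics.quantiles(data, n=4))
```

With n = 20, the 25th percentile sits at location 5.25, a quarter of the way from the 5th value (56) to the 6th value (75), which gives 60.75. NumPy's default `linear` method places the percentile at $(n - 1) \cdot P / 100$ counted from zero, so it returns a different value for the same data.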
Python implementation for percentile, quartile, interquartile range, variance, Standard deviation
# importing library
import numpy as np
import pandas as pd
from scipy import stats
# Creating dataset
data=[346,47,56,2,36,39,75,79,79,88,89,91,92,93,96,97,101,105,112,115]
data1=pd.Series(data)
# Calculating Quartile
#First quartile or 25th percentile
q1 = data1.quantile(0.25) #Using a pandas Series (default linear interpolation)
print("First quartile or 25% percentile,q1 :",q1)
q1 = np.percentile(data, 25, method="midpoint") #Using numpy; the method argument was named interpolation before NumPy 1.22
print("First quartile or 25% percentile,q1 :",q1)
#Third quartile or 75th percentile
q3 = data1.quantile(0.75) #Using a pandas Series (default linear interpolation)
print("Third quartile or 75% percentile,q3 :",q3)
q3 = np.percentile(data, 75,method="midpoint") #Using numpy; the method argument was named interpolation before NumPy 1.22
print("Third quartile or 75% percentile,q3 :",q3)
# Calculating IQR
IQR1=q3-q1 # Direct formula
print("IQR1 :",IQR1)
q3, q1 = np.percentile(data,[75,25]) #Using numpy & list
IQR2 = q3-q1
print("q1 :",q1)
print("q3 :",q3)
print("IQR2 :",IQR2)
q75, q25 = np.percentile(age_series,[75,25]) #Using numpy & the pandas Series age_series defined in the previous section
IQR3 = q75-q25
print("IQR3 :",IQR3)
IQR4=stats.iqr(data,interpolation="midpoint") # Using scipy.stats
print("IQR4 :",IQR4)
## Note: the default for both method and interpolation is linear, i + (j - i)*fraction, where i and j are the two neighbouring values;
## with midpoint, the average of the two values, (i + j)/2, is returned
# Calculating variance
Variance1=np.var(data,ddof = 1) #Using numpy; ddof=1 divides by N-1 (sample variance), numpy's default ddof=0 divides by N
print("Variance1 :",Variance1)
Variance2=age_series.var(ddof = 1) #Using pandas; ddof=1 (the pandas default) also divides by N-1
print("Variance2 :",Variance2)
# Calculating Standard Deviation
Standard_Deviation1=np.std(data,ddof = 1) #Using numpy; ddof=1 divides by N-1
print("Standard_Deviation1 :",Standard_Deviation1)
Standard_Deviation2=age_series.std(ddof = 1) #Using pandas; ddof=1 also divides by N-1
print("Standard_Deviation2 :",Standard_Deviation2)
Output:
First quartile or 25% percentile,q1 : 70.25
First quartile or 25% percentile,q1 : 65.5
Third quartile or 75% percentile,q3 : 98.0
Third quartile or 75% percentile,q3 : 99.0
IQR1 : 33.5
q1 : 70.25
q3 : 98.0
IQR2 : 27.75
IQR3 : 4.0
IQR4 : 33.5
Variance1 : 4187.79
Variance2 : 5.090909090909092
Standard_Deviation1 : 64.71313622441737
Standard_Deviation2 : 2.256304299271065
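The listing above does not compute a Z-score, and the difference between dividing by N and by N - 1 is easy to miss. The following minimal sketch shows both with the standard `statistics` module. The eight values are hypothetical, chosen so that the mean is 5 and the population standard deviation is 2.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data for illustration

mu = statistics.mean(values)
population_variance = statistics.pvariance(values)  # divides by N
sample_variance = statistics.variance(values)       # divides by N - 1
sigma = statistics.pstdev(values)                   # square root of the population variance

# z = (X - mu) / sigma for every value
z_scores = [(x - mu) / sigma for x in values]

print("mean                :", mu)
print("population variance :", population_variance)
print("sample variance     :", sample_variance)
print("population SD       :", sigma)
print("z-scores            :", z_scores)
```

Here the value 9 has a Z-score of 2.0, meaning it lies two standard deviations above the mean. The sample variance (32/7) is slightly larger than the population variance (4) because it divides by N - 1.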
Empirical Rule:
The empirical rule states that, for a normal distribution, nearly all of the data falls within three standard deviations (SD) of the mean:
- About 68% of the data falls within one SD of the mean
- About 95% of the data falls within two SD of the mean
- About 99.7% of the data falls within three SD of the mean
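These percentages can be checked against the cumulative distribution function of a normal distribution. The sketch below is one way to do so, using `statistics.NormalDist` from the standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

coverage = {}
for k in (1, 2, 3):
    # Probability of falling between -k and +k standard deviations
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {coverage[k]:.4f}")
```

The rule applies to normally distributed data only; for skewed data the proportions can differ noticeably.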
Five number summary:
A five-number summary is especially useful in descriptive analyses.
It consists of five values, presented together and ordered from lowest to highest:
- Minimum value: the smallest observation
- Lower quartile (Q1) = 25th percentile
- Median value (Q2) = 50th percentile
- Upper quartile (Q3) = 75th percentile
- Maximum value: the largest observation
** Note: IQR = Q3 – Q1. In a box plot, the whiskers are limited to the fences Q1 – 1.5 * IQR (lower) and Q3 + 1.5 * IQR (upper); values beyond the fences are treated as outliers.
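A minimal sketch of these values for the 20-value data set used earlier in this article follows. It reports the observed minimum and maximum and, separately, the 1.5 * IQR fences. It uses `statistics.quantiles` with `method="inclusive"`, which for this data gives the same quartiles as the pandas and NumPy defaults shown earlier (70.25 and 98.0).

```python
import statistics

data = [346, 47, 56, 2, 36, 39, 75, 79, 79, 88, 89, 91, 92, 93, 96, 97, 101, 105, 112, 115]

# Quartiles with linear interpolation between neighbouring values
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, q2, q3, max(data))
print("five-number summary:", five_number_summary)

# Fences at 1.5 * IQR beyond the quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
beyond_fences = [x for x in data if x < lower_fence or x > upper_fence]
print("fences:", lower_fence, upper_fence)
print("values beyond the fences:", beyond_fences)
```

For this data the observed minimum (2) and maximum (346) both lie beyond the fences, which is why they would be drawn as individual points rather than as whisker ends in a box plot.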
Outliers:
An outlier is an abnormal value: a data point that lies at an abnormal distance from the rest of the data points.
Python implementation of finding outliers:
# importing library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# Jupyter/IPython magic; omit the next line when running as a plain Python script
%matplotlib inline
# Define dataset
student_age = [22,25,30,33,24,22,21,22,23,24,26,28,26,29,29,30,31,20,45,15]
# Find outliers using the z-score
# Defining the function
def detect_outliers(data):
    outliers = []  # local list, so repeated calls do not accumulate results
    threshold = 3  # three standard deviations, following the empirical rule
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
# Finding outliers using the function defined above
print(detect_outliers(student_age))
Output: [45]
# Find outlier using IQR
# sort data
student_age = sorted(student_age)
print("student_age :",student_age)
# calculating q1 & q3
q1,q3 = np.percentile(student_age,[25,75])
print("q1 :",q1,"q3 :",q3)
# calculating iqr
iqr = q3 - q1
print("iqr :",iqr)
# Finding lower bound(min value) and upper bound(max value)
lower_bound = q1 - (1.5*iqr)
upper_bound = q3 + (1.5*iqr)
print("lower_bound :",lower_bound,"upper_bound :",upper_bound)
# Finding outliers
outliers = []
for i in student_age:
    if i < lower_bound or i > upper_bound:
        outliers.append(i)
print("outliers :",outliers)
Output:
student_age : [15, 20, 21, 22, 22, 22, 23, 24, 24, 25, 26, 26, 28, 29, 29, 30, 30, 31, 33, 45]
q1 : 22.0 q3 : 29.25
iqr : 7.25
lower_bound : 11.125 upper_bound : 40.125
outliers : [45]
# Imputing the outlier with the mean of the data
student_age1 = pd.Series(student_age, dtype=float)  # float dtype so the mean can be stored without a dtype change
student_age1.loc[student_age1 > upper_bound] = student_age1.mean()  # note: this mean still includes the outlier
print(student_age1)
Output:
0     15.00
1     20.00
2     21.00
3     22.00
4     22.00
5     22.00
6     23.00
7     24.00
8     24.00
9     25.00
10    26.00
11    26.00
12    28.00
13    29.00
14    29.00
15    30.00
16    30.00
17    31.00
18    33.00
19    26.25
dtype: float64
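Because the mean is itself pulled upwards by the outlier, imputing with the median is a common alternative for skewed data. The sketch below is one possible variant, not the method used above. It relies only on the standard library, and `statistics.quantiles` with `method="inclusive"` gives the same q1 and q3 for this data as the NumPy call shown earlier.

```python
import statistics

student_age = [22, 25, 30, 33, 24, 22, 21, 22, 23, 24, 26, 28, 26, 29, 29, 30, 31, 20, 45, 15]

# IQR-based bounds, as in the listing above
q1, _, q3 = statistics.quantiles(student_age, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Replace every value outside the bounds with the median
median_age = statistics.median(student_age)
imputed = [
    median_age if (x < lower_bound or x > upper_bound) else x
    for x in student_age
]

print("bounds  :", lower_bound, upper_bound)
print("median  :", median_age)
print("imputed :", imputed)
```

The outlier 45 is replaced by the median, 25.5, instead of the mean, 26.25.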
# Find outliers using a box plot
import seaborn as sns
# Before the outlier is imputed
plt.subplot(2,2,1)
sns.boxplot(x=student_age,color="salmon")  # passing the data as x draws a horizontal box plot
plt.title('Before outlier removal',color="blue")
# After the outlier is imputed
plt.subplot(2,2,2)
sns.boxplot(x=student_age1,color="darkred")
plt.title('After outlier removal',color="blue")
plt.show()
Output: [Figure: two horizontal box plots titled 'Before outlier removal' and 'After outlier removal']