Exploratory Analysis – Data4Fashion

Measure of Central Tendency:

A measure of central tendency is a single value that describes a data set by identifying its central position.

The three common measures of central tendency are:

Mean:
- The mean (average) is the sum of all the values in the data set divided by the number of values
- Disadvantage of mean: it is sensitive to outliers (if present in the data set)
- Commonly used to impute missing values in normally distributed data
Median:
- The median is the middle number after arranging the numbers from smallest to largest; with an even number of values, it is the average of the two middle numbers
- The median is not strongly influenced by outliers
- Commonly used to impute missing values in skewed data
Mode:
- The mode is the value that occurs most often within a set of values
- It is the usual measure for categorical data
- Disadvantage of mode: a data set may have several modes
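
The effect of an outlier on the mean and the median can be seen in a small sketch. It uses Python's built-in statistics module and the age list from this article; the extra value 120 is made up purely for illustration.

```python
import statistics

# Ages from this article, plus one made-up extreme value (120) for illustration
ages = [32, 35, 32, 37, 32, 38, 39, 40, 35, 35, 36]
ages_with_outlier = ages + [120]

mean_before = statistics.mean(ages)
mean_after = statistics.mean(ages_with_outlier)
median_before = statistics.median(ages)
median_after = statistics.median(ages_with_outlier)

# The mean shifts by several years, the median by only half a year
print("mean   :", mean_before, "->", mean_after)
print("median :", median_before, "->", median_after)
```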

Python implementation for Mean, Median & Mode:

# 01. Using pandas library

import pandas as pd

# Creating data set using list

age_list =[32,35,32,37,32,38,39,40,35,35,36]

# Creating data set using 1D array(series)

age_series =pd.Series([39,35,33,37,38,38,39,40,35,35,39])

print("Using 1D array or series")
print("mean :",age_series.mean())
print("median :",age_series.median())
print("mode :",age_series.mode()[0])  # Use [0] to get the first mode in case of several modes

print("-----------------------------")

print("Using list")
# A plain list has no mean(), median() or mode() method, so the next line raises AttributeError
print("mean :",age_list.mean())
print("median :",age_list.median())
print("mode :",age_list.mode())

Output:
Using 1D array or series
mean : 37.09090909090909
median : 38.0
mode : 35
-----------------------------
Using list
AttributeError: 'list' object has no attribute 'mean'
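
One way around the error, sketched here as an option rather than the only fix, is to wrap the list in a pandas Series first. The variable names age_from_list, list_mean, list_median and list_modes are illustrative.

```python
import pandas as pd

age_list = [32, 35, 32, 37, 32, 38, 39, 40, 35, 35, 36]

# A plain list has no .mean()/.median()/.mode(), but a Series built from it does
age_from_list = pd.Series(age_list)

list_mean = age_from_list.mean()
list_median = age_from_list.median()
list_modes = age_from_list.mode().tolist()  # every mode, sorted ascending

print("mean :", list_mean)
print("median :", list_median)
print("modes :", list_modes)
```

Calling .mode() without [0] keeps all modes, which matters here because 32 and 35 both occur three times.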

# 02. Using Numpy library

import numpy as np

# Creating data set using list

age_list =[32,35,32,37,32,38,39,40,35,35,36]

# Creating data set using 1D array(series)

age_series =pd.Series([39,35,33,37,38,38,39,40,35,35,39])

print("Using 1D array or series")
print("mean :",np.mean(age_series))
print("median :",np.median(age_series))
#print("mode :",np.mode(age_series)) #Not executed as 'numpy' has no attribute 'mode'

print("-----------------------------")

print("Using list")
print("mean :",np.mean(age_list))
print("median :",np.median(age_list))
#print("mode :",np.mode(age_list)) #Not executed as 'numpy' has no attribute 'mode'

Output:
Using 1D array or series
mean : 37.09090909090909
median : 38.0
-----------------------------
Using list
mean : 35.54545454545455
median : 35.0
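
NumPy has no mode function, but the mode can still be derived from value counts. The following is a minimal sketch using np.unique with return_counts=True; the names values, counts and modes are illustrative.

```python
import numpy as np

age_list = [32, 35, 32, 37, 32, 38, 39, 40, 35, 35, 36]

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(age_list, return_counts=True)

# Keep every value whose count equals the highest count
modes = values[counts == counts.max()]

print("modes :", modes.tolist())
```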

# 03. Using Statistics library

import statistics as stats

# Creating data set using list

age_list =[32,35,32,37,32,38,39,40,35,35,36]

# Creating data set using 1D array(series)

age_series =pd.Series([39,35,33,37,38,38,39,40,35,35,39])

print("Using 1D array or series")
print("mean :",stats.mean(age_series))
print("median :",stats.median(age_series))
print("mode :",stats.mode(age_series))

print("-----------------------------")

print("Using list")
print("mean :",stats.mean(age_list))
print("median :",stats.median(age_list))
print("mode :",stats.mode(age_list))

# The statistics module can compute mean, median and mode from both a list and a pandas Series

Output:
Using 1D array or series
mean : 37.09090909090909
median : 38
mode : 39
-----------------------------
Using list
mean : 35.54545454545455
median : 35
mode : 32
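
Both data sets above have two values tied for most frequent, and statistics.mode() reports only the first one it encounters (Python 3.8 and later). As a sketch, statistics.multimode() lists all of them; age_values below is a plain-list copy of the series data.

```python
import statistics

age_list = [32, 35, 32, 37, 32, 38, 39, 40, 35, 35, 36]
age_values = [39, 35, 33, 37, 38, 38, 39, 40, 35, 35, 39]  # same values as age_series

# mode() returns a single value: the first mode encountered in the data
first_mode = statistics.mode(age_list)

# multimode() returns every mode, in order of first appearance
all_modes_list = statistics.multimode(age_list)
all_modes_values = statistics.multimode(age_values)

print("first mode :", first_mode)
print("all modes (list) :", all_modes_list)
print("all modes (series values) :", all_modes_values)
```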

# 04. Using scipy library

from scipy import stats   # note: this rebinds the name 'stats', which pointed to the statistics module above

# Creating data set using list

age_list =[32,35,32,37,32,38,39,40,35,35,36]

# Creating data set using 1D array(series)

age_series =pd.Series([39,35,33,37,38,38,39,40,35,35,39])

print("Using 1D array or series")
#print("mean :",stats.mean(age_series)) #Not executed as 'stats' has no attribute 'mean'
#print("median :",stats.median(age_series)) #Not executed as 'stats' has no attribute 'median'
print("mode :",stats.mode(age_series))

print("-----------------------------")

print("Using list")
#print("mean :",stats.mean(age_list)) #Not executed as 'stats' has no attribute 'mean'
#print("median :",stats.median(age_list)) #Not executed as 'stats' has no attribute 'median'
print("mode :",stats.mode(age_list))

Output:
Using 1D array or series
mode : ModeResult(mode=array([35]), count=array([3]))
-----------------------------
Using list
mode : ModeResult(mode=array([32]), count=array([3]))

Data Variability or Measure of Dispersion:

Spread: How widely the data is dispersed around its center.

Commonly used measures of dispersion:

Range: Maximum value – Minimum value
Interquartile Range: The distance between the third and the first quartile (Q3-Q1). It measures the spread of the middle 50% of a data set.
- Percentiles: A percentile is the value below which a given percentage of the observations fall.

Location Lp = (n+1)*P/100, where n = number of observations and P = desired percentile. Lp is a position in the sorted data, not the value itself.

For example, the 25th percentile (also known as the first quartile) is the value below which 25% of the data points in the data set fall.

- Quartiles: A quartile is a type of quantile that splits the sorted data into four equal parts.

The first quartile (Q1) is the middle number between the smallest number and the median of the data set; its position is (n+1)/4.

The second quartile (Q2) is the median of the data set; its position is (n+1)/2.

The third quartile (Q3) is the middle number between the median and the largest value of the data set; its position is 3(n+1)/4.
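
The location formula can be sketched in plain Python. The helper value_at_location is hypothetical (not a library function) and assumes linear interpolation between the two neighbouring values when Lp is not a whole number; the data set is the one used in the implementation further down.

```python
import statistics

data = sorted([346, 47, 56, 2, 36, 39, 75, 79, 79, 88, 89,
               91, 92, 93, 96, 97, 101, 105, 112, 115])

def value_at_location(sorted_values, p):
    """Return the p-th percentile using the location Lp = (n + 1) * p / 100."""
    n = len(sorted_values)
    location = (n + 1) * p / 100          # 1-based position in the sorted data
    location = min(max(location, 1), n)   # keep the position inside the data
    lower = int(location)                 # position of the value just below Lp
    fraction = location - lower           # how far Lp sits towards the next value
    if lower == n:
        return float(sorted_values[-1])
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

q1 = value_at_location(data, 25)   # position (n+1)/4   = 5.25
q2 = value_at_location(data, 50)   # position (n+1)/2   = 10.5
q3 = value_at_location(data, 75)   # position 3(n+1)/4  = 15.75
print("q1 :", q1, "q2 :", q2, "q3 :", q3)

# statistics.quantiles uses the same (n + 1) positions by default (method="exclusive")
print(statistics.quantiles(data, n=4))
```

These values need not match np.percentile, whose default linear method places the percentile at position (n-1)*P/100 counted from zero, so different libraries can report slightly different quartiles for the same data.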

Variance: Variance measures the width of a distribution, that is, how spread out the data is around the mean.

Variance (σ²) = ∑(X − μ)² / N, where X is each data point, μ is the mean, and N is the number of values. This is the population variance; the sample variance divides by N − 1 instead.

Standard Deviation: The square root of the variance. It quantifies the amount of variation or dispersion in a set of data values, in the same units as the data.

S.D. = √(σ²) = σ
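
As a worked check of the two formulas, the sketch below computes the variance and standard deviation by hand with only the standard library. It uses the data set from the implementation further down; the sample variance (dividing by N - 1) is included only for comparison.

```python
import math

data = [346, 47, 56, 2, 36, 39, 75, 79, 79, 88, 89,
        91, 92, 93, 96, 97, 101, 105, 112, 115]

N = len(data)
mu = sum(data) / N                                   # mean
squared_deviations = [(x - mu) ** 2 for x in data]   # (X - mu)^2 for each point

population_variance = sum(squared_deviations) / N        # divides by N
population_sd = math.sqrt(population_variance)           # square root of the variance
sample_variance = sum(squared_deviations) / (N - 1)      # divides by N - 1

print("mean :", mu)
print("population variance :", population_variance)
print("population SD :", population_sd)
print("sample variance :", sample_variance)
```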

Z-Score: Indicates how many standard deviations a value is from the mean.

Formula: z = (X – μ) / σ
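
A minimal sketch of the z-score formula, using the student ages from the outlier section of this article and the population standard deviation (statistics.pstdev). The name z_by_age is illustrative.

```python
import statistics

student_age = [22, 25, 30, 33, 24, 22, 21, 22, 23, 24,
               26, 28, 26, 29, 29, 30, 31, 20, 45, 15]

mu = statistics.fmean(student_age)      # mean
sigma = statistics.pstdev(student_age)  # population standard deviation (divides by N)

# z = (X - mu) / sigma for every value
z_scores = [(x - mu) / sigma for x in student_age]
z_by_age = dict(zip(student_age, z_scores))

print("mean :", mu, "sigma :", round(sigma, 4))
print("z-score of 45 :", round(z_by_age[45], 2))
print("z-score of 15 :", round(z_by_age[15], 2))
```

A value of 45 sits a little more than three standard deviations above the mean, while 15 is less than two below it.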

Python implementation for percentile, quartile, interquartile range, variance, Standard deviation:

# importing library
import numpy as np
import pandas as pd
from scipy import stats

# Creating dataset
data=[346,47,56,2,36,39,75,79,79,88,89,91,92,93,96,97,101,105,112,115]
data1=pd.Series(data)

# Calculating Quartile

  #First quartile or 25% percentile
q1 = data1.quantile(0.25)                         #Using pandas 1d array
print("First quartile or 25% percentile,q1 :",q1)

q1 = np.percentile(data, 25, method="midpoint")  #"method" was called "interpolation" before NumPy 1.22
print("First quartile or 25% percentile,q1 :",q1)

  #Third quartile or 75% percentile
q3 = data1.quantile(0.75)                         #Using pandas 1d array
print("Third quartile or 75% percentile,q3 :",q3)

q3 = np.percentile(data, 75,method="midpoint")  #"method" was called "interpolation" before NumPy 1.22
print("Third quartile or 75% percentile,q3 :",q3)

# Calculating IQR

IQR1=q3-q1           # Direct formula
print("IQR1 :",IQR1)

q3, q1 = np.percentile(data,[75,25])   #Using numpy & list
IQR2 = q3-q1
print("q1 :",q1)
print("q3 :",q3)
print("IQR2 :",IQR2)

q75, q25 = np.percentile(age_series,[75,25])  #Using numpy & pandas (age_series comes from the central tendency examples above)
IQR3 = q75-q25
print("IQR3 :",IQR3)

IQR4=stats.iqr(data,interpolation="midpoint")   # Using scipy.stats
print("IQR4 :",IQR4)

## Note: the default for both interpolation & method is linear (i + (j - i)*fraction);
## with midpoint ((i + j)/2), the average of the two neighbouring values is returned

# Calculating variance

Variance1=np.var(data)                   #Using numpy  , default ddof=0 , formula divides by N
print("Variance1 :",Variance1)

Variance2=age_series.var(ddof = 1)      #Using pandas , ddof=1 (pandas default) , formula divides by N-1
print("Variance2 :",Variance2)

# Calculating Standard Deviation

Standard_Deviation1=np.std(data)                            #Using numpy , default ddof=0 , formula divides by N
print("Standard_Deviation1 :",Standard_Deviation1)

Standard_Deviation2=age_series.std(ddof = 1)               #Using pandas , ddof=1 (pandas default) , formula divides by N-1
print("Standard_Deviation2 :",Standard_Deviation2)

Output:
First quartile or 25% percentile,q1 : 70.25
First quartile or 25% percentile,q1 : 65.5
Third quartile or 75% percentile,q3 : 98.0
Third quartile or 75% percentile,q3 : 99.0
IQR1 : 33.5
q1 : 70.25
q3 : 98.0
IQR2 : 27.75
IQR3 : 4.0
IQR4 : 33.5
Variance1 : 4187.79
Variance2 : 5.090909090909092
Standard_Deviation1 : 64.71313622441737
Standard_Deviation2 : 2.256304299271065

Empirical Rule:

The empirical rule states that, for a normal distribution, nearly all of the data falls within three standard deviations of the mean.

About 68% of the data falls within one SD of the mean
About 95% of the data falls within two SDs of the mean
About 99.7% of the data falls within three SDs of the mean
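
These percentages can be checked without any data. For a normal distribution, the share of values within k standard deviations of the mean equals erf(k / √2); the sketch below evaluates that with the standard library. The helper name share_within is illustrative.

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD : {share_within(k):.2%}")
```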

Five number summary:

A five-number summary is especially useful in descriptive analyses.

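The summary consists of five values, presented together and ordered from lowest to highest:

Minimum value = smallest observation (when outliers are excluded, as in a boxplot, the lower limit is Q1 – 1.5 * IQR)
Lower quartile (Q1) = 25th percentile
Median value (Q2) = 50th percentile
Upper quartile (Q3) = 75th percentile
Maximum value = largest observation (when outliers are excluded, as in a boxplot, the upper limit is Q3 + 1.5 * IQR)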

** Note: IQR = Q3 – Q1
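
A short sketch of the summary with NumPy, using the student ages from the outlier section below. It reports the observed minimum and maximum alongside the Q1 – 1.5 * IQR and Q3 + 1.5 * IQR limits, so both readings of "minimum" and "maximum" are visible; the variable names are illustrative.

```python
import numpy as np

student_age = [22, 25, 30, 33, 24, 22, 21, 22, 23, 24,
               26, 28, 26, 29, 29, 30, 31, 20, 45, 15]

# The 0th and 100th percentiles are the observed minimum and maximum
minimum, q1, median, q3, maximum = np.percentile(student_age, [0, 25, 50, 75, 100])

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper_limit = q3 + 1.5 * iqr   # values above this are flagged as outliers

print("five-number summary :", minimum, q1, median, q3, maximum)
print("lower_limit :", lower_limit, "upper_limit :", upper_limit)
```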

Outliers:

An outlier is a data point that lies at an abnormal distance from the rest of the data points.

Python implementation of finding outliers:

# importing library

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

# Define dataset

student_age = [22,25,30,33,24,22,21,22,23,24,26,28,26,29,29,30,31,20,45,15]

# Find outlier using z-score

# Defining function
def detect_outliers(data):
  outliers = []   # created inside the function so repeated calls do not accumulate results
  threshold = 3   # 3 standard deviations, from the empirical rule
  mean = np.mean(data)
  std = np.std(data)

  for i in data:
    z_score = (i-mean)/std
    if np.abs(z_score)>threshold:
      outliers.append(i)
  return outliers

#Finding outlier using created function

detect_outliers(student_age)

Output:
[45]
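
The same check can be written without an explicit loop. This is a sketch of a vectorised variant using a NumPy array; like np.std above, .std() here uses the population formula (ddof=0). The names ages, z and z_outliers are illustrative.

```python
import numpy as np

ages = np.array([22, 25, 30, 33, 24, 22, 21, 22, 23, 24,
                 26, 28, 26, 29, 29, 30, 31, 20, 45, 15])

# z-score of every value at once; .std() divides by N (ddof=0) by default
z = (ages - ages.mean()) / ages.std()

# Boolean mask selects the values more than 3 standard deviations from the mean
z_outliers = ages[np.abs(z) > 3]

print("outliers :", z_outliers.tolist())
```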

# Find outlier using IQR

# sort data
student_age = sorted(student_age)
print("student_age :",student_age)

# calculating q1 & q3
q1,q3 = np.percentile(student_age,[25,75])
print("q1 :",q1,"q3 :",q3)

# calculating iqr
iqr = q3 - q1
print("iqr :",iqr)

# Finding lower bound(min value) and upper bound(max value)
lower_bound = q1 - (1.5*iqr)
upper_bound = q3 + (1.5*iqr)
print("lower_bound :",lower_bound,"upper_bound :",upper_bound)


# Finding outlier
outliers = []

for i in student_age:
  if i<lower_bound or i>upper_bound:
    outliers.append(i)

print("outliers :",outliers)

Output:
student_age : [15, 20, 21, 22, 22, 22, 23, 24, 24, 25, 26, 26, 28, 29, 29, 30, 30, 31, 33, 45]
q1 : 22.0 q3 : 29.25
iqr : 7.25
lower_bound : 11.125 upper_bound : 40.125
outliers : [45]

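# imputing outlier with the mean (note: this mean still includes the outlier)

student_age1=pd.Series(student_age,dtype=float)   # float dtype so the mean (26.25) can be stored
student_age1.loc[student_age1>upper_bound] = np.mean(student_age1)
print(student_age1)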

Output:
0     15.00
1     20.00
2     21.00
3     22.00
4     22.00
5     22.00
6     23.00
7     24.00
8     24.00
9     25.00
10    26.00
11    26.00
12    28.00
13    29.00
14    29.00
15    30.00
16    30.00
17    31.00
18    33.00
19    26.25
dtype: float64
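
Because the mean is itself pulled up by the outlier, one alternative is to impute with the median of the remaining values. This is a sketch of that option, not the approach used in this article; it recomputes the IQR bounds with pandas, and the names is_outlier, median_of_inliers and imputed are illustrative.

```python
import pandas as pd

student_age = [22, 25, 30, 33, 24, 22, 21, 22, 23, 24,
               26, 28, 26, 29, 29, 30, 31, 20, 45, 15]
ages = pd.Series(student_age, dtype=float)

# IQR bounds, as in the section above
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag outliers, then replace them with the median of the non-outlier values
is_outlier = (ages < lower_bound) | (ages > upper_bound)
median_of_inliers = ages[~is_outlier].median()
imputed = ages.mask(is_outlier, median_of_inliers)

print("replacement value :", median_of_inliers)
print("max after imputation :", imputed.max())
```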

# Find outlier using boxplot

import seaborn as sns

#Before outlier removal (original data)
plt.subplot(2,2,1)
sns.boxplot(x=student_age,orient="h",color="salmon")   # pass the data as x= so it works across seaborn versions
plt.title('Before outlier removal',color="blue")

#After outlier removal (outlier replaced by the mean)
plt.subplot(2,2,2)
sns.boxplot(x=student_age1,orient="h",color="darkred")
plt.title('After outlier removal',color="blue")

Output:
Text(0.5, 1.0, 'After outlier removal')
