**Addition Rule of Probability:**

To calculate the probability that at least one of two events occurs, add the probabilities of the two events and subtract the probability that both occur: P(A or B) = P(A) + P(B) – P(A and B)

If the two events are mutually exclusive, meaning they cannot both happen, then P(A and B) = 0. Therefore: P(A or B) = P(A) + P(B)
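As a small check of the addition rule, the sketch below enumerates the outcomes of one roll of a fair six-sided die. The die and the events A, B, C, and D are illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (an illustrative assumption).
outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}  # event A: roll is even -> {2, 4, 6}
B = {x for x in outcomes if x > 3}       # event B: roll is greater than 3 -> {4, 5, 6}


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union_rule = prob(A) + prob(B) - prob(A & B)
p_union_direct = prob(A | B)  # counted directly from the union
print(p_union_rule, p_union_direct)  # 2/3 2/3

# Mutually exclusive events: C = {1} and D = {6} cannot both happen,
# so P(C and D) = 0 and the probabilities simply add.
C, D = {1}, {6}
p_exclusive = prob(C) + prob(D)
print(p_exclusive == prob(C | D))  # True
```

Using `Fraction` keeps the arithmetic exact, so the rule and the direct count can be compared with `==`.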

**Multiplication Rule of Probability:**

If A and B are **dependent** events, then the probability of both events occurring is given by: P(A ∩ B) = P(B) · P(A|B)

If A and B are two **independent** events in an experiment, then the probability of both events occurring is given by: P(A ∩ B) = P(A) · P(B)
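A minimal sketch of both forms of the multiplication rule follows. The card-drawing and die-rolling scenarios are assumptions chosen for illustration.

```python
from fractions import Fraction

# Dependent events: draw two cards without replacement from a standard 52-card deck.
p_first_ace = Fraction(4, 52)               # P(B): the first card is an ace
p_second_ace_given_first = Fraction(3, 51)  # P(A|B): the second is an ace, given the first was
p_two_aces = p_first_ace * p_second_ace_given_first  # P(A ∩ B) = P(B) · P(A|B)
print(p_two_aces)  # 1/221

# Independent events: two rolls of a fair die; the first roll does not affect the second.
p_six = Fraction(1, 6)
p_two_sixes = p_six * p_six  # P(A ∩ B) = P(A) · P(B)
print(p_two_sixes)  # 1/36
```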

**Bernoulli Distribution:**

Also called the **binary** distribution.

The Bernoulli distribution is a discrete probability distribution that models a random experiment with two possible outcomes: success and failure.

It is often used to represent binary events, where the probability of success (usually denoted as “p”) and the probability of failure (which is complementary to p and denoted as “q”) are constant for each trial.

The probability mass function (PMF) of the Bernoulli distribution is defined as follows:

P(X = x) = p if x = 1, and q = 1 − p if x = 0

Where:

- P(X=x) is the probability that the random variable X takes the value x, which can be either 1 (success) or 0 (failure).
- p is the probability of success (e.g., the probability of an event occurring).
- q is the probability of failure (complementary to p, i.e., q = 1 − p).

In mathematical notation, the Bernoulli distribution is written as X ∼ Bernoulli(p). This means that the random variable X follows a Bernoulli distribution with probability of success p.

The expected value (mean), variance, and standard deviation of a random variable X following a Bernoulli distribution are as follows:

- Expected Value (Mean): E(X) = p
- Variance: Var(X) = p(1 − p)
- Standard Deviation: σ = √(p(1 − p))

These values describe the central tendency and variability of the Bernoulli distribution.
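The mean and variance formulas can be checked directly from the PMF. The sketch below does so for a hypothetical success probability p = 0.3; the helper name `bernoulli_pmf` is introduced only for this illustration.

```python
import math


def bernoulli_pmf(x, p):
    """PMF of a Bernoulli(p) variable: p for x = 1, 1 - p for x = 0, else 0."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0


p = 0.3  # hypothetical probability of success

# Mean and variance computed from the definition, summing over the two outcomes
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))
std_dev = math.sqrt(variance)

print(f"E(X)   = {mean:.4f}  (formula p        = {p:.4f})")
print(f"Var(X) = {variance:.4f}  (formula p(1 - p) = {p * (1 - p):.4f})")
print(f"sigma  = {std_dev:.4f}")
```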

**Binomial Distributions:**

The binomial distribution describes the number of successes in a fixed number, n, of independent Bernoulli trials.

Trials being independent means that the result of one trial is not affected by the result of any other trial.

A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose, and where the probability of success and failure is the same for all the trials is called a Binomial Distribution.

The Bernoulli distribution deals with a **single** trial with two possible outcomes, while the binomial distribution deals with a **fixed number of independent and identical trials**.
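The notes above do not state the binomial PMF. The standard form is P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The sketch below evaluates it for a hypothetical case of n = 10 fair coin flips; the function name and parameter values are assumptions for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # hypothetical: 10 flips of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(round(pmf[3], 4))  # 0.1172  (probability of exactly 3 heads)

total = sum(pmf)                                 # probabilities over k = 0..n sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))   # equals n * p
print(total, mean)
```

With n = 1 the same function reduces to the Bernoulli PMF, which is one way to see the relationship between the two distributions.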

**Poisson Distribution:**

The Poisson distribution is applicable in situations where events occur at random points in time or space and our interest lies only in the number of occurrences of the event.

Example:

- The number of emergency calls recorded at a hospital in a day.
- The number of thefts reported in an area in a day.
- The number of customers arriving at a salon in an hour.

Some notations used in Poisson distribution are:

- λ is the rate at which an event occurs,
- t is the length of a time interval,
- X is the number of events in that time interval.

Here, X is called a Poisson Random Variable, and the probability distribution of X is called Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The probability mass function (PMF) of the Poisson distribution is given by:

**P(X = k) = (e^(−λ) · λ^k) / k! , k = 0, 1, 2, …**

Where:

- **P(X = k)** is the probability of observing **k** events.
- **e** is the base of the natural logarithm (approximately 2.71828).
- **λ** is the average rate of events in the given interval.
- **k** is the number of events you want to find the probability for.
- **k!** is the factorial of **k**.

Example: Suppose you work at a call center, and on average, you receive 5 customer service calls per hour. You want to calculate the probability of receiving a specific number of calls in the next hour.

Probability of receiving exactly 3 calls in the next hour (k = 3 and λ = 5):

**P(X = 3) = (e^(−5) · 5^3) / 3! ≈ 0.1404**

So, there is a roughly 14.04% chance of receiving exactly 3 calls in the next hour.
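The call-center calculation can be reproduced with a few lines of standard-library Python. The helper name `poisson_pmf` is introduced here for illustration, and the "at most 3 calls" figure is an extra quantity not computed in the text above.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


lam = 5  # average of 5 calls per hour, as in the example above

p_exactly_3 = poisson_pmf(3, lam)
print(f"P(X = 3)  = {p_exactly_3:.4f}")  # P(X = 3)  = 0.1404

# Probability of receiving at most 3 calls: sum the PMF over k = 0, 1, 2, 3
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))
print(f"P(X <= 3) = {p_at_most_3:.4f}")  # P(X <= 3) = 0.2650
```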

**Normal Distribution:**

Also known as the Gaussian distribution.

**Example**: Age, weight, height, and the measurements in the Iris dataset are often modeled as approximately normal.

**Importance**:

- Many **hypothesis tests** assume that the data follow it.
- Linear and non-linear regression assume that the **residuals** follow it.
- The central limit theorem states that as the **sample size** increases, the distribution of the sample mean approaches a normal distribution, irrespective of the distribution of the original variable.
- Most **statistical software programs** support the probability functions for the normal distribution.

**Parameters**: The distribution has two main parameters; changing either one changes the location or shape of the curve.

**Mean:** Determines the location of the peak; data points are clustered around the mean. Changing the mean moves the curve to the left or right along the X-axis.

**Standard Deviation:** Determines how far data points typically lie from the mean. A smaller SD tightens the distribution (steeper curve) and a larger SD widens it (flatter curve) along the X-axis.

**Properties**:

- **Symmetric:** The curve is symmetric about the mean, so an equal share of observations is expected on each side of it.
- **mean = median = mode:** All three measures of central tendency fall at the midpoint.
- **Empirical Rule:** About 68% of values lie within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. This is called the 68%-95%-99.7% rule or the 3 sigma rule.
- **Skewness and kurtosis:** Indicate how different a distribution is from the normal distribution. Skewness measures asymmetry and kurtosis measures the thickness of the tails.
- **Area under curve:** 1
- **Standard Normal distribution:** μ = 0 and σ = 1

**Distribution functions:**

**Probability Density Function**

**Cumulative Distribution Function**
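For reference, the normal density is f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)), and the cumulative distribution function can be written with the error function as F(x) = ½ · (1 + erf((x − μ) / (σ√2))). The sketch below implements both with the standard library and uses the CDF to check the 68%-95%-99.7% rule; the function names are assumptions introduced for this illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, written via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


# Probability mass within k standard deviations of the mean (standard normal)
coverage = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
for k, c in coverage.items():
    print(f"within {k} standard deviation(s): {c:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```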

**Normality testing – Skewness & Kurtosis:**

**Skewness:**

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

In practice, it indicates how far a given distribution departs from a symmetric shape such as the normal distribution.

Skewness can be **Positively Skewed** or **Negatively skewed**.

**Positively Skewed:**

- Distribution skewed to the **right**
- Tail spread to the **right**
- Mode < Median < Mean
- Example: Wealth distribution, length of comment on YouTube

**Negatively Skewed:**

- Distribution skewed to the **left**
- Tail spread to the **left**
- Mean < Median < Mode
- Example: Life span of a human being (fewer people die at an early age)

**Measure of Skewness:**

Skewness = 0: Symmetric distribution (a normal distribution has skewness 0, but zero skewness alone does not prove normality)

Skewness > 0: Positively skewed

Skewness < 0: Negatively skewed

**Formula:**

Pearson’s first coefficient (mode skewness) = (Mean − Mode) / Standard Deviation

Pearson’s second coefficient (median skewness) = 3 × (Mean − Median) / Standard Deviation

If the value is positive, skewness is positive.

If the value is negative, skewness is negative.
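As a rough illustration, the median-based Pearson coefficient, 3(Mean − Median) / Standard Deviation, can be computed with the standard library. The data below are the same 15 values used in the Python implementation later in this article. The mode-based formula is skipped because this data set has several modes.

```python
import statistics

# Same 15 values as in the skewness example in this article
data = [88, 85, 82, 97, 67, 77, 74, 86, 81, 95, 77, 88, 85, 76, 81]

mean = statistics.mean(data)
median = statistics.median(data)
std_dev = statistics.stdev(data)  # sample standard deviation

# Median-based Pearson coefficient: 3 * (mean - median) / standard deviation
pearson_second = 3 * (mean - median) / std_dev

if pearson_second > 0:
    direction = "positive"
elif pearson_second < 0:
    direction = "negative"
else:
    direction = "zero"

print(f"mean = {mean:.1f}, median = {median}")  # mean = 82.6, median = 82
print(f"Pearson coefficient (median-based) = {pearson_second:.4f} ({direction} skew)")
```

The sign agrees with the small positive skewness that `scipy.stats.skew` reports for the same data, although the two measures are defined differently and their magnitudes are not comparable.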

**Kurtosis:**

In statistics, kurtosis is a measure of the relative peakedness of a probability distribution, or alternatively of how heavy or how light its tails are.

**Positive excess kurtosis**: when excess kurtosis, given by (kurtosis − 3), is **positive**, the distribution has a **sharp peak** and is called **leptokurtic**.

For Fisher’s definition, kurtosis > 0

For Pearson’s definition, kurtosis > 3

**Negative excess kurtosis**: when excess kurtosis, given by (kurtosis − 3), is **negative**, the distribution has a **flat peak** and is called **platykurtic**.

For Fisher’s definition, kurtosis < 0

For Pearson’s definition, kurtosis < 3

**Zero excess kurtosis**: when excess kurtosis, given by (kurtosis − 3), is **zero**, the distribution has the same kurtosis as a **normal** distribution and is called **mesokurtic**.

For Fisher’s definition, kurtosis = 0

For Pearson’s definition, kurtosis = 3

**Python implementation for Skewness & kurtosis:**

```
# Importing library
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import skew
from scipy.stats import kurtosis
# Creating data for skewness
Skewed_data=[88, 85, 82, 97, 67, 77, 74, 86, 81, 95, 77, 88, 85, 76, 81]
Skewed_data_df= pd.Series(Skewed_data)
print("Skewness importing skew : ",skew(Skewed_data,bias=False))
print("Skewness importing stats :",stats.skew(Skewed_data,bias=False))
print("Skewness using array :",Skewed_data_df.skew())
print("-------------------------------------------------------")
# Creating data Kurtosis
Kurtosis_data=[88, 85, 82, 97, 67, 77, 74, 86, 81, 95, 77, 88, 85, 76, 81]
Kurtosis_data_df= pd.Series(Kurtosis_data)
print("Kurtosis importing Kurtosis : ",kurtosis(Kurtosis_data,bias=False,fisher=True))
print("Kurtosis importing stats :",stats.kurtosis(Kurtosis_data,bias=False,fisher=False))
print("Kurtosis using array :",Kurtosis_data_df.kurtosis())
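# Note:
# If fisher=True, Fisher's definition is used (where normal = 0.0).
# If fisher=False, Pearson's definition is used (where normal = 3.0).
# We use the argument bias=False to calculate the sample skewness and kurtosis as opposed to the population skewness and kurtosis.
# pandas Series.skew() and Series.kurtosis() return bias-corrected values (Fisher's definition for kurtosis),
# which is why they match the bias=False results from scipy.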
```

Output: Skewness importing skew : 0.0326966578855933 Skewness importing stats : 0.0326966578855933 Skewness using array : 0.0326966578855933 ------------------------------------------------------- Kurtosis importing Kurtosis : 0.11815715154945083 Kurtosis importing stats : 3.118157151549451 Kurtosis using array : 0.11815715154945172

**Central Limit Theorem:**

**Definition:** The central limit theorem states that the distribution of sample means approximates a normal distribution as the sample size gets larger (assuming that all samples are identical in size), regardless of the shape of the population distribution.

**Easy explanation:** Whether or not my population is normally distributed, if I take many samples of size n ≥ 30, calculate the mean of each sample, and plot all the sample means, the plot will look approximately normal.

**Properties:**

- Sampling distribution mean (μₓ¯) = Population mean (μ)
- Sampling distribution’s standard deviation (standard error) = σ/√n ≈ S/√n
- For n ≥ 30 (a common rule of thumb), the sampling distribution is approximately normal.

**Python implementation to understand Central Limit Theorem:**

```
# importing library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42) # to make analysis reproducible
# Loading dataset
data = pd.read_csv('/content/sample_data/california_housing_train.csv')
data.rename(columns = {'housing_median_age':'age'}, inplace = True)
# Using age for our purposes to see population distribution
sns.displot(data.age, kde=True)  # kde expects a boolean; True overlays a density curve
# Does not look like a normal distribution
# Mean & standard deviation for population
print('Population mean :',data.age.mean())
print('Population std :',data.age.std())
```

Output: Population mean : 28.58935294117647 Population std : 12.586936981660399

```
# Start sampling & build the sampling distribution
sam_size = 30
# pd.Series converts the list of sample means so that .mean() and .std() are available
sample_means = pd.Series([data.age.sample(sam_size).mean() for i in range(1000)])
print('No of sample means :',len(sample_means))
print('Sample mean :',sample_means.mean())
print('Sample std :',sample_means.std())
# Plotting the density for the sample means.
# histplot replaces the deprecated distplot
sns.histplot(sample_means, kde=True, stat="density", color="darkblue")
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.title('Sampling Distribution of the Sample Mean (Central Limit Theorem)')
# Looks like a normal distribution
```

Output: No of sample means : 1000 Sample mean : 28.521800000000002 Sample std : 2.270686312965189
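The numbers above agree with the standard error formula: σ/√n = 12.587/√30 ≈ 2.30, close to the observed standard deviation of the sample means (about 2.27). The sketch below repeats the same check on a synthetic population, so it runs without the housing CSV. The exponential population, the seed, and the sample counts are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is reproducible

# Hypothetical, clearly non-normal population (exponential with mean 10)
population = rng.exponential(scale=10, size=100_000)

n = 30              # sample size
num_samples = 2000  # number of samples drawn

# Draw all samples at once (with replacement) and take the mean of each row
samples = rng.choice(population, size=(num_samples, n))
sample_means = samples.mean(axis=1)

predicted_se = population.std() / np.sqrt(n)  # sigma / sqrt(n)
observed_se = sample_means.std(ddof=1)        # spread of the sample means

print(f"population mean      : {population.mean():.3f}")
print(f"mean of sample means : {sample_means.mean():.3f}")
print(f"predicted std error  : {predicted_se:.3f}")
print(f"observed std error   : {observed_se:.3f}")
```

The two means and the two standard errors should be close but not identical, since the sample means are themselves random.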

**Log Normal distribution:**

- In probability theory, a log normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed.
- If x follows a log normal distribution, then y = ln(x) follows a normal distribution (ln is the natural log).
- To reverse the transformation, use x = exp(y).

**Python implementation to see log normal distribution:**

```
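# Imports repeated so this block also runs on its own
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Creating log normal data, as we don't have real data
# (the underlying normal distribution has mean 4 and standard deviation 1)
log_normal_dis_data = np.random.lognormal(4, 1, 10000)
plt.subplot(2, 2, 1)
sns.histplot(log_normal_dis_data, kde=True, stat="density")  # histplot replaces the deprecated distplot
plt.title("Log Normal distribution")
# Taking the natural log gives normally distributed data
Normal_dis_data = np.log(log_normal_dis_data)
plt.subplot(2, 2, 2)
sns.histplot(Normal_dis_data, kde=True, stat="density")
plt.title("Normal distribution")
# To convert from normal back to log normal
log_normal_dis_data = np.exp(Normal_dis_data)
plt.subplot(2, 2, 3)
sns.histplot(log_normal_dis_data, kde=True, stat="density")
plt.title("Log Normal distribution (recovered with exp)")
plt.tight_layout()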
```

**Power Law Distribution:**

- Power law is a functional relationship between two quantities where a relative change in one quantity results in a relative change in the other quantity proportional to a power of the change.
- One quantity varies as a power of another.
- Example: 80% of wealth is held by 20% of the total population.

**Python implementation to understand Power Law Distribution:**

```
import numpy as np
import matplotlib.pyplot as plt
# Parameters for the Power Law distribution
alpha = 2.5 # shape parameter
xmin = 1.0 # minimum value
# Generate data points for x-axis
x = np.linspace(xmin, 10, 1000) # Range of x values (the density is defined for x >= xmin)
# Calculate the probability density function (PDF) for each data point
pdf = (alpha-1) * xmin**(alpha-1) * (1/x)**alpha
# Plot the PDF
plt.plot(x, pdf, color='blue', label='Power Law PDF')
# Add labels and title
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Power Law Probability Density Function')
plt.legend()
plt.grid(True)
# Show the plot
plt.show()
```

**Pareto Distribution:**

- It is named after the Italian civil engineer and economist Vilfredo Pareto.
- It is based on the power law probability distribution.
- It is closely associated with the 80-20 rule (the Pareto principle).
- Example: a large portion of wealth is held by a small fraction of the population
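The link to the 80-20 rule can be made concrete. For a Pareto distribution with shape α > 1, the share of the total held by the top fraction p of the population is p^(1 − 1/α), a standard result that is not derived in these notes. The sketch below uses it to find the shape value that gives exactly an 80-20 split; the function name `top_share` is an assumption for this illustration.

```python
import math


def top_share(p, alpha):
    """Share of the total held by the top fraction p under a Pareto(alpha) model (alpha > 1)."""
    return p ** (1 - 1 / alpha)


# Shape parameter for which the top 20% hold 80% of the total: alpha = log(5) / log(4)
alpha_80_20 = math.log(5) / math.log(4)
print(round(alpha_80_20, 3))                  # 1.161
print(round(top_share(0.2, alpha_80_20), 3))  # 0.8

# With a larger shape parameter the distribution is less concentrated
print(round(top_share(0.2, 2.5), 3))          # 0.381
```

So the 80-20 split corresponds to one particular shape value; other values of α give other splits.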

**Python implementation to understand Pareto Distribution:**

```
import numpy as np
import matplotlib.pyplot as plt
# Parameters for the Pareto distribution
alpha = 2.5 # shape parameter
xm = 1.0 # scale parameter (minimum value)
# Generate data points for x-axis
x = np.linspace(xm, 10, 1000) # Range of x values (the Pareto PDF is zero below xm, so start at xm)
# Calculate the probability density function (PDF) for each data point
pdf = (alpha * xm**alpha) / (x**(alpha+1))
# Plot the PDF
plt.plot(x, pdf, color='red', label='Pareto PDF')
# Add labels and title
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Pareto Probability Density Function')
plt.legend()
plt.grid(True)
# Show the plot
plt.show()
```

**BOX COX Transformation:**

The statisticians George Box and David Cox developed a procedure to identify an appropriate exponent (Lambda, λ) to use to transform data into a “normal shape.” The Lambda value indicates the power to which all data should be raised.
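For a given exponent Lambda (λ), the one-parameter Box-Cox transform is usually written as y = (x^λ − 1) / λ for λ ≠ 0 and y = ln(x) for λ = 0, and it requires strictly positive data. The sketch below implements that formula by hand and compares it with `scipy.stats.boxcox` called with a fixed `lmbda`. The helper name `box_cox` and the sample values are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def box_cox(x, lam):
    """One-parameter Box-Cox transform of strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)          # limiting case as lambda -> 0
    return (x**lam - 1) / lam


x = np.array([1.0, 2.0, 4.0, 8.0])  # hypothetical positive data

manual = box_cox(x, 0.5)
library = stats.boxcox(x, lmbda=0.5)  # with lmbda given, only the transformed data is returned
print(np.allclose(manual, library))   # True

# lambda = 0 reduces to the natural log transform
print(np.allclose(box_cox(x, 0), np.log(x)))  # True
```

When `lmbda` is omitted, `scipy.stats.boxcox` also estimates λ from the data and returns it alongside the transformed values, which is how the implementation in this article uses it.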

**Python implementation for box cox transformation from non-normal data to normal data:**

```
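# Imports repeated so this block also runs on its own
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Creating non-normal (exponential) data, as we don't have real data
Non_normal_dis_data = np.random.exponential(10, 1000)
plt.subplot(2, 2, 1)
sns.histplot(Non_normal_dis_data, kde=True, stat="density")  # histplot replaces the deprecated distplot
plt.title("Non_normal_dis_data")
# Transform towards a normal distribution using Box-Cox (requires strictly positive data)
Normal_dis_data, fitted_lambda = stats.boxcox(Non_normal_dis_data)
plt.subplot(2, 2, 2)
sns.histplot(Normal_dis_data, kde=True, stat="density")
plt.title("Normal distribution")
# For comparison, try a plain log transform (kept in its own variable
# so the Box-Cox result is not overwritten)
Log_dis_data = np.log(Non_normal_dis_data)
plt.subplot(2, 2, 3)
sns.histplot(Log_dis_data, kde=True, stat="density")
plt.title("Try for Normal distribution by log")
# Rescaling the subplots
plt.tight_layout()
print(f"Lambda value used for Transformation: {fitted_lambda}")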
```

Output: Lambda value used for Transformation: 0.22930731384394407