Linear Regression:
What:
- It is a supervised machine-learning algorithm.
- Used to predict the value of a variable based on the value of one or more other variables.
- Used for predictive analysis.
- Used to determine the linear relationship between the dependent variable (y) and the independent variables (x).
- This linear relationship is represented by a straight line called the regression line or best-fit line.
- This line is the pattern the machine has learned from the data.
- Used for predicting a quantitative (continuous) output, e.g. age, salary, price, etc.
- The regression line's range is -∞ to +∞.
Types:
- Simple Linear Regression:
- Formula: y = mx + c
Where,
- y is the response or target variable
- x is the predictor variable/input
- m is the slope/coefficient of x
- c is the intercept/constant
- Multiple Linear Regression:
- Formula: y = m1x1 + m2x2 + ... + mnxn + c
Where,
- y is the response or target variable
- x1, x2, x3, ..., xn represent the features
- m1, m2, m3, ..., mn represent the coefficients of x1, x2, x3, ..., xn respectively
- c is the intercept/constant
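For concreteness, here is a minimal sketch of these formulas in Python; the slope, coefficient and intercept values are made up for illustration and not fitted from any data:
import numpy as np
# Simple linear regression: y = m*x + c (illustrative values, not fitted)
m, c = 2.5, 1.0
x = 4.0
y_simple = m * x + c  # 2.5*4 + 1 = 11.0
# Multiple linear regression: y = m1*x1 + m2*x2 + ... + mn*xn + c
coefficients = np.array([0.05, 0.20, 0.01])  # m1, m2, m3 (hypothetical)
features = np.array([230.1, 37.8, 69.2])     # x1, x2, x3 (one observation)
intercept = 3.0                               # c (hypothetical)
y_multiple = np.dot(coefficients, features) + intercept
print(y_simple, y_multiple)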
Loss/Cost Function:
- It is the function that signifies how much the predicted values deviate from the actual values.
- MSE (Mean Squared Error) is the most commonly used cost function for linear regression.
- MSE is the mean of the squared differences between the predicted and actual values.
- The output of MSE is a single number representing the cost.
- MSE = (1/n) * Σ (yi - yi_pred)^2; replacing yi_pred with mxi + c gives MSE = (1/n) * Σ (yi - (mxi + c))^2.
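As a quick illustration, a minimal sketch of the MSE calculation with small hand-made arrays (the numbers are assumptions, not taken from the dataset used later):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y_actual = np.array([3.0, 5.0, 7.0, 9.0])   # actual values (hypothetical)
m, c = 1.8, 0.5                              # candidate slope and intercept
y_pred = m * x + c                           # y_pred_i = m*x_i + c
mse = np.mean((y_actual - y_pred) ** 2)      # single number representing the cost
print(mse)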
Gradient Descent:
It is an optimization algorithm used to find the optimal values of the parameters that minimize the cost function.
Updating the parameters of the cost function iteratively in the direction of the negative gradient, until the minimum of the cost function is reached, is called the gradient descent algorithm.
It helps to find the optimal value of the slope m, which gives the best-fit line.
Our aim is to minimize the error between the predicted values and the actual values.
The gradient descent curve plots the cost function against the slope values.
The algorithm starts with a randomly selected m value and from there uses calculus (derivatives) to iteratively adjust m, calculating the cost function for each slope.
It then searches across these error values for the minimum error and builds the best-fit line using that m.
The randomly selected m might not correspond to the global minimum, so we need to move down the curve; for that we repeat the update step until the algorithm converges.
- N.B.: A convergence theorem is a mathematical result describing how a sequence or series of values approaches a specific limit (examples include the Monotone Convergence Theorem, the Cauchy Convergence Criterion, and the Bolzano-Weierstrass Theorem). For gradient descent, "convergence" simply means that the repeated updates of m (and c) eventually settle at values where the cost stops decreasing.
- The learning rate should be a small value, typically between 0.1 and 0.0000001. It controls the size of the step the gradient takes during gradient descent. Setting it too high makes the path unstable; setting it too low makes convergence slow. Setting it to zero means the model does not learn anything from the gradients.
To find the derivative of the cost at a given slope value, we draw a tangent at that point on the curve.
If the right-hand side of the tangent points downwards, the tangent has a negative slope, so the derivative at that point is negative. Hence we need to increase the m value to move towards the global minimum.
If the right-hand side of the tangent points upwards, the tangent has a positive slope, so the derivative at that point is positive. Hence we need to decrease the m value to move towards the global minimum.
Fig: Gradient descent algorithm
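Below is a minimal sketch of gradient descent for simple linear regression; the toy data, learning rate and iteration count are assumptions chosen only for illustration:
import numpy as np
# Toy data roughly following y = 2x + 1 (hypothetical)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
m, c = 0.0, 0.0        # start from arbitrary parameter values
learning_rate = 0.01   # small step size, as discussed above
n = len(x)
for _ in range(2000):
    y_pred = m * x + c
    dm = (-2 / n) * np.sum(x * (y - y_pred))  # derivative of MSE w.r.t. m
    dc = (-2 / n) * np.sum(y - y_pred)        # derivative of MSE w.r.t. c
    m = m - learning_rate * dm                # move opposite to the gradient
    c = c - learning_rate * dc
print(m, c)  # should approach the best-fit slope and intercept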
Assumptions of the Linear Regression Model:
Linearity: a linear relationship exists between the dependent and independent variables.
In case of non-linearity, use a transformation such as logarithmic, exponential, square root, etc.
No multicollinearity: if there is multicollinearity, it is unclear which independent variable explains the dependent variable.
Normality of errors: errors should be normally distributed; if not, confidence intervals may become too wide or too narrow. A small sketch of how to check these assumptions is shown below.
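A minimal sketch of how these assumptions could be checked; the VIF here is computed manually with scikit-learn, and the names X and y are assumed to hold the features and target defined later in the walkthrough:
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X):
    # VIF_i = 1 / (1 - R^2_i), regressing feature i on the remaining features
    scores = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

# Usage (assuming X and y are the feature DataFrame and target Series):
# print(vif(X))                                           # values well above ~5-10 hint at multicollinearity
# residuals = y - LinearRegression().fit(X, y).predict(X)
# residuals.plot(kind='hist')                             # roughly bell-shaped if errors are normal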
Pros:
- Simple method
- Easy to use and understand
Cons:
- Very sensitive to outliers
- Performs well only when the relationship between the variables is approximately linear
Python Implementation for Linear Regression:
Business Case: To predict total sales using features describing the money spent on marketing for individual channels.
Feature descriptions received from the domain expert:
- TV :- Amount spent on TV advertisement.
- Radio :- Amount spent on Radio advertisement.
- Newspaper :- Amount spent on Newspaper advertisement.
- Sales :- Sales of the product.
# importing basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Load dataset
sales_data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /01_Linear Regression/Linear Regression Class/Data Set/Advertising.csv')
Basic Checks & Domain Analysis:
sales_data.head()
Output:
   Unnamed: 0     TV  Radio  Newspaper  Sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9
sales_data.tail()
Output:
     Unnamed: 0     TV  Radio  Newspaper  Sales
195         196   38.2    3.7       13.8    7.6
196         197   94.2    4.9        8.1    9.7
197         198  177.0    9.3        6.4   12.8
198         199  283.6   42.0       66.2   25.5
199         200  232.1    8.6        8.7   13.4
sales_data.describe()
Output:
       Unnamed: 0          TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000  200.000000
mean   100.500000  147.042500   23.264000   30.554000   14.022500
std     57.879185   85.854236   14.846809   21.778621    5.217457
min      1.000000    0.700000    0.000000    0.300000    1.600000
25%     50.750000   74.375000    9.975000   12.750000   10.375000
50%    100.500000  149.750000   22.900000   25.750000   12.900000
75%    150.250000  218.825000   36.525000   45.100000   17.400000
max    200.000000  296.400000   49.600000  114.000000   27.000000
sales_data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  200 non-null    int64
 1   TV          200 non-null    float64
 2   Radio       200 non-null    float64
 3   Newspaper   200 non-null    float64
 4   Sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB
sales_data.shape
Output: (200, 5)
Observations from basic checks
- The column 'Unnamed: 0' is an ID column that is not needed for modeling, so we will drop it
- Number of rows is 200 and number of columns is 5
- 3 independent variables (TV, Radio, Newspaper)
- 1 dependent or target variable, Sales, which depends on the other 3 independent variables
- The four variables are not on the same scale, so scaling would be required
- No missing values are present in the data set
- Expense on TV advertisement is high compared to Radio & Newspaper
- No categorical values are present in the data set
Exploratory Data Analysis:
Step 01:Univariate Analysis
# Analyzing TV
sns.histplot(x=sales_data["TV"],kde=True)
# Analyzing Radio
sns.histplot(x=sales_data["Radio"],kde=True)
# Analyzing Newspaper
sns.histplot(x=sales_data["Newspaper"],kde=True)
Observations from Univariate Analysis
- No clear pattern in the TV & Radio data
- The Newspaper data is right skewed
Step 02:Bivariate Analysis
# Analyzing TV And Sales
sns.relplot(x='TV',y='Sales',data=sales_data)
# Analyzing Radio And Sales
sns.relplot(x='Radio',y='Sales',data=sales_data)
# Analyzing Newspaper And Sales
sns.relplot(x='Newspaper',y='Sales',data=sales_data)
Observations from Bivariate Analysis
- The marketing spend on TV is leading to more sales of the product
- No specific trend shows for Radio advertising against sales
- No specific trend shows for Newspaper advertising against sales
Step 03:Multivariate Analysis
# Analyzing all 3 independent variables against the Sales variable
sns.pairplot(sales_data.drop('Unnamed: 0',axis=1))
Data Preprocessing and Feature Engineering:
Step 01: Imputing Missing values
- As there are no missing values in the data set, we are skipping this step
Step 02: Converting categorical data to numerical data
- As there is no categorical data, we are skipping this step
Step 03: Checking & handling Outliers
# Checking for TV data
sns.boxplot(x='TV',data=sales_data)
# Checking for Radio data
sns.boxplot(x='Radio',data=sales_data)
# Checking for Newspaper data
sns.boxplot(x='Newspaper',data=sales_data)
Observations from Checking Outliers
Only the Newspaper data has outliers
We are not removing these outliers for now
Step 04: Scaling down the continuous variable
- Although our variables are not all on the same scale, we are not applying a scaling technique here to keep this blog simpler (a sketch is shown below)
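For reference, a minimal sketch of how scaling could be applied with scikit-learn's StandardScaler (not used in the rest of this walkthrough):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Each column is rescaled to zero mean and unit variance
scaled_features = scaler.fit_transform(sales_data[['TV', 'Radio', 'Newspaper']])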
Step 05: Transformation
- As our data is not normally distributed, we should transform it towards a normal distribution, but we are not doing so here to keep this blog simpler (a sketch is shown below)
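Similarly, a log transformation of the right-skewed Newspaper column would be one option; a sketch of what that could look like (not applied here):
import numpy as np
# log1p handles zero values safely; this would replace the skewed Newspaper column
newspaper_log = np.log1p(sales_data['Newspaper'])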
Feature Selection:
Step 01: Dropping the unwanted variables
sales_data.drop('Unnamed: 0',axis=1,inplace=True)
sales_data
Output: TV Radio Newspaper Sales 0 230.1 37.8 69.2 22.1 1 44.5 39.3 45.1 10.4 2 17.2 45.9 69.3 9.3 3 151.5 41.3 58.5 18.5 4 180.8 10.8 58.4 12.9 .. ... ... ... ... 195 38.2 3.7 13.8 7.6 196 94.2 4.9 8.1 9.7 197 177.0 9.3 6.4 12.8 198 283.6 42.0 66.2 25.5 199 232.1 8.6 8.7 13.4 [200 rows x 4 columns]
Step 02: Checking the Correlation
- We will use heatmap here
sns.heatmap(sales_data.drop("Sales",axis=1).corr(),annot=True) # dropping Sales as it is the output/target
Observations from the correlation check
No feature is highly correlated with another feature
So we will use all 3 features as input
Model Creation:
Step 01: Creating independent & dependent variables
Conventionally, the independent variables are represented by X
Conventionally, the dependent variable is represented by y
X = sales_data.iloc[:,0:3]
y = sales_data.iloc[:,3]
Step 02: Creating Training & Testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2 ,random_state=45)
Step 03: Creating Model
from sklearn.linear_model import LinearRegression
model = LinearRegression() # object creation
model.fit(X_train,y_train) # training linear regression
y_predict= model.predict(X_test)
# See predicted y values & actual y value
print(y_predict)
print('----------------------------------')
print(np.array(y_test))
Output:
[15.18887309 10.2054111 16.43931961 21.80818887 15.88752137 8.92680199 18.13567301 11.36589433 17.39755473 8.66950442 11.4822015 9.719351 12.1396776 19.13491661 16.94206504 6.52793621 14.05605199 7.77833624 21.09549852 12.35393889 19.24140535 7.51159355 17.35753103 10.14557775 17.14293028 7.03827428 20.44646647 12.24372302 15.01515604 14.31985601 23.18859932 20.39708782 19.89616957 16.52262551 9.97604212 10.09042996 16.8580678 18.25948647 13.17938188 19.53806065]
----------------------------------
[14.9 8.8 16.6 23.8 12. 9.7 19. 11.8 18.5 8.5 10.8 10.1 11.7 17.4 15.7 8.7 14.1 9.7 22.3 10.8 19.6 7.6 12.8 10.1 17.3 8.6 20.7 11.7 15. 14.5 25.4 22.1 19.8 17.3 11.6 11.3 18. 15. 12.9 18.9]
y_predict.shape
Output: (40,)
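To connect the fitted model back to the formula y = m1x1 + m2x2 + m3x3 + c, the learned coefficients and intercept can be inspected (output values not shown here):
# Learned slopes for TV, Radio and Newspaper, and the intercept c
print(model.coef_)
print(model.intercept_)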
Model Evaluation:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
#r2 Score
r2 = r2_score(y_test,y_predict)
r2
Output: 0.8955882331233612
X_test.shape
Output: (40, 3)
#Adjusted r2 score
adjusted_r2 = 1-(1-r2)*(40-1)/(40-3-1)
adjusted_r2
Output: 0.886887252550308
Adjusted R2 = 1 – [(1-R2)*(n-1)/(n-k-1)]
where:
- R2: The R2 of the model
- n: The number of observations
- k: The number of predictor variables
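The same calculation can be wrapped into a small helper function; a sketch reproducing the hand computation above:
def adjusted_r2_score(r2, n_observations, n_predictors):
    # Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)

print(adjusted_r2_score(r2, X_test.shape[0], X_test.shape[1]))  # same value as above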
# mean Square Error(MSE)
MSE = mean_squared_error(y_test,y_predict)
MSE
Output: 2.256494247280935
# root mean square error(RMSE)
import math
RMSE = math.sqrt(MSE)
RMSE
Output: 1.5021631892976657
# mean absolute error
MAE = mean_absolute_error(y_test,y_predict)
MAE
Output: 1.0788802763848646