Linear Regression:


  • It is a supervised machine-learning algorithm.
  • It is used to predict the value of one variable based on the value of another variable.
  • It is used for predictive analysis.
  • It models the linear relationship between a dependent variable (y) and one or more independent variables (x).
  • This linear relationship is represented by a straight line called the regression line or best-fit line.
  • This line is the pattern that the machine has learned from the data.
  • It is used to predict outputs of quantitative type (continuous values), e.g. age, salary, price.
  • The predicted value can range from −∞ to +∞.


  1. Simple Linear Regression:
  • Formula: y = mx + c
    • y is the response or target variable
    • x is the predictor variable/input
    • m is the slope/coefficient of x
    • c is the intercept/constant
  2. Multiple Linear Regression:
  • Formula: y = m1x1 + m2x2 + … + mnxn + c
    • y is the response or target variable
    • x1, x2, x3, …, xn represent the features
    • m1, m2, m3, …, mn represent the coefficients of x1, x2, x3, …, xn respectively
    • c is the intercept/constant
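
The two formulas can be sketched in a few lines of plain Python. The function names and the numbers below are made up for illustration and are not taken from the data set used later in this post.

```python
def predict_simple(x, m, c):
    """Simple linear regression: y = m*x + c."""
    return m * x + c


def predict_multiple(features, coefficients, c):
    """Multiple linear regression: y = m1*x1 + m2*x2 + ... + mn*xn + c."""
    if len(features) != len(coefficients):
        raise ValueError("need exactly one coefficient per feature")
    return sum(m * x for m, x in zip(coefficients, features)) + c


# Hypothetical slope, intercept and inputs, chosen only to show the arithmetic
y_simple = predict_simple(x=10, m=2, c=5)                    # 2*10 + 5
y_multi = predict_multiple([10, 20, 30], [2, 1, 0.5], c=5)   # 20 + 20 + 15 + 5
print(y_simple, y_multi)
```

Training a linear regression model means finding the values of m (or m1 … mn) and c that make these predictions as close as possible to the actual values.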

Loss/Cost Function:

  • It is the function that measures how far the predicted values deviate from the actual values.
  • MSE (Mean Squared Error) is the most commonly used cost function for linear regression.
  • MSE is the mean of the squared differences between the predicted and actual values.
  • The output of MSE is a single number representing the cost.

MSE = (1/n) Σ (yi − yi pred)², where the sum runs over all n observations.

Replacing yi pred with mxi + c gives MSE = (1/n) Σ (yi − (mxi + c))².
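
A minimal sketch of the cost calculation, taking MSE as the mean of the squared differences and replacing each predicted value with m*xi + c. The function name and the toy data are assumptions for illustration only.

```python
def mean_squared_error_line(xs, ys, m, c):
    """MSE of the line y = m*x + c against the observed (x, y) pairs."""
    n = len(xs)
    squared_errors = [(y - (m * x + c)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / n


# Toy data (made up): the points lie exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

cost_perfect = mean_squared_error_line(xs, ys, m=2, c=1)   # line passes through every point
cost_worse = mean_squared_error_line(xs, ys, m=1, c=1)     # slope too small
print(cost_perfect, cost_worse)
```

The better the line fits, the smaller this single number becomes, which is why it can be used as the quantity to minimize.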

Gradient Descent:

  • It is an optimization algorithm used to find the parameter values that minimize the cost function.

  • Gradient descent updates the parameters of a cost function iteratively, in the direction of the negative gradient, until the cost function reaches its minimum.

  • It helps to find the optimal value of the slope m, which gives the best-fit line.

  • Our aim is to minimize the error between the predicted values and the actual values.

  • The gradient descent curve plots the cost function against the slope values.

  • The algorithm starts with a randomly selected value of m and uses calculus to adjust m iteratively, recalculating the cost function for each new slope.

  • It keeps moving toward lower error values and, once the minimum error is found, it creates the best-fit line using that m.

  • The randomly selected m will usually not sit at the global minimum, so we need to move down the curve. For that we use the convergence theorem.

    • N.B.: In mathematics, a convergence theorem describes the behavior of a sequence or series of values as it approaches a specific limit. Examples include the Monotone Convergence Theorem, the Cauchy convergence criterion, and the Bolzano-Weierstrass Theorem. Although there may not be a specific convergence theorem for gradient descent in linear regression, the term is used here for repeating the update of m until the cost stops decreasing.
  • The learning rate should be a small value, typically between 0.0000001 and 0.1. It controls the size of each step taken during gradient descent. Setting it too high makes the path unstable; setting it too low makes convergence slow. Setting it to zero means the model learns nothing from the gradients.
  • To find the derivative of the cost function at a given value of m, we draw a tangent to the curve at that point.

  • If the slope of the tangent is negative (downward from left to right), the derivative is negative, so we increase 𝑚 to move toward the global minimum.

  • If the slope of the tangent is positive (upward from left to right), the derivative is positive, so we decrease 𝑚 to move toward the global minimum.

                   Fig: Gradient descent algorithm
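
The update steps described above can be sketched in plain Python for simple linear regression. This is a minimal illustration, assuming MSE as the cost function, a fixed starting point of zero instead of a random one, and made-up data; the learning rate and iteration count are arbitrary choices that happen to work for this toy example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iterations=2000):
    """Fit y = m*x + c by minimizing MSE with batch gradient descent."""
    m, c = 0.0, 0.0          # fixed starting point (a random start also works)
    n = len(xs)
    for _ in range(n_iterations):
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to m and c
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2 / n) * sum(errors)
        # Step against the gradient: a positive derivative decreases the
        # parameter, a negative derivative increases it
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]        # exactly y = 2x + 1
m, c = gradient_descent(xs, ys)
print(round(m, 3), round(c, 3))
```

On this data the loop recovers a slope close to 2 and an intercept close to 1. Raising learning_rate too far makes the updates overshoot and diverge, while a very small value needs many more iterations.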

Assumption of Linear Regression Model:

  • Linearity: A linear relationship exists between the dependent and independent variables.

    In case of non-linearity, use a transformation such as logarithmic, exponential, square root, etc.

  • No multicollinearity: The independent variables should not be highly correlated with each other. If there is multicollinearity, it is unclear which independent variable explains the dependent variable.

  • Normality of errors: The errors are normally distributed. If not, confidence intervals may become too wide or too narrow.
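
One simple way to look for multicollinearity is the pairwise correlation between features. The sketch below computes the Pearson correlation in plain Python; the helper name and the three feature lists are made up for illustration.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up features: x2 is almost a multiple of x1, x3 is unrelated to x1
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1]
x3 = [5, 1, 4, 2, 6, 3]

r_12 = pearson_r(x1, x2)
r_13 = pearson_r(x1, x3)
print(round(r_12, 3), round(r_13, 3))
```

A correlation close to +1 or −1 between two features, as for x1 and x2 here, suggests that they carry nearly the same information.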


Advantages of Linear Regression:

  • Simple method
  • Easy to use and understand

Disadvantages of Linear Regression:

  • Very sensitive to outliers
  • Performs well only when the relationship between the variables is linear
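
The sensitivity to outliers is easy to see on a toy example. The sketch below fits a least-squares line in closed form, once on clean data and once with a single value replaced by an outlier. The helper name and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + c (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    c = mean_y - m * mean_x
    return m, c


xs = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 6, 8, 10]        # exactly y = 2x
ys_outlier = [2, 4, 6, 8, 40]      # same data, last point replaced by an outlier

m_clean, c_clean = fit_line(xs, ys_clean)
m_out, c_out = fit_line(xs, ys_outlier)
print(m_clean, c_clean)
print(m_out, c_out)
```

One changed point moves the slope from 2 to 8, because squaring the errors gives large deviations a very large weight.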

Python Implementation for Linear Regression:

Business Case: To predict total sales using features such as the money spent on advertising through individual channels.

Feature descriptions received from the domain expert:

  1. TV: Amount spent on TV advertisement.
  2. Radio: Amount spent on Radio advertisement.
  3. Newspaper: Amount spent on Newspaper advertisement.
  4. Sales: Sales of the product.

# importing basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings

# Load dataset
sales_data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /01_Linear Regression/Linear Regression Class/Data Set/Advertising.csv')

Basic Checks & Domain Analysis:

   Unnamed: 0     TV  Radio  Newspaper  Sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9
    Unnamed: 0     TV  Radio  Newspaper  Sales
195         196   38.2    3.7       13.8    7.6
196         197   94.2    4.9        8.1    9.7
197         198  177.0    9.3        6.4   12.8
198         199  283.6   42.0       66.2   25.5
199         200  232.1    8.6        8.7   13.4
       Unnamed: 0          TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000  200.000000
mean   100.500000  147.042500   23.264000   30.554000   14.022500
std     57.879185   85.854236   14.846809   21.778621    5.217457
min      1.000000    0.700000    0.000000    0.300000    1.600000
25%     50.750000   74.375000    9.975000   12.750000   10.375000
50%    100.500000  149.750000   22.900000   25.750000   12.900000
75%    150.250000  218.825000   36.525000   45.100000   17.400000
max    200.000000  296.400000   49.600000  114.000000   27.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   Radio       200 non-null    float64
 3   Newspaper   200 non-null    float64
 4   Sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB
(200, 5)
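
The five outputs above look like the results of the usual pandas inspection calls (head, tail, describe, info and shape), although the calls themselves are not shown. A minimal sketch, assuming those calls and using only the first five rows copied from the output so that it runs without the CSV file:

```python
import pandas as pd

# First five rows of the data set, copied from the output above, so the
# sketch runs without the CSV file
sales_data = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4, 5],
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

print(sales_data.head())          # first rows
print(sales_data.tail())          # last rows
print(sales_data.describe())      # count, mean, std, min, quartiles, max
sales_data.info()                 # column dtypes and non-null counts (prints directly)
print(sales_data.shape)           # (rows, columns)
print(sales_data.isnull().sum())  # missing values per column
```

On the full data set the shape would be (200, 5) as shown above; here it is (5, 5) because only five rows are included.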

Observation from basic checks

  1. The column Unnamed: 0 is an ID column and is not needed for modeling, so we will drop it.
  2. The data set has 200 rows and 5 columns.
  3. There are 3 independent variables (TV, Radio, Newspaper).
  4. There is 1 dependent (target) variable, Sales, which depends on the 3 independent variables.
  5. The four variables are not on the same scale, so scaling is required.
  6. No missing values are present in the data set.
  7. Spending on TV advertisement is high compared to Radio and Newspaper.
  8. No categorical values are present in the data set.

Exploratory Data Analysis:

Step 01: Univariate Analysis

# Analyzing TV

# Analyzing Radio

# Analyzing Newspaper

Observation from Univariate Analysis

  1. No clear pattern in the TV and Radio data
  2. Newspaper data is right-skewed
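
The skew can also be checked numerically from the describe() output shown earlier, without any plot. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / std, which is a rough indicator chosen here for illustration and not necessarily what the original plots were based on. The 50% row of describe() is the median.

```python
# Summary statistics copied from the describe() output earlier in this post
summary = {
    "TV":        {"mean": 147.0425, "median": 149.75, "std": 85.854236},
    "Radio":     {"mean": 23.264,   "median": 22.9,   "std": 14.846809},
    "Newspaper": {"mean": 30.554,   "median": 25.75,  "std": 21.778621},
}


def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


skewness = {name: pearson_median_skewness(**stats) for name, stats in summary.items()}
for name, value in skewness.items():
    print(f"{name}: {value:+.2f}")
```

TV and Radio come out close to zero, while Newspaper is clearly positive, which is consistent with a right-skewed distribution.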

Step 02: Bivariate Analysis

# Analyzing TV And Sales


# Analyzing Radio And Sales


# Analyzing Newspaper And Sales


Observation from Bivariate Analysis

  1. Higher spending on TV marketing is associated with higher sales of the product.
  2. No specific trend is visible between Radio advertising and sales.
  3. No specific trend is visible between Newspaper advertising and sales.
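
The plotting code for this step is not shown, so the following is only a sketch of how each feature could be plotted against Sales. It uses plain matplotlib (seaborn's scatterplot would work as well) and only the ten rows visible in the head and tail output, so the picture will differ from one drawn on the full data set.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Ten rows copied from the head/tail output earlier in this post
sales_data = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8, 38.2, 94.2, 177.0, 283.6, 232.1],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8, 3.7, 4.9, 9.3, 42.0, 8.6],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 13.8, 8.1, 6.4, 66.2, 8.7],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.6, 9.7, 12.8, 25.5, 13.4],
})

features = ["TV", "Radio", "Newspaper"]
fig, axes = plt.subplots(1, len(features), figsize=(12, 4), sharey=True)
for ax, feature in zip(axes, features):
    ax.scatter(sales_data[feature], sales_data["Sales"])   # one panel per feature
    ax.set_xlabel(feature)
axes[0].set_ylabel("Sales")
fig.tight_layout()
```

In a notebook the figure is displayed automatically once the cell finishes.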

Step 03: Multivariate Analysis

# Analyzing all 3 independent variables together with the Sales variable

sns.pairplot(sales_data.drop('Unnamed: 0',axis=1))

Data Preprocessing and Feature Engineering:

Step 01: Imputing Missing values

  • As there are no missing values in the data set, we are skipping this step

Step 02: Converting categorical data to numerical data

  • As there is no categorical data, we are skipping this step

Step 03: Checking & handling Outliers

# Checking for TV data 


# Checking for Radio data 


# Checking for Newspaper data 


Observation from Checking Outlier

  • Only the Newspaper data has outliers

  • We are not removing these outliers for now
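
The same conclusion can be reached from the describe() output alone. The sketch below applies the 1.5 × IQR rule that box plots commonly use, which is an assumption here because the original outlier plots are not shown.

```python
# Quartiles and extremes copied from the describe() output earlier in this post
stats = {
    "TV":        {"min": 0.7, "q1": 74.375, "q3": 218.825, "max": 296.4},
    "Radio":     {"min": 0.0, "q1": 9.975,  "q3": 36.525,  "max": 49.6},
    "Newspaper": {"min": 0.3, "q1": 12.75,  "q3": 45.1,    "max": 114.0},
}


def has_outliers(min_value, q1, q3, max_value, k=1.5):
    """True if the min or max falls outside the k*IQR fences (box-plot rule)."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return min_value < lower_fence or max_value > upper_fence


flags = {
    name: has_outliers(s["min"], s["q1"], s["q3"], s["max"])
    for name, s in stats.items()
}
print(flags)
```

For Newspaper the upper fence is 45.1 + 1.5 × 32.35 = 93.625, and the maximum of 114.0 lies beyond it. TV and Radio stay inside their fences.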

Step 04: Scaling down the continuous variables

  • Although our variables are not all on the same scale, we are not applying a scaling technique here, to keep this blog simple

Step 05: Transformation

  • As our data is not normally distributed, we should transform it toward a normal distribution. But we are not doing that here, to keep this blog simple

Feature Selection:

Step 01: Dropping the unwanted variables

sales_data.drop('Unnamed: 0',axis=1,inplace=True)
        TV  Radio  Newspaper  Sales
0    230.1   37.8       69.2   22.1
1     44.5   39.3       45.1   10.4
2     17.2   45.9       69.3    9.3
3    151.5   41.3       58.5   18.5
4    180.8   10.8       58.4   12.9
..     ...    ...        ...    ...
195   38.2    3.7       13.8    7.6
196   94.2    4.9        8.1    9.7
197  177.0    9.3        6.4   12.8
198  283.6   42.0       66.2   25.5
199  232.1    8.6        8.7   13.4

[200 rows x 4 columns]

Step 02: Checking the Correlation

  • We will use a heatmap here

sns.heatmap(sales_data.drop("Sales",axis=1).corr(),annot=True) # dropping Sales as it is the output

Observation from correlation

  • No feature is highly correlated with any other feature

  • So we will use all 3 features as input

Model Creation:

Step 01: Creating independent & dependent variable

  • The independent variables are commonly represented by X

  • The dependent variable is commonly represented by y

X = sales_data.iloc[:,0:3]
y = sales_data.iloc[:,3]

Step 02: Creating Training & Testing data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

Rule of Thumb: Always split your dataset into train and test BEFORE any preprocessing that involves the entire dataset to avoid data leakage.
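
As an illustration of this rule, here is a sketch of scaling done after the split. Scaling is skipped in this post, so this is only an example of the pattern; StandardScaler is one possible choice, and the data is a synthetic stand-in, not the advertising data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three advertising features (not the real data)
rng = np.random.default_rng(0)
X = rng.uniform(low=[0, 0, 0], high=[300, 50, 115], size=(200, 3))
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics
```

Fitting the scaler on the full data set before splitting would let statistics of the test rows leak into training.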

Step 03: Creating Model

from sklearn.linear_model import LinearRegression

model = LinearRegression()      # object creation
model.fit(X_train, y_train)     # training linear regression

y_predict = model.predict(X_test)

# See predicted y values & actual y values
print(y_predict)
print(y_test.values)

[15.18887309 10.2054111  16.43931961 21.80818887 15.88752137  8.92680199
 18.13567301 11.36589433 17.39755473  8.66950442 11.4822015   9.719351
 12.1396776  19.13491661 16.94206504  6.52793621 14.05605199  7.77833624
 21.09549852 12.35393889 19.24140535  7.51159355 17.35753103 10.14557775
 17.14293028  7.03827428 20.44646647 12.24372302 15.01515604 14.31985601
 23.18859932 20.39708782 19.89616957 16.52262551  9.97604212 10.09042996
 16.8580678  18.25948647 13.17938188 19.53806065]
[14.9  8.8 16.6 23.8 12.   9.7 19.  11.8 18.5  8.5 10.8 10.1 11.7 17.4
 15.7  8.7 14.1  9.7 22.3 10.8 19.6  7.6 12.8 10.1 17.3  8.6 20.7 11.7
 15.  14.5 25.4 22.1 19.8 17.3 11.6 11.3 18.  15.  12.9 18.9]
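
A fitted LinearRegression object exposes the learned slopes and intercept as coef_ and intercept_, which correspond to m1 … mn and c in the formula from the beginning of this post. The sketch below uses synthetic data generated from a known equation, so it does not reproduce the coefficients of the sales model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known equation: y = 2*x1 + 3*x2 + 1
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression()
model.fit(X, y)

print(model.coef_)        # learned m1, m2
print(model.intercept_)   # learned c

# A prediction is just the formula applied to the learned parameters
x_new = np.array([[4.0, 5.0]])
manual = x_new @ model.coef_ + model.intercept_
print(manual, model.predict(x_new))
```

Because the synthetic data contains no noise, the learned values match the generating equation up to floating-point error.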

Model Evaluation:

from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error

# R2 score
r2 = r2_score(y_test,y_predict)
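# Shape of the test set: n = 40 observations, k = 3 predictors
X_test.shape
(40, 3)
# Adjusted R2 score
n, k = X_test.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)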

Adjusted R2 = 1 – [(1 – R2) × (n – 1) / (n – k – 1)]

where:

  • R2: the R2 score of the model
  • n: the number of observations
  • k: the number of predictor variables
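
The formula can be wrapped in a small helper so that n and k do not have to be hard-coded. The function name is hypothetical, and the R2 value below is a placeholder, not the score of the model in this post.

```python
def adjusted_r2_score(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# n = 40 test rows and k = 3 predictors, as in this post; the R2 value is
# a placeholder chosen only to show the arithmetic
example = adjusted_r2_score(r2=0.90, n=40, k=3)
print(round(example, 4))
```

Adjusted R2 is never higher than R2 when R2 is below 1, and the gap grows as more predictors are added for the same number of observations.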

# Mean squared error (MSE)
MSE = mean_squared_error(y_test,y_predict)

# Root mean squared error (RMSE)
import math
RMSE = math.sqrt(MSE)

# Mean absolute error (MAE)
MAE = mean_absolute_error(y_test,y_predict)
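
To see what these metrics compute, here is a from-scratch sketch in plain Python. It uses only the first five predicted and actual values from the output shown earlier, so the numbers it prints are not the scores of the full test set.

```python
import math

# First five predicted and actual values copied from the output earlier in this post
y_pred = [15.18887309, 10.2054111, 16.43931961, 21.80818887, 15.88752137]
y_true = [14.9, 8.8, 16.6, 23.8, 12.0]

n = len(y_true)
errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n       # mean squared error
rmse = math.sqrt(mse)                       # same units as Sales
mae = sum(abs(e) for e in errors) / n       # mean absolute error

# R2 compares the squared errors with the spread of the actual values
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

RMSE is always at least as large as MAE because squaring gives more weight to large errors.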


