Exploratory Data Analysis (EDA)

  • Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
  • EDA typically involves generating summary statistics for numerical data and visualizing data distributions through histograms, box plots, scatter plots, etc.
  • It helps analysts and data scientists understand what the data can tell us beyond the formal modeling or hypothesis-testing tasks.

Univariate Analysis

  • Univariate analysis explores each variable in a data set, separately.
  • It looks at the range of values, as well as the central tendency of the values.
  • It describes the pattern of response to the variable.
  • It is quantitative data exploration that we do at the beginning of any analysis.
  • The purpose is to make data easier to interpret and to understand how data is distributed within a sample or population being studied.
  • Also helps us narrow down exactly what types of bivariate and multivariate analyses we should carry out.
  • Univariate Analysis Tools:
    1. Summary statistics -Determine the value’s center and spread. Like meanmedianstandard deviation, etc.
    2. Frequency table -This shows how frequently various values occur.
    3. Charts -A visual representation of the distribution of values. Visualizations, such as histogramsdistributionsfrequency tablesbar chartspie charts, and boxplots, are also commonly used in univariate analysis.

Bivariate Analysis:

  • Bivariate analysis is slightly more analytical than Univariate analysis.
  • When the data set contains two variables and researchers aim to undertake comparisons between the two data sets, then Bivariate analysis is the right type of analysis technique.
  • This step is performed when the inputs and outputs are known.
    • 1st variable will be the Inputs
    • 2nd variable will be the output/target variable.
  • Bivariate Analysis technique:
    • Categorical v/s Numerical – sns.barplot(x=data[‘department_name’], y=data[‘length_of_service’])

    • Numerical v/s Numerical – sns.scatterplot(x=data[‘length_of_service’],y=data[‘age’])

    • Categorical v/s Categorical – sns.countplot(data[‘STATUS_YEAR’],hue=data[‘STATUS’])

Multivariate Analysis

  • It analyzes more than two variables at the same time to understand relationships or patterns.
  • Helps find how variables influence each other or group together.
  • Common methods like Plot a pair plot with Hue
    • sns.heatmap(data.corr(), annot=True)

Python Implementation for EDA:

Importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Loading Dataset

data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /07_Support Vector Machines/SVM Class /Test_loan_approved.csv')

See the first five rows

data.head()
Output:
    Loan_ID Gender Married     Education Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History Loan_Status (Approved)
0  LP001002   Male      No      Graduate            No         NaN             360.0             1.0                      Y
1  LP001003   Male     Yes      Graduate            No       128.0             360.0             1.0                      N
2  LP001005   Male     Yes      Graduate           Yes        66.0             360.0             1.0                      Y
3  LP001006   Male     Yes  Not Graduate            No       120.0             360.0             1.0                      Y
4  LP001008   Male      No      Graduate            No       141.0             360.0             1.0                      Y

See the last five rows

data.tail()
Output:
     Loan_ID  Gender Married Education Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History Loan_Status (Approved)
609  LP002978  Female      No  Graduate            No        71.0             360.0             1.0                      Y
610  LP002979    Male     Yes  Graduate            No        40.0             180.0             1.0                      Y
611  LP002983    Male     Yes  Graduate            No       253.0             360.0             1.0                      Y
612  LP002984    Male     Yes  Graduate            No       187.0             360.0             1.0                      Y
613  LP002990  Female      No  Graduate           Yes       133.0             360.0             0.0                      N

See the summary of data

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Loan_ID                 614 non-null    object 
 1   Gender                  601 non-null    object 
 2   Married                 611 non-null    object 
 3   Education               614 non-null    object 
 4   Self_Employed           582 non-null    object 
 5   LoanAmount              592 non-null    float64
 6   Loan_Amount_Term        600 non-null    float64
 7   Credit_History          564 non-null    float64
 8   Loan_Status (Approved)  614 non-null    object 
dtypes: float64(3), object(6)
memory usage: 43.3+ KB

Insights:

  1. Data contain object, integer & float data
  2. There is no null Value

See summary Statistics for numerical variables

data.describe()
Output:
        LoanAmount  Loan_Amount_Term  Credit_History
count  592.000000         600.00000      564.000000
mean   146.412162         342.00000        0.842199
std     85.587325          65.12041        0.364878
min      9.000000          12.00000        0.000000
25%    100.000000         360.00000        1.000000
50%    128.000000         360.00000        1.000000
75%    168.000000         360.00000        1.000000
max    700.000000         480.00000        1.000000

Note: If you find that the standard deviation of a variable is zero, The variable may not be useful for analysis or modeling, 

See summary Statistics for Categorical variables

data.describe(include="O")
Output:
        Loan_ID Gender Married Education Self_Employed Loan_Status (Approved)
count        614    601     611       614           582                    614
unique       614      2       2         2             2                      2
top     LP001002   Male     Yes  Graduate            No                      Y
freq           1    489     398       480           500                    422

Insights:

  1. Male loans are approved the most
  2. Married people most loan approval

Count the number of unique values in each column

data.nunique()
Loan_ID                   614
Gender                      2
Married                     2
Education                   2
Self_Employed               2
LoanAmount                203
Loan_Amount_Term           10
Credit_History              2
Loan_Status (Approved)      2

Checking for missing values (null values)

data.isnull().sum()
Loan_ID                    0
Gender                    13
Married                    3
Education                  0
Self_Employed             32
LoanAmount                22
Loan_Amount_Term          14
Credit_History            50
Loan_Status (Approved)     0

Counting the occurrences of each unique value in a Series or a specific column of a DataFrame for a categorical feature

data['Married'].value_counts()
Output:
Yes    398
No     213
Name: Married, dtype: int64

Use a Python package to get an EDA report

  • Univariate analysis–sweetviz
# install sweetviz

!pip install sweetviz
Output:
Collecting sweetviz
Downloading sweetviz-2.3.1-py3-none-any.whl.metadata (24 kB)-----
--------------Successfully installed sweetviz-2.3.1
import sweetviz as sv #  library for univariant analysis

my_report1 = sv.analyze(data)## pass the original dataframe

my_report1.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"
Output:
Done! Use 'show' commands to display/save.   
 [100%]   00:00 -> (00:00 left)
Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

  • Bivaraite analysis–Autoviz
# install autoviz

!pip install autoviz
Output:
Collecting autoviz
Downloading autoviz.........
from autoviz import AutoViz_Class
AV = AutoViz_Class()

bivariate_report = AV.AutoViz('/content/drive/SVM Class /Test_loan_approved.csv',verbose=1)
Output:
Imported v0.1.905. Please call AutoViz in this sequence:
AV = AutoViz_Class()
%matplotlib inline
dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)
Shape of your Data Set loaded: (614, 9)

########## C L A S S I F Y I N G V A R I A B L E S ##############
Classifying variables in data set...
Number of Numeric Columns = 2
Number of Integer-Categorical Columns = 0
Number of String-Categorical Columns = 0
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 5
Number of Numeric-Boolean Columns = 1
Number of Discrete String Columns = 0
Number of NLP String Columns = 0
Number of Date Time Columns = 0
Number of ID Columns = 1
Number of Columns to Delete = 0
9 Predictors classified...
1 variable(s) removed since they were ID or low-information variables
List of variables removed: ['Loan_ID']
To fix these data quality issues in the dataset, import FixDQ from autoviz...
All variables classified into correct types.
Number of All Scatter Plots = 3
All Plots done
Time to run AutoViz = 2 seconds

###################### AUTO VISUALIZATION Completed ########################

Manual Plotting:

#For Numerical data

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,7), facecolor='white')#To set canvas
plotnumber = 1#counter

dataN = data[['LoanAmount',"Loan_Amount_Term","Credit_History"]] # create new dataframe with numerical column

for column in dataN.columns:#accessing the columns
    if plotnumber<=4 :
        ax = plt.subplot(2,2,plotnumber)
        sns.histplot(x=dataN[column],hue=data['Loan_Status (Approved)'])
        plt.xlabel(column,fontsize=10)#assign name to x-axis and set font-20
        plt.ylabel('Loan Status',fontsize=10)
        plt.title('Loan status')
        plotnumber+=1#counter increment
plt.tight_layout()
plt.show()

# For Categorical data

dataC = data[['Gender',"Married","Education","Self_Employed"]] # create new dataframe with numerical column

plt.figure(figsize=(7,8), facecolor='white')#To set canvas
plotnumber = 1#counter

for column in dataC:#accessing the columns
    ax = plt.subplot(3,3,plotnumber)
    sns.countplot(x=dataC[column],hue=data['Loan_Status (Approved)'])
    plt.xlabel(column,fontsize=10)#assign name to x-axis and set font-20
    plt.ylabel('Loan Status',fontsize=10)
    plotnumber+=1#counter increment
plt.tight_layout()

Multivariate Analysis

sns.pairplot(data.drop('Loan_ID',axis=1))

Register

Login here

Forgot your password?

ads

ads

I am an enthusiastic advocate for the transformative power of data in the fashion realm. Armed with a strong background in data science, I am committed to revolutionizing the industry by unlocking valuable insights, optimizing processes, and fostering a data-centric culture that propels fashion businesses into a successful and forward-thinking future. - Masud Rana, Certified Data Scientist, IABAC

Social Profile

© Data4Fashion 2023-2025

Developed by: Behostweb.com

Please accept cookies
Accept All Cookies