What is Pandas:

  • Pandas is a python library used for working with data sets

  • It has functions for analyzing, cleaning, exploring, and manipulating data

  • The name “Pandas” has a reference to both ” Panel Data” and “Python Data Analysis .

  • It was created by Wes McKinney in 2008

Why Use Pandas?

  • Pandas allow us to analyze big data and make conclusions based on statistical theories

  • Pandas can clean messy data sets, and make them readable and relevant.

Installation of Pandas:

! pip install pandas

Import Pandas:

# Use keyword 'import'

import pandas

import pandas as pd #pd is common alias for pandas

# In python alias are an alternate name for referring to the same thing

Checking Pandas Version:

# The version string is stored under _version_ attribute

pd.__version__

#pd.__version__
Output: 
1.3.5

Pandas Series:

  • A Pandas Series is like a column in a table

  • It is a one-dimensional array holding data of any type

Creating Series:

# Creating empty series

s1 = pd.Series()
s1
Output:
Series([], dtype: float64)
# Creating series with one element

s2 = pd.Series(18)
s2
Output:
0    18
dtype: int64
# Creating series using tuple
t=(12,23,45)

s3 = pd.Series(t)
s3
Output:
0    12
1    23
2    45
dtype: int64
# Creating series using list

l = [12,34,67,89]

s4 = pd.Series([12,34,67,89])
#s4 = pd.Series(l)   #both way is correct
s4
Output:
0    12
1    34
2    67
3    89
dtype: int64
# Creating series with array

import numpy as np

arr=np.array([13,21,53,54])

s5 = pd.Series(arr)
s5
Output:
0    13
1    21
2    53
3    54
dtype: int64
# Creating series using dictionary

d = {'a':23,'b':54,'c':76,'d':76}
s6 = pd.Series(d)
s6

#Note: The keys of the dictionary become the labels.
Output:
a    23
b    54
c    76
d    76
dtype: int64
# Creating series using dictionary, creating series from only labels mentioned in index
d = {'a':23,'b':54,'c':76,'d':76}
s6 = pd.Series(d,index=['b','d'])
s6
Output:
b    54
d    76
dtype: int64
# Adding index value/Create Labels

arr2 = np.array([23,34,45])

s7=pd.Series(arr2,index=['One','Two','Three'])
s7
Output:
One      23
Two      34
Three    45
dtype: int64
# Checking data type

type(s7)
Output:
pandas.core.series.Series

Accessing Data:

#Accessing by index number

s5[2]
#s5.2 , will give error
Output:
53
# Accessing data by index label, below both way can be used
s7.Two
s7['Two']
Output:
53

Slicing Operation:

# Creating Series for slicing purposes

d=pd.Series([1,2,3,5,47,98,7,8,6,32,78,2,8,289,258,78])


# Slicing by index range

d[2:10]
Output:
2     3
3     5
4    47
5    98
6     7
7     8
8     6
9    32
dtype: int64
# Slicing with multiple index number

d[[3,5,7,9]]
Output:
3     5
5    98
7     8
9    32
dtype: int64
# Changing the element by slicing

d[[3,5,7,9]]=1000
d[0:2]=2000
d
Output:
0     2000
1     2000
2        3
3     1000
4       47
5     1000
6        7
7     1000
8        6
9     1000
10      78
11       2
12       8
13     289
14     258
15      78
dtype: int64

DataFrames:

  • DataFrame is a two-dimensional data structure, like a 2-dimensional array or a table with rows & columns

  • Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

  • A series is like a column, DataFrame is a whole table

Creating DataFrame:

# Creating empty DataFrame

df = pd.DataFrame()
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
# Creating DataFrame from dictionary

data={"Calories":[450,210,410,321],"Duration":[43,23,56,43]}
df1= pd.DataFrame(data)
df1
Output:
     Calories  Duration
0	450	43
1	210	23
2	410	56
3	321	43
# Creating DataFrame from nested list

data= [[20,32,13,14],[32,"a",31,24],[7.3,41,13]]

df2 = pd.DataFrame(data)
df2
Output:
	0	1	2	3
0	20.0	32	13	14.0
1	32.0	a	31	24.0
2	7.3	41	13	NaN
# Use index to add labels to index position

df3 = pd.DataFrame(data,index=['a','b','c'])
df3
Output:

         0	 1	 2	 3
a	20.0	32	13	14.0
b	32.0	a	31	24.0
c	7.3	41	13	NaN
# Use columns to add columns name

df4 = pd.DataFrame(data,index=['a','b','c'],columns=['col1','col2','col3','col4'])
df4
Output:

        col1	col2	col3	col4
a	20.0	32	13	14.0
b	32.0	a	31	24.0
c	7.3	41	13	NaN
# Change the name of columns using rename() function

df4.rename({'col1':'DK_Wt','col2':'KH_Wt','col3':'BR_wt','col4':'CH_Wt'},axis=1,inplace=True)  #or below

df4.rename(columns={'col1':'DK_Wt','col2':'KH_Wt','col3':'BR_wt','col4':'CH_Wt'},inplace=True)
df4
Output:
	DK_Wt	KH_Wt	BR_wt	CH_Wt
a	20.0	32	13	14.0
b	32.0	a	31	24.0
c	7.3	41	13	NaN

Accessing & modification of the element from DataFrame:

# Creating DataFrame for our next function
data=[['ALex',10,'Maths'],['Bob',12,'Science'],['Kelly',15,'Eco'],
      ['Boris',14,'Geo'],['Ken',18,'English']]

df=pd.DataFrame(data,columns=['Name','Age','Subject'])
df
Output:

        Name	Age	Subject
0	ALex	10	Maths
1	Bob	12	Science
2	Kelly	15	Eco
3	Boris	14	Geo
4	Ken	18	English
# Accessing one single column

#df['Age']
df.Age
Output:
0    10
1    12
2    15
3    14
4    18
Name: Age, dtype: int64
# Accessing several single column

df[['Age','Subject']] #Need to use double square bracket for more than one column
Output:

        Age	Subject
0	10	Maths
1	12	Science
2	15	Eco
3	14	Geo
4	18	English
# Accessing rows using iloc & index number

df.iloc[1:3]
Output:

        Name	Age	Subject
1	Bob	12	Science
2	Kelly	15	Eco
# Accessing specific columns

df.iloc[:,[1]]
Output:
	Age
0	10
1	12
2	15
3	14
4	18
# Accessing multiple columns

df.iloc[:,[2,0]]
Output:

        Subject	Name
0	Maths	ALex
1	Science	Bob
2	Eco	Kelly
3	Geo	Boris
4	English	Ken
# Accessing multiple rows & columns

df.iloc[[1,3,4],[2,0]]
Output:
	Subject	Name
1	Science	Bob
3	Geo	Boris
4	English	Ken
# Accessing specific rows & columns , slicing of DataFrame

df.iloc[1:3,1:2]
Output:

        Age
1	12
2	15
# Accessing single element from DataFrame

df.iloc[3,2]
Output:
Geo
# Accessing DataFrame elements using loc function

df.loc[1:3,['Subject','Name']]
Output:

        Subject	Name
1	Science	Bob
2	Eco	Kelly
3	Geo	Boris
# Manipulating data of inside of Data frame

df.iloc[:,1]=df.iloc[:,1] + 10 # use arithmatic operator like +,-,*
df
Output:

        Name	Age	Subject
0	ALex	18	Maths
1	Bob	20	Science
2	Kelly	23	Eco
3	Boris	22	Geo
4	Ken	26	English
# Manipulating string elements from inside of DataFrame

df.iloc[:2,0]=['Masud','Rana']
df
Output:

        Name	Age	Subject
0	Masud	18	Maths
1	Rana	20	Science
2	Kelly	23	Eco
3	Boris	22	Geo
4	Ken	26	English
# Modify all data of specific columns

df['Subject']='French'
df['Subject']
Output:
0    French
1    French
2    French
3    French
4    French
Name: Subject, dtype: object

Load/Read Files into a DataFrame:

  • A simple way to store big data sets is to use CSV files

  • CSV stands for comma-separated files

# reading the csv file

df_rd = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.csv')
df_rd
Output:

        a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
# reading csv file without header

df_rd_nh = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.csv',header=None)
df_rd_nh
Output:
	0	1	2	3	4
0	a	b	c	d	message
1	1	2	3	4	hello
2	5	6	7	8	world
3	9	10	11	12	foo
pwd # See present working directory
Output:
/content
# Use of r prefix which indicate the path should be interpreted as a raw string
# instead of r we can use \\

df = pd.read_csv(r'C:\Users\username\Documents\data.csv')

df = pd.read_csv('C:\\Users\\username\\Documents\\data.csv')

# Reading tab separated -One option to access tsv

#df_rd01=pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/test.tsv',sep='\t')

df_rd01=pd.read_table(('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/test.tsv'))
df_rd01
Output:
	test	test	test.1	test.2	test.3
Data	Data	Data	Data	Data	Data
Science	Science	Science	Science	Science	Science
# Using separator

df_rd02 = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-04-Pandas/Examples for file reading-02/ex7.csv',sep='*')
df_rd02
Output:

        a	b	c
0	1	2	3
1	1	2	3
# reading html data

df_html = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2015.html')

df_html[2]  # Indexing use to get specific table from website
    Rk                    Team   G     MP    FG   FGA    FG%    3P   3PA    3P%    2P   2PA    2P%    FT   FTA    FT%   ORB   DRB   TRB   AST  STL  BLK   TOV    PF    PTS
0  1.0  Golden State Warriors*  82  240.6  41.6  87.0  0.478  10.8  27.0  0.398  30.8  60.0  0.514  16.0  20.8  0.768  10.4  34.3  44.7  27.4  9.3  6.0  14.5  19.9  110.0
1  2.0   Los Angeles Clippers*  82  240.6  39.4  83.3  0.473  10.1  26.9  0.376  29.3  56.4  0.519  17.9  25.2  0.710   9.6  33.1  42.6  24.8  7.8  5.0  12.3  21.3  106.7
2  3.0       Dallas Mavericks*  82  242.4  39.7  85.8  0.463   8.9  25.4  0.352  30.8  60.4  0.509  16.9  22.5  0.752  10.5  31.8  42.3  22.5  8.1  4.5  13.0  20.0  105.2
3  4.0   Oklahoma City Thunder  82  241.8  38.8  86.8  0.447   7.7  22.7  0.339  31.1  64.1  0.486  18.6  24.6  0.754  12.8  34.7  47.5  20.5  7.3  5.5  14.7  22.3  104.0
4  5.0        Toronto Raptors*  82  242.1  37.9  83.3  0.455   8.9  25.1  0.352  29.0  58.2  0.499  19.3  24.6  0.787  10.7  30.8  41.5  20.7  7.5  4.4  12.9  20.9  104.0
[ ]


# reading the excel file
# sheet_name to use to specify which tab I want to use from excel file

df_ex= pd.read_excel('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.xlsx',sheet_name='XYA')
df_ex
Output:
	1	2	3	3.1
0	4	5	6	9
1	7	8	5	2
# reading json file

df_json = pd.read_json('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/example.json')
df_json
Output:

        a	b	c	d
0	1	2	3	NaN
1	4	5	6	10.0
2	7	8	9	NaN

Json File:

  • JSON stands for JavaScript Object Notation
  • It is a text format for storing & transporting data.
# Saving data using .to_csv

df_sv=df_html[2] # creating DataFrame copying from previous dataframe

df_sv.to_csv('htmldata.csv')

# reading row inside row

df_mind = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/csv_mindex.csv',index_col=['key1','key2'])
df_mind
Output:

           value1	value2
key1	key2		
one	a	1	2
        b	3	4
        c	5	6
        d	7	8
two	a	9	10
        b	11	12
        c	13	14
        d	15	16

How to check if dataset contain any categorical feature & column name

df_html[2].select_dtypes(include=['object','category'])  # To see values

df_html[2].select_dtypes(include=['object','category']).columns # To see only column name

Data Manipulation:

import pandas as pd
import numpy as np
# Creating DataFrame for next function

df_dm = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-04-Pandas/Pandas Class/train.csv')
# Checking the head part(First 5 rows)

df_dm.head()
   PassengerId  Survived  Pclass                                                 Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                              Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                               Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                             Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

# Checking the tail part( Last five rows)

df_dm.tail()

     PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch      Ticket   Fare Cabin Embarked
886          887         0       2                     Montvila, Rev. Juozas    male  27.0      0      0      211536  13.00   NaN        S
887          888         1       1              Graham, Miss. Margaret Edith  female  19.0      0      0      112053  30.00   B42        S
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2  W./C. 6607  23.45   NaN        S
889          890         1       1                     Behr, Mr. Karl Howell    male  26.0      0      0      111369  30.00  C148        C
890          891         0       3                       Dooley, Mr. Patrick    male  32.0      0      0      370376   7.75   NaN        Q
# Checking the shape of DataFrame

df_dm.shape
Output:
(891, 12)
# Checking the information of DataFrame using .info()

df_dm.info()
Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Checking statistics of numerical variables using describe()

df_dm.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

# Checking statistics of non-numerical or integer type data

df_dm.describe(include='O')

                           Name   Sex  Ticket    Cabin Embarked
count                       891   891     891      204      889
unique                      891     2     681      147        3
top     Braund, Mr. Owen Harris  male  347082  B96 B98        S
freq                          1   577       7        4      644
# Checking the unique values in that features /column using .unique()

df_dm.Sex.unique() #or below one

df_dm['Sex'].unique()
Output:
array(['male', 'female'], dtype=object)
# See all columns name

df_dm.columns
Output:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
# Find out total count of unique value using .value_counts()

df_dm.Sex.value_counts()  #or below one

df_dm['Sex'].value_counts()
Output:
male      577
female    314
Name: Sex, dtype: int64
# getting a specific column from Data Frame

#df_dm.Survived # or below one
df_dm['Survived']
Output:
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64
# Displaying all rows

pd.set_option('display.max_rows',None) 

#to return back original format first 5 rows & last 5 rows

pd.reset_option('display.max_rows',None) 
# Displaying all columns

pd.set_option('display.max_columns',None) 

#to return back original format

pd.reset_option('display.max_columns',None) 

Filtering data with one or more condition:
# & refers 'and' , both condition  should be meet to show relevant rows

df_dm.loc[(df_dm['Sex']=='female') & (df_dm['Age']==26)] 
    PassengerId  Survived  Pclass                                             Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
2              3         1       3                           Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
290          291         1       1                     Barber, Miss. Ellen "Nellie"  female  26.0      0      0             19877  78.8500   NaN        S
312          313         0       2            Lahtinen, Mrs. William (Anna Sylfven)  female  26.0      1      1            250651  26.0000   NaN        S
315          316         1       3                  Nilsson, Miss. Helmina Josefina  female  26.0      0      0            347470   7.8542   NaN        S
617          618         0       3  Lobb, Mrs. William Arthur (Cordelia K Stanlick)  female  26.0      1      0         A/5. 3336  16.1000   NaN        S


# | refer 'or' any condition fulfil , that rows will be shown

df_dm.loc[(df_dm['Sex']=='female') | (df_dm['Age']==26)] 
     PassengerId  Survived  Pclass                                              Name     Sex   Age  SibSp  Parch      Ticket     Fare Cabin Embarked
871          872         1       1  Beckwith, Mrs. Richard Leonard (Sallie Monypeny)  female  47.0      1      1       11751  52.5542   D35        S
874          875         1       2             Abelson, Mrs. Samuel (Hannah Wizosky)  female  28.0      1      0   P/PP 3381  24.0000   NaN        C
875          876         1       3                  Najib, Miss. Adele Kiamie "Jane"  female  15.0      0      0        2667   7.2250   NaN        C
879          880         1       1     Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0      0      1       11767  83.1583   C50        C
880          881         1       2      Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0      0      1      230433  26.0000   NaN        S
882          883         0       3                      Dahlberg, Miss. Gerda Ulrika  female  22.0      0      0        7552  10.5167   NaN        S
885          886         0       3              Rice, Mrs. William (Margaret Norton)  female  39.0      0      5      382652  29.1250   NaN        Q
887          888         1       1                      Graham, Miss. Margaret Edith  female  19.0      0      0      112053  30.0000   B42        S
888          889         0       3          Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2  W./C. 6607  23.4500   NaN        S
889          890         1       1                             Behr, Mr. Karl Howell    male  26.0      0      0      111369  30.0000  C148        C

# make any specific column as index

df_dm.set_index('Sex',inplace=True)
df_dm
        PassengerId  Survived  Pclass                                                 Name   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
Sex                                                                                                                                                     
male              1         0       3                              Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500   NaN        S
female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  38.0      1      0          PC 17599  71.2833   C85        C
female            3         1       3                               Heikkinen, Miss. Laina  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0            113803  53.1000  C123        S
male              5         0       3                             Allen, Mr. William Henry  35.0      0      0            373450   8.0500   NaN        S


# Return to original format

df_dm.reset_index('Sex',inplace=True) 
df_dm
      Sex  PassengerId  Survived  Pclass                                                 Name   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0    male            1         0       3                              Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500   NaN        S
1  female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  38.0      1      0          PC 17599  71.2833   C85        C
2  female            3         1       3                               Heikkinen, Miss. Laina  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3  female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0            113803  53.1000  C123        S
4    male            5         0       3                             Allen, Mr. William Henry  35.0      0      0            373450   8.0500   NaN        S

Use of inplace:

  • inplace=’True’ will make changes in the original data frame
  • Default is inplace=’False’ which will not make any changes permanent in the original Dataframe
#Creating sample dataframe

sample_df = df_dm[['Sex','Age']].head()
sample_df
      Sex   Age
0    male  22.0
1  female  38.0
2  female  26.0
3  female  35.0
4    male  35.0

#dropping one column but will original dataframe will remain same 

sample_df.drop('Sex',axis=1)
Output:
	Age
0	22.0
1	38.0
2	26.0
3	35.0
4	35.0
print(sample_df)
Output:

	Sex   Age
0	male	22.0
1	female	38.0
2	female	26.0
3	female	35.0
4	male	35.0
sample_df.drop('Sex',axis=1,inplace=True)
print(sample_df)
Output:

       Age
0	22.0
1	38.0
2	26.0
3	35.0
4	35.0
# Creating new column

df_dm['Subject']='Data Science'
print(df_dm.head())
      Sex  PassengerId  Survived  Pclass                                                 Name   Age  SibSp  Parch            Ticket     Fare Cabin Embarked       Subject
0    male            1         0       3                              Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500   NaN        S  Data Science
1  female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  38.0      1      0          PC 17599  71.2833   C85        C  Data Science
2  female            3         1       3                               Heikkinen, Miss. Laina  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S  Data Science
3  female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0            113803  53.1000  C123        S  Data Science
4    male            5         0       3                             Allen, Mr. William Henry  35.0      0      0            373450   8.0500   NaN        S  Data Science

# Dropping the column

df_dm.drop('Subject',axis=1,inplace=True)
print(df_dm.head())
      Sex  PassengerId  Survived  Pclass                                                 Name   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0    male            1         0       3                              Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500   NaN        S
1  female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  38.0      1      0          PC 17599  71.2833   C85        C
2  female            3         1       3                               Heikkinen, Miss. Laina  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3  female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0            113803  53.1000  C123        S
4    male            5         0       3                             Allen, Mr. William Henry  35.0      0      0            373450   8.0500   NaN        S

What axis means? 

  • axis =0 will refer to row
  • axis =1 will refer to column
# Dropping row

df_dm.drop(1,axis=0) # Single row dropping

df_dm.drop([1,3,5],axis=0) # Multiple row dropping

#Note: 1,3,4... are index number & if we use inplace = True, changes will be permanent
# Keeping backup or copy to create new dataframe

new_df_dm =df_dm.copy()
new_df_dm
Output:

      Sex  PassengerId  Survived  Pclass                                                 Name   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0    male            1         0       3                              Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500   NaN        S
1  female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  38.0      1      0          PC 17599  71.2833   C85        C
2  female            3         1       3                               Heikkinen, Miss. Laina  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3  female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0            113803  53.1000  C123        S
4    male            5         0       3                             Allen, Mr. William Henry  35.0      0      0            373450   8.0500   NaN        S

Note: Any changes in the original DataFrame will not impact on backup DataFrame

# Creating new column using existing

df_dm['Test']=df_dm['Pclass']+df_dm['Age']
df_dm.head()
Output:

      Sex  PassengerId  Survived  Pclass                                                 Name   Age  SibSp  Parch            Ticket     Fare Cabin Embarked  Test
0    male            1         0       3                              Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500   NaN        S  25.0
1  female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  38.0      1      0          PC 17599  71.2833   C85        C  39.0
2  female            3         1       3                               Heikkinen, Miss. Laina  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S  29.0
3  female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0            113803  53.1000  C123        S  36.0
4    male            5         0       3                             Allen, Mr. William Henry  35.0      0      0            373450   8.0500   NaN        S  38.0

Missing Value:

# Checking the missing value

df_dm.isnull().sum()
Output:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       2
dtype: int64
# Checking the missing value of single column

df_dm.Age.isnull().sum()
Output:
177
# getting the indexes of row where values are missing in specific column using where

missing_index=np.where(df_dm.Age.isnull())  #or below code

missing_index=np.where(df_dm.Age.isnull()==True)

print(missing_index)
Output:
(array([  5,  17,  19,  26,  28,  29,  31,  32,  36,  42,  45,  46,  47,
         48,  55,  64,  65,  76,  77,  82,  87,  95, 101, 107, 109, 121,
        126, 128, 140, 154, 158, 159, 166, 168, 176, 180, 181, 185, 186,
        196, 198, 201, 214, 223, 229, 235, 240, 241, 250, 256, 260, 264,
        270, 274, 277, 284, 295, 298, 300, 301, 303, 304, 306, 324, 330,
        334, 335, 347, 351, 354, 358, 359, 364, 367, 368, 375, 384, 388,
        409, 410, 411, 413, 415, 420, 425, 428, 431, 444, 451, 454, 457,
        459, 464, 466, 468, 470, 475, 481, 485, 490, 495, 497, 502, 507,
        511, 517, 522, 524, 527, 531, 533, 538, 547, 552, 557, 560, 563,
        564, 568, 573, 578, 584, 589, 593, 596, 598, 601, 602, 611, 612,
        613, 629, 633, 639, 643, 648, 650, 653, 656, 667, 669, 674, 680,
        692, 697, 709, 711, 718, 727, 732, 738, 739, 740, 760, 766, 768,
        773, 776, 778, 783, 790, 792, 793, 815, 825, 826, 828, 832, 837,
        839, 846, 849, 859, 863, 868, 878, 888]),)
# Getting the actual data(row) from the indexes

print(df_dm.loc[missing_index].head())
Output:
       Sex  PassengerId  Survived  Pclass                           Name  Age  SibSp  Parch  Ticket     Fare Cabin Embarked  Test
5     male            6         0       3               Moran, Mr. James  NaN      0      0  330877   8.4583   NaN        Q   NaN
17    male           18         1       2   Williams, Mr. Charles Eugene  NaN      0      0  244373  13.0000   NaN        S   NaN
19  female           20         1       3        Masselmani, Mrs. Fatima  NaN      0      0    2649   7.2250   NaN        C   NaN
26    male           27         0       3        Emir, Mr. Farred Chehab  NaN      0      0    2631   7.2250   NaN        C   NaN
28  female           29         1       3  O'Dwyer, Miss. Ellen "Nellie"  NaN      0      0  330959   7.8792   NaN        Q   NaN
# Getting missing values by loc function

df_dm.loc[df_dm['Age'].isnull()] # or below

df_dm.loc[df_dm['Age'].isnull()==True]
Output:
       Sex  PassengerId  Survived  Pclass                           Name  Age  SibSp  Parch  Ticket     Fare Cabin Embarked  Test
5     male            6         0       3               Moran, Mr. James  NaN      0      0  330877   8.4583   NaN        Q   NaN
17    male           18         1       2   Williams, Mr. Charles Eugene  NaN      0      0  244373  13.0000   NaN        S   NaN
19  female           20         1       3        Masselmani, Mrs. Fatima  NaN      0      0    2649   7.2250   NaN        C   NaN
26    male           27         0       3        Emir, Mr. Farred Chehab  NaN      0      0    2631   7.2250   NaN        C   NaN
28  female           29         1       3  O'Dwyer, Miss. Ellen "Nellie"  NaN      0      0  330959   7.8792   NaN        Q   NaN

Handling Missing Values

  • Imputation

  • Dropping

# Use of fillna() with specific number

df_dm['Age'].fillna(value=200,inplace=True) # Use inplace to change in original Data Frame

df_dm['Age']=df_dm['Age'].fillna(value=200)  # Assign to the particular variable without impacting original dataframe if inplace not use
df_dm['Age']
Output:
0       22.0
1       38.0
2       26.0
3       35.0
4       35.0
       ...  
886     27.0
887     19.0
888    200.0
889     26.0
890     32.0
Name: Age, Length: 891, dtype: float64
# Use of fillna() with specific string

df_dm['Cabin'].fillna('Delux',inplace=True) 

df_dm['Cabin'] = df_dm['Cabin'].fillna('Delux',inplace=True) 

df_dm['Cabin']
Output:
0      Delux
1        C85
2      Delux
3       C123
4      Delux
       ...  
886    Delux
887      B42
888    Delux
889     C148
890    Delux
Name: Cabin, Length: 891, dtype: object

How to find value in filling in missing places or NaN

  1. For numerical data, if data is normal distribution , mean can be used using .mean()
  2. For numerical data ,if data is skewed or not normal distribution, median can be used using .median()
  3. For categorical data , mode can be used using .mode()
  4. For categorical data, if all unique values are same ,method=’ffill’ , method=’bfill’can be used
  5. If Data sets are big , dropna() can be used to remove few rows
# Fill missing value using mean

# Mean = the average value (the sum of all values divided by number of values).

x = df_dm['Age'].mean()           # Best practice
df_dm['Age'].fillna(x,inplace=True)

df_dm['Age'].fillna(df_dm['Age'].mean(),inplace=True)

df_dm['Age']=df_dm['Age'].fillna(df_dm['Age'].mean())
# Fill missing value using median

# Median = the value in the middle, after you have sorted all values ascending.

x = df_dm['Age'].median()         # Best practice
df_dm['Age'].fillna(x,inplace=True)

df_dm['Age'].fillna(df_dm['Age'].median(),inplace=True)

df_dm['Age']=df_dm['Age'].fillna(df_dm['Age'].median())
# Fill missing value using mode

# Mode = The value that appears most frequently

x = df_dm['Cabin'].mode()[0]          # Best practice
df_dm['Cabin'].fillna(x,inplace=True)

df_dm['Cabin'].fillna(df_dm['Cabin'].mode()[0],inplace=True)   # In case of Mode ,there might more than one mode value , [0] is used to provide first mode value

df_dm['Cabin']=df_dm['Cabin'].fillna(df_dm['Cabin'].mode()[0])
# Use of dropna() to remove rows containing null values

#df_dm.dropna(inplace=True) # or below

df_dm_new = df_dm.dropna() # to create new DataFrame removing null values
# Using loc function to impute missing on specific column

df_dm.loc[df_dm['Age'].isnull(),'Age']=30  #.isnull()ensure all columns containing rows only where null value present, 'Age' specified only Age column

Duplicate row

df_dm.duplicated().sum()
Output:
0

Sorting the DataFrame

df_dm.head()
      Sex  PassengerId  Survived  Pclass                                                 Name   Age  SibSp  Parch            Ticket     Fare  Cabin Embarked  Test
0    male            1         0       3                              Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500  Delux        S  25.0
1  female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  38.0      1      0          PC 17599  71.2833    C85        C  39.0
2  female            3         1       3                               Heikkinen, Miss. Laina  26.0      0      0  STON/O2. 3101282   7.9250  Delux        S  29.0
3  female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0            113803  53.1000   C123        S  36.0
4    male            5         0       3                             Allen, Mr. William Henry  35.0      0      0            373450   8.0500  Delux        S  38.0

# arranging

df_dm.sort_values('PassengerId')  # default ascending = True

df_dm.sort_values('PassengerId',ascending=False) # If descending needed , but index numberorder will be disturbed

df_dm.sort_values('PassengerId',ascending=False, ignore_index=True) # index order will be ignored & new order will be as per PassengerId column
Output:

      Sex  PassengerId  Survived  Pclass                                      Name    Age  SibSp  Parch            Ticket     Fare  Cabin Embarked  Test
0    male          891         0       3                       Dooley, Mr. Patrick   32.0      0      0            370376   7.7500  Delux        Q  35.0
1    male          890         1       1                     Behr, Mr. Karl Howell   26.0      0      0            111369  30.0000   C148        C  27.0
2  female          889         0       3  Johnston, Miss. Catherine Helen "Carrie"  200.0      1      2        W./C. 6607  23.4500  Delux        S   NaN
3  female          888         1       1              Graham, Miss. Margaret Edith   19.0      0      0            112053  30.0000    B42        S  20.0
4    male          887         0       2                     Montvila, Rev. Juozas   27.0      0      0            211536  13.0000  Delux        S  29.0
5  female          886         0       3      Rice, Mrs. William (Margaret Norton)   39.0      0      5            382652  29.1250  Delux        Q  42.0
6    male          885         0       3                    Sutehall, Mr. Henry Jr   25.0      0      0   SOTON/OQ 392076   7.0500  Delux        S  28.0
7    male          884         0       2             Banfield, Mr. Frederick James   28.0      0      0  C.A./SOTON 34068  10.5000  Delux        S  30.0
8  female          883         0       3              Dahlberg, Miss. Gerda Ulrika   22.0      0      0              7552  10.5167  Delux        S  25.0
9    male          882         0       3                        Markun, Mr. Johann   33.0      0      0            349257   7.8958  Delux        S  36.0
# Arranged with multiple columns

df_dm.sort_values(by=['PassengerId','Pclass'],ascending=[1,1])  # sort by first column , then within this order next column will be arranged
Output:
      Sex  PassengerId  Survived  Pclass                                                 Name    Age  SibSp  Parch            Ticket     Fare  Cabin Embarked  Test
0    male            1         0       3                              Braund, Mr. Owen Harris   22.0      1      0         A/5 21171   7.2500  Delux        S  25.0
1  female            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)   38.0      1      0          PC 17599  71.2833    C85        C  39.0
2  female            3         1       3                               Heikkinen, Miss. Laina   26.0      0      0  STON/O2. 3101282   7.9250  Delux        S  29.0
3  female            4         1       1         Futrelle, Mrs. Jacques Heath (Lily May Peel)   35.0      1      0            113803  53.1000   C123        S  36.0
4    male            5         0       3                             Allen, Mr. William Henry   35.0      0      0            373450   8.0500  Delux        S  38.0
5    male            6         0       3                                     Moran, Mr. James  200.0      0      0            330877   8.4583  Delux        Q   NaN
6    male            7         0       1                              McCarthy, Mr. Timothy J   54.0      0      0             17463  51.8625    E46        S  55.0
7    male            8         0       3                       Palsson, Master. Gosta Leonard    2.0      3      1            349909  21.0750  Delux        S   5.0
8  female            9         1       3    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   27.0      0      2            347742  11.1333  Delux        S  30.0
9  female           10         1       2                  Nasser, Mrs. Nicholas (Adele Achem)   14.0      1      0            237736  30.0708  Delux        C  16.0

Grouping the data

df_dm.groupby(df_dm['Pclass']).count()  # show Passenger class wise totalnumber element

df_dm.groupby(df_dm['Pclass']).mean()   # show Passenger class wise mean for all other numerical column

df_dm.groupby(df_dm['Pclass']).max('Fare') # show Passenger class wise max fare

df_dm.groupby(df_dm['Pclass']).min('Fare') # show Passenger class wise min Fare

df_dm.groupby(df_dm['Pclass']).describe()  # show Passenger class wise statistics for other numerical feature

df_dm.groupby(df_dm['Pclass']).agg({'Fare':'max'})  # agg() shows passenger class wise max Fare

df_dm.groupby(df_dm['Pclass']).Age.agg(['max','min','count','median','mean']) # agg() aggregation function specify multiple function at once

Output:
          max   min  count  median       mean
Pclass                                       
1       200.0  0.92    216    40.0  60.701019
2       200.0  0.67    184    30.0  40.047989
3       200.0  0.42    491    30.0  73.574175
  • Check each code by yourself for result

Changing Data type

# See data type to change

df_dm.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Changing data type to reduce size

df_dm['Fare']=df_dm['Fare'].astype('float16')  # float64 changed to float16 to reduce size
# Changing data type to convert smallest numerical data

df_dm['Fare']=pd.to_numeric(df_dm['Fare'],downcast='float') #he first argument specifies the object to convert, and the second argument 'downcast' specifies the smallest numerical data type to which the object should be converted.

df_dm['Parch']=pd.to_numeric(df_dm['Parch'],downcast='float') #he first argument specifies the object to convert, and the second argument 'downcast' specifies the smallest numerical data type to which the object should be converted.
#lets see difference in data type for fare & parch
df_dm.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    float32
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float32
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float32(2), float64(1), int64(4), object(5)
memory usage: 76.7+ KB

Joining the Data Frame

# Creating DataFrame

df1=pd.DataFrame({'A':['A0','A1','A2','A3','A4','A5'],
                 'B':['B0','B1','B2','B3','B4','B5'],
                'C':['C0','C1','C2','C3','C4','C5'],
                'D':['D0','D1','D2','D3','D4','D5']})

df1
Output:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
# Creating second dataframe

df2=pd.DataFrame({'A':['A6','A7','A8','A9','A10','A11'],
                 'B':['B6','B7','B8','B9','B10','B11'],
                'C':['C6','C7','C8','C9','C10','C11']})

df2
Output:
     A    B    C
0   A6   B6   C6
1   A7   B7   C7
2   A8   B8   C8
3   A9   B9   C9
4  A10  B10  C10
5  A11  B11  C11
# Join data frame using .concat()

df3 = pd.concat([df1,df2],axis=0, ignore_index=True)  # Join row wise
df3
Output:
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6  NaN
7    A7   B7   C7  NaN
8    A8   B8   C8  NaN
9    A9   B9   C9  NaN
10  A10  B10  C10  NaN
11  A11  B11  C11  NaN
df4 = pd.concat([df1,df2],axis=1, ignore_index=True)  # Join column wise
df4
Output:
    0   1   2   3    4    5    6
0  A0  B0  C0  D0   A6   B6   C6
1  A1  B1  C1  D1   A7   B7   C7
2  A2  B2  C2  D2   A8   B8   C8
3  A3  B3  C3  D3   A9   B9   C9
4  A4  B4  C4  D4  A10  B10  C10
5  A5  B5  C5  D5  A11  B11  C11

Register

Login here

Forgot your password?

ads

ads

I am an enthusiastic advocate for the transformative power of data in the fashion realm. Armed with a strong background in data science, I am committed to revolutionizing the industry by unlocking valuable insights, optimizing processes, and fostering a data-centric culture that propels fashion businesses into a successful and forward-thinking future. - Masud Rana, Certified Data Scientist, IABAC

© Data4Fashion 2023-2024

Developed by: Behostweb.com

Please accept cookies
Accept All Cookies