What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
It was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.
Pandas can clean messy data sets and make them readable and relevant.
Installation of Pandas:
!pip install pandas # the leading ! runs the command from a notebook cell; in a terminal, use pip install pandas
Import Pandas:
# Use the keyword 'import'
import pandas
import pandas as pd # pd is the common alias for pandas
# In Python, an alias is an alternate name for referring to the same thing
Checking Pandas Version:
# The version string is stored in the __version__ attribute
pd.__version__
Output: 1.3.5
Pandas Series:
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Creating Series:
# Creating an empty Series (passing dtype explicitly avoids a DeprecationWarning in newer pandas versions)
s1 = pd.Series(dtype='float64')
s1
Output: Series([], dtype: float64)
# Creating series with one element
s2 = pd.Series(18)
s2
Output: 0 18 dtype: int64
# Creating series using tuple
t=(12,23,45)
s3 = pd.Series(t)
s3
Output: 0 12 1 23 2 45 dtype: int64
# Creating series using list
l = [12,34,67,89]
s4 = pd.Series([12,34,67,89])
#s4 = pd.Series(l) # both ways are correct
s4
Output: 0 12 1 34 2 67 3 89 dtype: int64
# Creating series with array
import numpy as np
arr=np.array([13,21,53,54])
s5 = pd.Series(arr)
s5
Output: 0 13 1 21 2 53 3 54 dtype: int64
# Creating series using dictionary
d = {'a':23,'b':54,'c':76,'d':76}
s6 = pd.Series(d)
s6
#Note: The keys of the dictionary become the labels.
Output: a 23 b 54 c 76 d 76 dtype: int64
# Creating a Series from a dictionary, keeping only the labels mentioned in index
d = {'a':23,'b':54,'c':76,'d':76}
s6 = pd.Series(d,index=['b','d'])
s6
Output: b 54 d 76 dtype: int64
# Adding index value/Create Labels
arr2 = np.array([23,34,45])
s7=pd.Series(arr2,index=['One','Two','Three'])
s7
Output: One 23 Two 34 Three 45 dtype: int64
# Checking data type
type(s7)
Output: pandas.core.series.Series
Accessing Data:
#Accessing by index number
s5[2]
#s5.2 would give an error; dot (attribute) access works only with valid label names
Output: 53
# Accessing data by index label; both ways below work
s7.Two
s7['Two']
Output: 34
Slicing Operation:
# Creating Series for slicing purposes
d=pd.Series([1,2,3,5,47,98,7,8,6,32,78,2,8,289,258,78])
# Slicing by index range
d[2:10]
Output: 2 3 3 5 4 47 5 98 6 7 7 8 8 6 9 32 dtype: int64
# Selecting elements at multiple index positions
d[[3,5,7,9]]
Output: 3 5 5 98 7 8 9 32 dtype: int64
# Changing the element by slicing
d[[3,5,7,9]]=1000
d[0:2]=2000
d
Output: 0 2000 1 2000 2 3 3 1000 4 47 5 1000 6 7 7 1000 8 6 9 1000 10 78 11 2 12 8 13 289 14 258 15 78 dtype: int64
DataFrames:
A DataFrame is a two-dimensional data structure, like a 2-dimensional array or a table with rows & columns.
Data sets in Pandas are usually stored in these two-dimensional tables, called DataFrames.
A Series is like a column; a DataFrame is the whole table.
Creating DataFrame:
# Creating empty DataFrame
df = pd.DataFrame()
print(df)
Output: Empty DataFrame Columns: [] Index: []
# Creating DataFrame from dictionary
data={"Calories":[450,210,410,321],"Duration":[43,23,56,43]}
df1= pd.DataFrame(data)
df1
Output: Calories Duration 0 450 43 1 210 23 2 410 56 3 321 43
# Creating DataFrame from nested list
data= [[20,32,13,14],[32,"a",31,24],[7.3,41,13]]
df2 = pd.DataFrame(data)
df2
Output: 0 1 2 3 0 20.0 32 13 14.0 1 32.0 a 31 24.0 2 7.3 41 13 NaN
# Use index to add labels to index position
df3 = pd.DataFrame(data,index=['a','b','c'])
df3
Output: 0 1 2 3 a 20.0 32 13 14.0 b 32.0 a 31 24.0 c 7.3 41 13 NaN
# Use columns to add column names
df4 = pd.DataFrame(data,index=['a','b','c'],columns=['col1','col2','col3','col4'])
df4
Output: col1 col2 col3 col4 a 20.0 32 13 14.0 b 32.0 a 31 24.0 c 7.3 41 13 NaN
# Change column names using the rename() function (either of the two lines below works; they are equivalent)
df4.rename({'col1':'DK_Wt','col2':'KH_Wt','col3':'BR_wt','col4':'CH_Wt'},axis=1,inplace=True)
df4.rename(columns={'col1':'DK_Wt','col2':'KH_Wt','col3':'BR_wt','col4':'CH_Wt'},inplace=True)
df4
Output: DK_Wt KH_Wt BR_wt CH_Wt a 20.0 32 13 14.0 b 32.0 a 31 24.0 c 7.3 41 13 NaN
Accessing & modifying elements of a DataFrame:
# Creating DataFrame for our next function
data=[['ALex',10,'Maths'],['Bob',12,'Science'],['Kelly',15,'Eco'],
['Boris',14,'Geo'],['Ken',18,'English']]
df=pd.DataFrame(data,columns=['Name','Age','Subject'])
df
Output: Name Age Subject 0 ALex 10 Maths 1 Bob 12 Science 2 Kelly 15 Eco 3 Boris 14 Geo 4 Ken 18 English
# Accessing one single column
#df['Age']
df.Age
Output: 0 10 1 12 2 15 3 14 4 18 Name: Age, dtype: int64
# Accessing several columns
df[['Age','Subject']] # double square brackets are needed for more than one column
Output: Age Subject 0 10 Maths 1 12 Science 2 15 Eco 3 14 Geo 4 18 English
# Accessing rows using iloc & index number
df.iloc[1:3]
Output: Name Age Subject 1 Bob 12 Science 2 Kelly 15 Eco
# Accessing specific columns
df.iloc[:,[1]]
Output: Age 0 10 1 12 2 15 3 14 4 18
# Accessing multiple columns
df.iloc[:,[2,0]]
Output: Subject Name 0 Maths ALex 1 Science Bob 2 Eco Kelly 3 Geo Boris 4 English Ken
# Accessing multiple rows & columns
df.iloc[[1,3,4],[2,0]]
Output: Subject Name 1 Science Bob 3 Geo Boris 4 English Ken
# Accessing specific rows & columns , slicing of DataFrame
df.iloc[1:3,1:2]
Output: Age 1 12 2 15
# Accessing single element from DataFrame
df.iloc[3,2]
Output: Geo
# Accessing DataFrame elements using the loc function (label-based; note that the slice 1:3 includes both endpoints)
df.loc[1:3,['Subject','Name']]
Output: Subject Name 1 Science Bob 2 Eco Kelly 3 Geo Boris
# Manipulating data inside the DataFrame
df.iloc[:,1]=df.iloc[:,1] + 10 # arithmetic operators like +, -, * can be used
df
Output: Name Age Subject 0 ALex 18 Maths 1 Bob 20 Science 2 Kelly 23 Eco 3 Boris 22 Geo 4 Ken 26 English
# Manipulating string elements inside the DataFrame
df.iloc[:2,0]=['Masud','Rana']
df
Output: Name Age Subject 0 Masud 18 Maths 1 Rana 20 Science 2 Kelly 23 Eco 3 Boris 22 Geo 4 Ken 26 English
# Modifying all values of a specific column
df['Subject']='French'
df['Subject']
Output: 0 French 1 French 2 French 3 French 4 French Name: Subject, dtype: object
Load/Read Files into a DataFrame:
A simple way to store big data sets is to use CSV files.
CSV stands for comma-separated values.
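Note: the read examples below use the author's Google Drive paths. As a self-contained alternative, here is a minimal sketch (the file name 'small_example.csv' is only an assumption) that writes a tiny CSV with .to_csv() and reads it back:
# Sketch: write a tiny CSV and read it back (the file name is hypothetical)
import pandas as pd # already imported above
pd.DataFrame({'a':[1,5],'b':[2,6],'message':['hello','world']}).to_csv('small_example.csv',index=False) # index=False keeps the row index out of the file
df_small = pd.read_csv('small_example.csv')
df_small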
# reading the csv file
df_rd = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.csv')
df_rd
Output: a b c d message 0 1 2 3 4 hello 1 5 6 7 8 world 2 9 10 11 12 foo
# reading the csv file with header=None (the first row is treated as data rather than as column names)
df_rd_nh = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.csv',header=None)
df_rd_nh
Output: 0 1 2 3 4 0 a b c d message 1 1 2 3 4 hello 2 5 6 7 8 world 3 9 10 11 12 foo
pwd # See present working directory
Output: /content
# The r prefix indicates that the path should be interpreted as a raw string
# (alternatively, escape each backslash with \\)
df = pd.read_csv(r'C:\Users\username\Documents\data.csv')
df = pd.read_csv('C:\\Users\\username\\Documents\\data.csv')
# Reading a tab-separated (TSV) file - either read_csv with sep='\t' or read_table works
#df_rd01=pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/test.tsv',sep='\t')
df_rd01=pd.read_table('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/test.tsv')
df_rd01
Output: test test test.1 test.2 test.3 Data Data Data Data Data Data Science Science Science Science Science Science
# Using a custom separator (sep='*')
df_rd02 = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-04-Pandas/Examples for file reading-02/ex7.csv',sep='*')
df_rd02
Output: a b c 0 1 2 3 1 1 2 3
# reading html data
df_html = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2015.html')
df_html[2] # read_html returns a list of DataFrames; indexing selects a specific table from the page
Rk Team G MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS 0 1.0 Golden State Warriors* 82 240.6 41.6 87.0 0.478 10.8 27.0 0.398 30.8 60.0 0.514 16.0 20.8 0.768 10.4 34.3 44.7 27.4 9.3 6.0 14.5 19.9 110.0 1 2.0 Los Angeles Clippers* 82 240.6 39.4 83.3 0.473 10.1 26.9 0.376 29.3 56.4 0.519 17.9 25.2 0.710 9.6 33.1 42.6 24.8 7.8 5.0 12.3 21.3 106.7 2 3.0 Dallas Mavericks* 82 242.4 39.7 85.8 0.463 8.9 25.4 0.352 30.8 60.4 0.509 16.9 22.5 0.752 10.5 31.8 42.3 22.5 8.1 4.5 13.0 20.0 105.2 3 4.0 Oklahoma City Thunder 82 241.8 38.8 86.8 0.447 7.7 22.7 0.339 31.1 64.1 0.486 18.6 24.6 0.754 12.8 34.7 47.5 20.5 7.3 5.5 14.7 22.3 104.0 4 5.0 Toronto Raptors* 82 242.1 37.9 83.3 0.455 8.9 25.1 0.352 29.0 58.2 0.499 19.3 24.6 0.787 10.7 30.8 41.5 20.7 7.5 4.4 12.9 20.9 104.0 [ ]
# reading the excel file
# sheet_name specifies which tab of the Excel file to read
df_ex= pd.read_excel('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.xlsx',sheet_name='XYA')
df_ex
Output: 1 2 3 3.1 0 4 5 6 9 1 7 8 5 2
# reading json file
df_json = pd.read_json('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/example.json')
df_json
Output: a b c d 0 1 2 3 NaN 1 4 5 6 10.0 2 7 8 9 NaN
JSON File:
- JSON stands for JavaScript Object Notation.
- It is a text format for storing & transporting data.
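Besides reading a .json file from disk, a JSON string can be loaded directly. A minimal sketch with made-up data, wrapping the string in StringIO so read_json receives a file-like object:
# Sketch: loading a JSON string into a DataFrame (illustrative data)
from io import StringIO
json_str = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6, "d": 10}]'
df_from_str = pd.read_json(StringIO(json_str)) # keys missing from a record become NaN
df_from_str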
# Saving data using .to_csv
df_sv=df_html[2] # creating DataFrame copying from previous dataframe
df_sv.to_csv('htmldata.csv')
# reading a csv with a hierarchical (multi-level) row index using index_col
df_mind = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/csv_mindex.csv',index_col=['key1','key2'])
df_mind
Output: value1 value2 key1 key2 one a 1 2 b 3 4 c 5 6 d 7 8 two a 9 10 b 11 12 c 13 14 d 15 16
How to check whether the data set contains any categorical features, and see their column names
df_html[2].select_dtypes(include=['object','category']) # to see the values
df_html[2].select_dtypes(include=['object','category']).columns # to see only the column names
Data Manipulation:
import pandas as pd
import numpy as np
# Creating DataFrame for next function
df_dm = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-04-Pandas/Pandas Class/train.csv')
# Checking the head (first 5 rows)
df_dm.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# Checking the tail (last 5 rows)
df_dm.tail()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
# Checking the shape of DataFrame
df_dm.shape
Output: (891, 12)
# Checking the information of DataFrame using .info()
df_dm.info()
Output: <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
# Checking statistics of numerical variables using describe()
df_dm.describe()
PassengerId Survived Pclass Age SibSp Parch Fare count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000 mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208 std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429 min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000 25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400 50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200 75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000 max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
# Checking statistics of non-numerical (object-type) columns
df_dm.describe(include='O')
Name Sex Ticket Cabin Embarked count 891 891 891 204 889 unique 891 2 681 147 3 top Braund, Mr. Owen Harris male 347082 B96 B98 S freq 1 577 7 4 644
# Checking the unique values in a feature/column using .unique()
df_dm.Sex.unique() #or below one
df_dm['Sex'].unique()
Output: array(['male', 'female'], dtype=object)
# See all column names
df_dm.columns
Output: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
# Find the count of each unique value using .value_counts()
df_dm.Sex.value_counts() #or below one
df_dm['Sex'].value_counts()
Output: male 577 female 314 Name: Sex, dtype: int64
# getting a specific column from Data Frame
#df_dm.Survived # or below one
df_dm['Survived']
Output: 0 0 1 1 2 1 3 1 4 0 .. 886 0 887 1 888 0 889 1 890 0 Name: Survived, Length: 891, dtype: int64
# Displaying all rows
pd.set_option('display.max_rows',None)
# to go back to the default format (first 5 & last 5 rows)
pd.reset_option('display.max_rows')
# Displaying all columns
pd.set_option('display.max_columns',None)
# to go back to the default format
pd.reset_option('display.max_columns')
Filtering data with one or more conditions:
# & means 'and'; both conditions must be met for a row to be shown
df_dm.loc[(df_dm['Sex']=='female') & (df_dm['Age']==26)]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 290 291 1 1 Barber, Miss. Ellen "Nellie" female 26.0 0 0 19877 78.8500 NaN S 312 313 0 2 Lahtinen, Mrs. William (Anna Sylfven) female 26.0 1 1 250651 26.0000 NaN S 315 316 1 3 Nilsson, Miss. Helmina Josefina female 26.0 0 0 347470 7.8542 NaN S 617 618 0 3 Lobb, Mrs. William Arthur (Cordelia K Stanlick) female 26.0 1 0 A/5. 3336 16.1000 NaN S
# | means 'or'; a row is shown if either condition is fulfilled
df_dm.loc[(df_dm['Sex']=='female') | (df_dm['Age']==26)]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S 874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C 875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C 879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C 880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S 882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S 885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
# make a specific column the index
df_dm.set_index('Sex',inplace=True)
df_dm
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Sex male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN S female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN S female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN S
# Return to original format
df_dm.reset_index('Sex',inplace=True)
df_dm
Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked 0 male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN S 1 female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C 2 female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S 4 male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN S
Use of inplace:
- inplace=True makes the changes in the original DataFrame
- The default is inplace=False, which does not make any permanent change to the original DataFrame
#Creating sample dataframe
sample_df = df_dm[['Sex','Age']].head()
sample_df
Sex Age 0 male 22.0 1 female 38.0 2 female 26.0 3 female 35.0 4 male 35.0
# dropping one column; the original DataFrame remains the same
sample_df.drop('Sex',axis=1)
Output: Age 0 22.0 1 38.0 2 26.0 3 35.0 4 35.0
print(sample_df)
Output: Sex Age 0 male 22.0 1 female 38.0 2 female 26.0 3 female 35.0 4 male 35.0
sample_df.drop('Sex',axis=1,inplace=True)
print(sample_df)
Output: Age 0 22.0 1 38.0 2 26.0 3 35.0 4 35.0
# Creating new column
df_dm['Subject']='Data Science'
print(df_dm.head())
Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Subject 0 male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN S Data Science 1 female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C Data Science 2 female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Data Science 3 female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S Data Science 4 male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN S Data Science
# Dropping the column
df_dm.drop('Subject',axis=1,inplace=True)
print(df_dm.head())
Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked 0 male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN S 1 female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C 2 female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S 4 male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN S
What does axis mean?
- axis=0 refers to rows
- axis=1 refers to columns
A small aggregation sketch after the row-dropping examples below also illustrates axis.
# Dropping rows
df_dm.drop(1,axis=0) # dropping a single row
df_dm.drop([1,3,5],axis=0) # dropping multiple rows
#Note: 1, 3, 5 ... are index labels; with inplace=True the changes become permanent
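As a small illustration of the axis note above, here is a sketch using sum() on a made-up DataFrame (sum() is not used elsewhere in these notes; it is shown purely to illustrate the axis direction):
# Sketch: axis in aggregation (illustrative DataFrame)
demo = pd.DataFrame({'x':[1,2,3],'y':[10,20,30]})
demo.sum(axis=0) # aggregates down the rows, giving one result per column -> x 6, y 60
demo.sum(axis=1) # aggregates across the columns, giving one result per row -> 11, 22, 33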
# Keeping a backup copy by creating a new DataFrame
new_df_dm =df_dm.copy()
new_df_dm
Output: Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked 0 male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN S 1 female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C 2 female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S 4 male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN S
Note: Any changes to the original DataFrame will not affect the backup DataFrame.
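A quick sketch to confirm that note (the temporary column name 'Test2' is just an example):
# Sketch: the backup copy is independent of the original
df_dm['Test2'] = 0 # add a temporary column to the original DataFrame
'Test2' in new_df_dm.columns # False - the backup copy is unaffected
df_dm.drop('Test2',axis=1,inplace=True) # remove the temporary column again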
# Creating a new column from existing columns
df_dm['Test']=df_dm['Pclass']+df_dm['Age']
df_dm.head()
Output: Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Test 0 male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN S 25.0 1 female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C 39.0 2 female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 29.0 3 female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S 36.0 4 male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN S 38.0
Missing Value:
# Checking the missing values per column
df_dm.isnull().sum()
Output: PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 0 Embarked 2 dtype: int64
# Checking the missing values of a single column
df_dm.Age.isnull().sum()
Output: 177
# getting the row indexes where values are missing in a specific column, using np.where
missing_index=np.where(df_dm.Age.isnull()) #or below code
missing_index=np.where(df_dm.Age.isnull()==True)
print(missing_index)
Output: (array([ 5, 17, 19, 26, 28, 29, 31, 32, 36, 42, 45, 46, 47, 48, 55, 64, 65, 76, 77, 82, 87, 95, 101, 107, 109, 121, 126, 128, 140, 154, 158, 159, 166, 168, 176, 180, 181, 185, 186, 196, 198, 201, 214, 223, 229, 235, 240, 241, 250, 256, 260, 264, 270, 274, 277, 284, 295, 298, 300, 301, 303, 304, 306, 324, 330, 334, 335, 347, 351, 354, 358, 359, 364, 367, 368, 375, 384, 388, 409, 410, 411, 413, 415, 420, 425, 428, 431, 444, 451, 454, 457, 459, 464, 466, 468, 470, 475, 481, 485, 490, 495, 497, 502, 507, 511, 517, 522, 524, 527, 531, 533, 538, 547, 552, 557, 560, 563, 564, 568, 573, 578, 584, 589, 593, 596, 598, 601, 602, 611, 612, 613, 629, 633, 639, 643, 648, 650, 653, 656, 667, 669, 674, 680, 692, 697, 709, 711, 718, 727, 732, 738, 739, 740, 760, 766, 768, 773, 776, 778, 783, 790, 792, 793, 815, 825, 826, 828, 832, 837, 839, 846, 849, 859, 863, 868, 878, 888]),)
# Getting the actual rows from those indexes
print(df_dm.loc[missing_index].head())
Output: Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Test 5 male 6 0 3 Moran, Mr. James NaN 0 0 330877 8.4583 NaN Q NaN 17 male 18 1 2 Williams, Mr. Charles Eugene NaN 0 0 244373 13.0000 NaN S NaN 19 female 20 1 3 Masselmani, Mrs. Fatima NaN 0 0 2649 7.2250 NaN C NaN 26 male 27 0 3 Emir, Mr. Farred Chehab NaN 0 0 2631 7.2250 NaN C NaN 28 female 29 1 3 O'Dwyer, Miss. Ellen "Nellie" NaN 0 0 330959 7.8792 NaN Q NaN
# Getting the rows with missing values using the loc function
df_dm.loc[df_dm['Age'].isnull()] # or below
df_dm.loc[df_dm['Age'].isnull()==True]
Output: Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Test 5 male 6 0 3 Moran, Mr. James NaN 0 0 330877 8.4583 NaN Q NaN 17 male 18 1 2 Williams, Mr. Charles Eugene NaN 0 0 244373 13.0000 NaN S NaN 19 female 20 1 3 Masselmani, Mrs. Fatima NaN 0 0 2649 7.2250 NaN C NaN 26 male 27 0 3 Emir, Mr. Farred Chehab NaN 0 0 2631 7.2250 NaN C NaN 28 female 29 1 3 O'Dwyer, Miss. Ellen "Nellie" NaN 0 0 330959 7.8792 NaN Q NaN
Handling Missing Values:
- Imputation
- Dropping
# Use of fillna() with a specific number
df_dm['Age'].fillna(value=200,inplace=True) # inplace=True changes the original DataFrame
df_dm['Age']=df_dm['Age'].fillna(value=200) # or assign the result back; without inplace the original DataFrame is not modified
df_dm['Age']
Output: 0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 ... 886 27.0 887 19.0 888 200.0 889 26.0 890 32.0 Name: Age, Length: 891, dtype: float64
# Use of fillna() with a specific string
df_dm['Cabin'].fillna('Delux',inplace=True)
df_dm['Cabin'] = df_dm['Cabin'].fillna('Delux') # or assign the result back (do not combine assignment with inplace=True, since fillna then returns None)
df_dm['Cabin']
Output: 0 Delux 1 C85 2 Delux 3 C123 4 Delux ... 886 Delux 887 B42 888 Delux 889 C148 890 Delux Name: Cabin, Length: 891, dtype: object
How to choose a value to fill in missing places (NaN):
- For numerical data with a normal distribution, the mean can be used via .mean()
- For numerical data that is skewed (not normally distributed), the median can be used via .median()
- For categorical data, the mode can be used via .mode()
- For categorical data, forward fill (method='ffill') or backward fill (method='bfill') can also be used (see the sketch after this list)
- If the data set is big, dropna() can be used to remove the few rows containing nulls
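A minimal sketch of forward/backward fill, using the Embarked column only as an example (in newer pandas versions the same can be written as .ffill() / .bfill()):
# Sketch: forward fill / backward fill (Embarked is just an example column)
df_dm['Embarked'] = df_dm['Embarked'].fillna(method='ffill') # copy the previous non-null value forward
#df_dm['Embarked'] = df_dm['Embarked'].fillna(method='bfill') # or copy the next non-null value backward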
# Fill missing value using mean
# Mean = the average value (the sum of all values divided by number of values).
x = df_dm['Age'].mean() # Best practice
df_dm['Age'].fillna(x,inplace=True)
df_dm['Age'].fillna(df_dm['Age'].mean(),inplace=True)
df_dm['Age']=df_dm['Age'].fillna(df_dm['Age'].mean())
# Fill missing value using median
# Median = the value in the middle, after you have sorted all values ascending.
x = df_dm['Age'].median() # Best practice
df_dm['Age'].fillna(x,inplace=True)
df_dm['Age'].fillna(df_dm['Age'].median(),inplace=True)
df_dm['Age']=df_dm['Age'].fillna(df_dm['Age'].median())
# Fill missing value using mode
# Mode = The value that appears most frequently
x = df_dm['Cabin'].mode()[0] # Best practice
df_dm['Cabin'].fillna(x,inplace=True)
df_dm['Cabin'].fillna(df_dm['Cabin'].mode()[0],inplace=True) # there may be more than one mode; [0] selects the first mode value
df_dm['Cabin']=df_dm['Cabin'].fillna(df_dm['Cabin'].mode()[0])
# Use of dropna() to remove rows containing null values
#df_dm.dropna(inplace=True) # or below
df_dm_new = df_dm.dropna() # to create new DataFrame removing null values
# Using the loc function to impute missing values in a specific column
df_dm.loc[df_dm['Age'].isnull(),'Age']=30 # the boolean mask selects only the rows where Age is null; 'Age' restricts the assignment to the Age column
Duplicate rows
df_dm.duplicated().sum()
Output: 0
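The Titanic data has no duplicate rows, so here is a tiny made-up DataFrame to illustrate what duplicated() flags; drop_duplicates() is not used elsewhere in these notes and is shown only for completeness:
# Sketch: duplicated() flags repeated rows, drop_duplicates() removes them (illustrative data)
dup_demo = pd.DataFrame({'Name':['Alex','Bob','Alex'],'Age':[10,12,10]})
dup_demo.duplicated() # False, False, True - the third row repeats the first
dup_demo.drop_duplicates() # keeps only the first occurrence of each row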
Sorting the DataFrame
df_dm.head()
Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Test 0 male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 Delux S 25.0 1 female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C 39.0 2 female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 Delux S 29.0 3 female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S 36.0 4 male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 Delux S 38.0
# arranging / sorting
df_dm.sort_values('PassengerId') # default is ascending=True
df_dm.sort_values('PassengerId',ascending=False) # descending order; the index order will be disturbed
df_dm.sort_values('PassengerId',ascending=False, ignore_index=True) # ignore_index=True resets the index to 0,1,2,... in the new order
Output: Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Test 0 male 891 0 3 Dooley, Mr. Patrick 32.0 0 0 370376 7.7500 Delux Q 35.0 1 male 890 1 1 Behr, Mr. Karl Howell 26.0 0 0 111369 30.0000 C148 C 27.0 2 female 889 0 3 Johnston, Miss. Catherine Helen "Carrie" 200.0 1 2 W./C. 6607 23.4500 Delux S NaN 3 female 888 1 1 Graham, Miss. Margaret Edith 19.0 0 0 112053 30.0000 B42 S 20.0 4 male 887 0 2 Montvila, Rev. Juozas 27.0 0 0 211536 13.0000 Delux S 29.0 5 female 886 0 3 Rice, Mrs. William (Margaret Norton) 39.0 0 5 382652 29.1250 Delux Q 42.0 6 male 885 0 3 Sutehall, Mr. Henry Jr 25.0 0 0 SOTON/OQ 392076 7.0500 Delux S 28.0 7 male 884 0 2 Banfield, Mr. Frederick James 28.0 0 0 C.A./SOTON 34068 10.5000 Delux S 30.0 8 female 883 0 3 Dahlberg, Miss. Gerda Ulrika 22.0 0 0 7552 10.5167 Delux S 25.0 9 male 882 0 3 Markun, Mr. Johann 33.0 0 0 349257 7.8958 Delux S 36.0
# Sorting by multiple columns
df_dm.sort_values(by=['PassengerId','Pclass'],ascending=[1,1]) # sort by the first column, then sort ties using the next column
Output: Sex PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked Test 0 male 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 Delux S 25.0 1 female 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 PC 17599 71.2833 C85 C 39.0 2 female 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 Delux S 29.0 3 female 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S 36.0 4 male 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 Delux S 38.0 5 male 6 0 3 Moran, Mr. James 200.0 0 0 330877 8.4583 Delux Q NaN 6 male 7 0 1 McCarthy, Mr. Timothy J 54.0 0 0 17463 51.8625 E46 S 55.0 7 male 8 0 3 Palsson, Master. Gosta Leonard 2.0 3 1 349909 21.0750 Delux S 5.0 8 female 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27.0 0 2 347742 11.1333 Delux S 30.0 9 female 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 14.0 1 0 237736 30.0708 Delux C 16.0
Grouping the data
df_dm.groupby(df_dm['Pclass']).count() # shows the class-wise count of non-null values in every other column
df_dm.groupby(df_dm['Pclass']).mean() # shows the class-wise mean of every other numerical column (newer pandas versions need numeric_only=True)
df_dm.groupby(df_dm['Pclass'])['Fare'].max() # shows the class-wise maximum Fare
df_dm.groupby(df_dm['Pclass'])['Fare'].min() # shows the class-wise minimum Fare
df_dm.groupby(df_dm['Pclass']).describe() # shows class-wise statistics for every other numerical feature
df_dm.groupby(df_dm['Pclass']).agg({'Fare':'max'}) # agg() shows the class-wise maximum Fare
df_dm.groupby(df_dm['Pclass']).Age.agg(['max','min','count','median','mean']) # agg() can apply several aggregation functions at once
Output: max min count median mean Pclass 1 200.0 0.92 216 40.0 60.701019 2 200.0 0.67 184 30.0 40.047989 3 200.0 0.42 491 30.0 73.574175
- Run each line yourself to see its result
Changing Data type
# See data type to change
df_dm.info()
Output: <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
# Changing data type to reduce size
df_dm['Fare']=df_dm['Fare'].astype('float16') # float64 changed to float16 to reduce size
# Converting to the smallest suitable numerical data type using downcast
df_dm['Fare']=pd.to_numeric(df_dm['Fare'],downcast='float') # the first argument is the object to convert; downcast specifies the smallest numerical type family to convert to
df_dm['Parch']=pd.to_numeric(df_dm['Parch'],downcast='float')
# let's see the difference in data type for Fare & Parch
df_dm.info()
Output: <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null float32 8 Ticket 891 non-null object 9 Fare 891 non-null float32 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float32(2), float64(1), int64(4), object(5) memory usage: 76.7+ KB
Joining DataFrames
# Creating DataFrame
df1=pd.DataFrame({'A':['A0','A1','A2','A3','A4','A5'],
'B':['B0','B1','B2','B3','B4','B5'],
'C':['C0','C1','C2','C3','C4','C5'],
'D':['D0','D1','D2','D3','D4','D5']})
df1
Output: A B C D 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3 4 A4 B4 C4 D4 5 A5 B5 C5 D5
# Creating second dataframe
df2=pd.DataFrame({'A':['A6','A7','A8','A9','A10','A11'],
'B':['B6','B7','B8','B9','B10','B11'],
'C':['C6','C7','C8','C9','C10','C11']})
df2
Output: A B C 0 A6 B6 C6 1 A7 B7 C7 2 A8 B8 C8 3 A9 B9 C9 4 A10 B10 C10 5 A11 B11 C11
# Joining DataFrames using pd.concat()
df3 = pd.concat([df1,df2],axis=0, ignore_index=True) # join row-wise (stack vertically); ignore_index=True renumbers the rows 0,1,2,...
df3
Output: A B C D 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 NaN 7 A7 B7 C7 NaN 8 A8 B8 C8 NaN 9 A9 B9 C9 NaN 10 A10 B10 C10 NaN 11 A11 B11 C11 NaN
df4 = pd.concat([df1,df2],axis=1, ignore_index=True) # join column-wise (side by side); ignore_index=True replaces the column names with 0,1,2,...
df4
Output: 0 1 2 3 4 5 6 0 A0 B0 C0 D0 A6 B6 C6 1 A1 B1 C1 D1 A7 B7 C7 2 A2 B2 C2 D2 A8 B8 C8 3 A3 B3 C3 D3 A9 B9 C9 4 A4 B4 C4 D4 A10 B10 C10 5 A5 B5 C5 D5 A11 B11 C11