What is Pandas:

  • Pandas is a Python library for working with data sets

  • It has functions for analyzing, cleaning, exploring, and manipulating data

  • The name “Pandas” refers to both “Panel Data” and “Python Data Analysis”.

  • It was created by Wes McKinney in 2008

Why Use Pandas?

  • Pandas allows us to analyze large data sets and draw conclusions based on statistical theory

  • Pandas can clean messy data sets, and make them readable and relevant.

Installation of Pandas:

! pip install pandas

Import Pandas:

# Use keyword 'import'

import pandas

import pandas as pd  # pd is the common alias for pandas

# In Python, an alias is an alternate name for referring to the same thing

Checking Pandas Version:

# The version string is stored in the __version__ attribute

pd.__version__
Output:
1.3.5

Pandas Series:

  • A Pandas Series is like a column in a table

  • It is a one-dimensional array holding data of any type

Creating Series:

# Creating empty series

s1 = pd.Series()
s1
Output:
Series([], dtype: float64)
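The dtype shown above for an empty Series depends on the pandas version: 1.x releases report float64 (with a deprecation warning), while 2.x releases report object. A minimal sketch of one way to avoid the ambiguity is to pass dtype explicitly; the variable names here are only illustrative.

```python
import pandas as pd

# Passing dtype explicitly gives the same result on every pandas version
empty_float = pd.Series(dtype="float64")
empty_text = pd.Series(dtype="object")

print(empty_float.dtype)  # float64
print(len(empty_float))   # 0
```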
# Creating series with one element

s2 = pd.Series(18)
s2
Output:
0    18
dtype: int64
# Creating series using tuple
t=(12,23,45)

s3 = pd.Series(t)
s3
Output:
0    12
1    23
2    45
dtype: int64
# Creating series using list

l = [12,34,67,89]

s4 = pd.Series([12,34,67,89])
#s4 = pd.Series(l)   # both ways are correct
s4
Output:
0    12
1    34
2    67
3    89
dtype: int64
# Creating series with array

import numpy as np

arr=np.array([13,21,53,54])

s5 = pd.Series(arr)
s5
Output:
0    13
1    21
2    53
3    54
dtype: int64
# Creating series using dictionary

d = {'a':23,'b':54,'c':76,'d':76}
s6 = pd.Series(d)
s6

#Note: The keys of the dictionary become the labels.
Output:
a    23
b    54
c    76
d    76
dtype: int64
# Creating series from a dictionary, keeping only the labels listed in index
d = {'a':23,'b':54,'c':76,'d':76}
s6 = pd.Series(d,index=['b','d'])
s6
Output:
b    54
d    76
dtype: int64
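When the index list names a label that is not a key in the dictionary, pandas keeps the label and fills its value with NaN. The sketch below uses a hypothetical label 'z' to show this; note that the dtype becomes float64 because NaN is a floating-point value.

```python
import pandas as pd

d = {'a': 23, 'b': 54, 'c': 76, 'd': 76}

# 'z' is a hypothetical label that is not a key in d
s_missing = pd.Series(d, index=['b', 'z'])

print(s_missing['b'])            # 54.0
print(pd.isna(s_missing['z']))   # True
print(s_missing.dtype)           # float64
```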
# Adding index value/Create Labels

arr2 = np.array([23,34,45])

s7=pd.Series(arr2,index=['One','Two','Three'])
s7
Output:
One      23
Two      34
Three    45
dtype: int64
# Checking data type

type(s7)
Output:
pandas.core.series.Series

Accessing Data:

# Accessing by index number

s5[2]
# s5.2 is a syntax error; attribute-style access does not work with numeric labels
Output:
53
# Accessing data by index label; either form below can be used
s7.Two
s7['Two']
Output:
34
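Square brackets can mean either a label or a position, which becomes ambiguous once a Series has custom labels (recent pandas versions warn when an integer key is treated as a position on a label-indexed Series). A small sketch of the explicit alternatives, .loc for labels and .iloc for positions, on a Series rebuilt here so the example is self-contained:

```python
import pandas as pd

s = pd.Series([23, 34, 45], index=['One', 'Two', 'Three'])

by_label = s.loc['Two']     # label-based lookup
by_position = s.iloc[1]     # position-based lookup (0-based)

print(by_label)      # 34
print(by_position)   # 34
```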

Slicing Operation:

# Creating Series for slicing purposes

d=pd.Series([1,2,3,5,47,98,7,8,6,32,78,2,8,289,258,78])


# Slicing by index range

d[2:10]
Output:
2     3
3     5
4    47
5    98
6     7
7     8
8     6
9    32
dtype: int64
# Selecting multiple elements with a list of index numbers

d[[3,5,7,9]]
Output:
3     5
5    98
7     8
9    32
dtype: int64
# Changing elements through indexing and slicing

d[[3,5,7,9]]=1000
d[0:2]=2000
d
Output:
0     2000
1     2000
2        3
3     1000
4       47
5     1000
6        7
7     1000
8        6
9     1000
10      78
11       2
12       8
13     289
14     258
15      78
dtype: int64

DataFrames:

  • A DataFrame is a two-dimensional data structure, like a 2-dimensional array or a table with rows & columns

  • Data sets in Pandas are usually stored as two-dimensional tables, called DataFrames

  • A Series is like a single column; a DataFrame is the whole table
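The relationship between the two structures can be checked directly: selecting one column of a DataFrame returns a Series. A minimal sketch with a small illustrative table:

```python
import pandas as pd

table = pd.DataFrame({"Calories": [450, 210], "Duration": [43, 23]})
column = table["Calories"]   # selecting a single column

print(type(column).__name__)  # Series
print(table.shape)            # (2, 2)  -> rows, columns
print(column.shape)           # (2,)    -> one dimension only
```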

Creating DataFrame:

# Creating empty DataFrame

df = pd.DataFrame()
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
# Creating DataFrame from dictionary

data={"Calories":[450,210,410,321],"Duration":[43,23,56,43]}
df1= pd.DataFrame(data)
df1
Output:
     Calories  Duration
0	450	43
1	210	23
2	410	56
3	321	43
# Creating DataFrame from nested list

data= [[20,32,13,14],[32,"a",31,24],[7.3,41,13]]

df2 = pd.DataFrame(data)
df2
Output:
	0	1	2	3
0	20.0	32	13	14.0
1	32.0	a	31	24.0
2	7.3	41	13	NaN
# Use the index parameter to label the rows

df3 = pd.DataFrame(data,index=['a','b','c'])
df3
Output:

         0	 1	 2	 3
a	20.0	32	13	14.0
b	32.0	a	31	24.0
c	7.3	41	13	NaN
# Use the columns parameter to name the columns

df4 = pd.DataFrame(data,index=['a','b','c'],columns=['col1','col2','col3','col4'])
df4
Output:

        col1	col2	col3	col4
a	20.0	32	13	14.0
b	32.0	a	31	24.0
c	7.3	41	13	NaN
# Change the column names using the rename() function
# The two calls below are equivalent, so only one of them is needed

df4.rename({'col1':'DK_Wt','col2':'KH_Wt','col3':'BR_wt','col4':'CH_Wt'},axis=1,inplace=True)

df4.rename(columns={'col1':'DK_Wt','col2':'KH_Wt','col3':'BR_wt','col4':'CH_Wt'},inplace=True)
df4
Output:
	DK_Wt	KH_Wt	BR_wt	CH_Wt
a	20.0	32	13	14.0
b	32.0	a	31	24.0
c	7.3	41	13	NaN
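The calls above use inplace=True, which changes df4 itself. Without it, rename() returns a new DataFrame and leaves the original untouched. A short sketch with a small illustrative frame (df_small and renamed are names chosen for this example):

```python
import pandas as pd

df_small = pd.DataFrame([[20, 32], [7.3, 41]], columns=['col1', 'col2'])

# Without inplace=True the result must be captured in a variable
renamed = df_small.rename(columns={'col1': 'DK_Wt', 'col2': 'KH_Wt'})

print(list(df_small.columns))  # ['col1', 'col2']
print(list(renamed.columns))   # ['DK_Wt', 'KH_Wt']
```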

Accessing & modifying DataFrame elements:

# Creating a DataFrame for the next examples
data=[['ALex',10,'Maths'],['Bob',12,'Science'],['Kelly',15,'Eco'],
      ['Boris',14,'Geo'],['Ken',18,'English']]

df=pd.DataFrame(data,columns=['Name','Age','Subject'])
df
Output:

        Name	Age	Subject
0	ALex	10	Maths
1	Bob	12	Science
2	Kelly	15	Eco
3	Boris	14	Geo
4	Ken	18	English
# Accessing one single column

#df['Age']
df.Age
Output:
0    10
1    12
2    15
3    14
4    18
Name: Age, dtype: int64
# Accessing several columns

df[['Age','Subject']] # Double square brackets are needed for more than one column
Output:

        Age	Subject
0	10	Maths
1	12	Science
2	15	Eco
3	14	Geo
4	18	English
# Accessing rows using iloc & index number

df.iloc[1:3]
Output:

        Name	Age	Subject
1	Bob	12	Science
2	Kelly	15	Eco
# Accessing specific columns

df.iloc[:,[1]]
Output:
	Age
0	10
1	12
2	15
3	14
4	18
# Accessing multiple columns

df.iloc[:,[2,0]]
Output:

        Subject	Name
0	Maths	ALex
1	Science	Bob
2	Eco	Kelly
3	Geo	Boris
4	English	Ken
# Accessing multiple rows & columns

df.iloc[[1,3,4],[2,0]]
Output:
	Subject	Name
1	Science	Bob
3	Geo	Boris
4	English	Ken
# Accessing specific rows & columns , slicing of DataFrame

df.iloc[1:3,1:2]
Output:

        Age
1	12
2	15
# Accessing single element from DataFrame

df.iloc[3,2]
Output:
Geo
# Accessing DataFrame elements using loc (label-based; the end of the slice is included)

df.loc[1:3,['Subject','Name']]
Output:

        Subject	Name
1	Science	Bob
2	Eco	Kelly
3	Geo	Boris
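The outputs above show a difference that is easy to miss: a loc slice includes its end label, while an iloc slice stops before its end position. A minimal sketch on the same data, rebuilt here as people so the example runs on its own:

```python
import pandas as pd

people = pd.DataFrame(
    [['ALex', 10, 'Maths'], ['Bob', 12, 'Science'], ['Kelly', 15, 'Eco'],
     ['Boris', 14, 'Geo'], ['Ken', 18, 'English']],
    columns=['Name', 'Age', 'Subject'])

by_label = people.loc[1:3]      # labels 1, 2 and 3 (end included)
by_position = people.iloc[1:3]  # positions 1 and 2 (end excluded)

print(len(by_label))     # 3
print(len(by_position))  # 2
```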
# Modifying data inside the DataFrame

df.iloc[:,1]=df.iloc[:,1] + 8 # arithmetic operators such as +, -, * can be used
df
Output:

        Name	Age	Subject
0	ALex	18	Maths
1	Bob	20	Science
2	Kelly	23	Eco
3	Boris	22	Geo
4	Ken	26	English
# Modifying string elements inside the DataFrame

df.iloc[:2,0]=['Masud','Rana']
df
Output:

        Name	Age	Subject
0	Masud	18	Maths
1	Rana	20	Science
2	Kelly	23	Eco
3	Boris	22	Geo
4	Ken	26	English
# Set every value in a specific column

df['Subject']='French'
df['Subject']
Output:
0    French
1    French
2    French
3    French
4    French
Name: Subject, dtype: object
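The assignment above overwrites every row of the column. To change only some rows, one common approach is to combine loc with a boolean condition. The sketch below uses a small illustrative frame and an arbitrary age threshold of 15:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Kelly', 'Boris', 'Ken'],
                       'Age': [15, 14, 18],
                       'Subject': ['Eco', 'Geo', 'English']})

# Only rows where Age is at least 15 get the new value
people.loc[people['Age'] >= 15, 'Subject'] = 'French'

print(people['Subject'].tolist())  # ['French', 'Geo', 'French']
```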

Load/Read Files into a DataFrame:

  • A simple way to store big data sets is to use CSV files

  • CSV stands for comma-separated values

# reading the csv file

df_rd = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.csv')
df_rd
Output:

        a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
# reading csv file without header

df_rd_nh = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.csv',header=None)
df_rd_nh
Output:
	0	1	2	3	4
0	a	b	c	d	message
1	1	2	3	4	hello
2	5	6	7	8	world
3	9	10	11	12	foo
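The file paths above point to the author's Google Drive, so they will not exist on another machine. A self-contained sketch of the same header behaviour can be run by feeding read_csv an in-memory string through io.StringIO; the text below mirrors the contents shown in the outputs above.

```python
import io
import pandas as pd

csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Default: the first line is used as the column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)  # (3, 5)
print(no_header.shape)    # (4, 5)
```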
# Show the present working directory (IPython/Jupyter magic command)
%pwd
Output:
/content
# The r prefix marks the path as a raw string, so backslashes are not treated as escape characters
# Instead of the r prefix, each backslash can be doubled (\\)

df = pd.read_csv(r'C:\Users\username\Documents\data.csv')

df = pd.read_csv('C:\\Users\\username\\Documents\\data.csv')
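That the two spellings describe the same path can be verified without touching the file system. The sketch below compares the strings and, as an optional extra, uses pathlib's PureWindowsPath to pull out the file name; the path itself is the placeholder from the example above.

```python
from pathlib import PureWindowsPath

raw_path = r'C:\Users\username\Documents\data.csv'
escaped_path = 'C:\\Users\\username\\Documents\\data.csv'

print(raw_path == escaped_path)          # True
print(PureWindowsPath(raw_path).name)    # data.csv
```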

# Reading a tab-separated (TSV) file: either read_csv with sep='\t' or read_table works

#df_rd01=pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/test.tsv',sep='\t')

df_rd01=pd.read_table('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/test.tsv')
df_rd01
Output:
	test	test	test.1	test.2	test.3
Data	Data	Data	Data	Data	Data
Science	Science	Science	Science	Science	Science
# Using separator

df_rd02 = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-04-Pandas/Examples for file reading-02/ex7.csv',sep='*')
df_rd02
Output:

        a	b	c
0	1	2	3
1	1	2	3
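The sep parameter can be tried out without any files by reading from in-memory text. A minimal sketch, with made-up sample strings, covering both the '*' separator used above and a tab separator:

```python
import io
import pandas as pd

star_text = "a*b*c\n1*2*3\n1*2*3\n"
tab_text = "a\tb\tc\n1\t2\t3\n"

star_df = pd.read_csv(io.StringIO(star_text), sep='*')
tab_df = pd.read_csv(io.StringIO(tab_text), sep='\t')

print(star_df.shape)  # (2, 3)
print(tab_df.shape)   # (1, 3)
```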
# reading html data

df_html = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2015.html')

df_html[2]  # read_html returns a list of DataFrames; indexing selects one table from the page
Output:

The output is too large to show here; please run the code in your own environment (PyCharm, Visual Studio Code, Jupyter Notebook, Google Colab, etc.)
# Reading an Excel file
# sheet_name specifies which sheet (tab) of the Excel file to read

df_ex= pd.read_excel('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/ex1.xlsx',sheet_name='XYA')
df_ex
Output:
	1	2	3	3.1
0	4	5	6	9
1	7	8	5	2
# reading json file

df_json = pd.read_json('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/example.json')
df_json
Output:

        a	b	c	d
0	1	2	3	NaN
1	4	5	6	10.0
2	7	8	9	NaN

JSON File:

  • JSON stands for JavaScript Object Notation
  • It is a text format for storing & transporting data.
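As a rough illustration of the format, the sketch below builds a small JSON string (a list of records, invented here to resemble the output above) and reads it with read_json. Keys missing from a record become NaN in the resulting DataFrame.

```python
import io
import pandas as pd

# Hypothetical JSON text: a list of records, only one of which has key "d"
json_text = ('[{"a": 1, "b": 2, "c": 3},'
             ' {"a": 4, "b": 5, "c": 6, "d": 10},'
             ' {"a": 7, "b": 8, "c": 9}]')

df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_json.shape)                   # (3, 4)
print(pd.isna(df_from_json.loc[0, 'd']))    # True
```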
# Saving data using .to_csv

df_sv=df_html[2] # df_sv refers to the same DataFrame as df_html[2]; use .copy() for an independent copy

df_sv.to_csv('htmldata.csv')
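By default to_csv also writes the row index as an extra first column, which then shows up as an unnamed column when the file is read back. A small sketch, writing to an in-memory buffer instead of a file, of how index=False avoids that (scores is an illustrative frame):

```python
import io
import pandas as pd

scores = pd.DataFrame({'Name': ['Bob', 'Kelly'], 'Age': [12, 15]})

buffer = io.StringIO()
scores.to_csv(buffer, index=False)   # leave the row index out of the file
csv_text = buffer.getvalue()

restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text.splitlines()[0])   # Name,Age
print(list(restored.columns))     # ['Name', 'Age']
```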

# Reading a CSV with a hierarchical (multi-level) row index

df_mind = pd.read_csv('/content/drive/MyDrive/Data Science/Data Mites/Class Notes _ Material/Class notes/CDS-03-Pandas/Examples for file reading-02/csv_mindex.csv',index_col=['key1','key2'])
df_mind
Output:

           value1	value2
key1	key2		
one	a	1	2
        b	3	4
        c	5	6
        d	7	8
two	a	9	10
        b	11	12
        c	13	14
        d	15	16
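Once two columns are used as the index, rows can be selected by the outer level alone or by a pair of labels. A minimal sketch on a shortened, in-memory version of the data shown above:

```python
import io
import pandas as pd

csv_text = ("key1,key2,value1,value2\n"
            "one,a,1,2\n"
            "one,b,3,4\n"
            "two,a,9,10\n"
            "two,b,11,12\n")

df_multi = pd.read_csv(io.StringIO(csv_text), index_col=['key1', 'key2'])

outer = df_multi.loc['one']                   # all rows under key1 == 'one'
cell = df_multi.loc[('two', 'b'), 'value2']   # a single value

print(df_multi.index.nlevels)  # 2
print(outer.shape)             # (2, 2)
print(cell)                    # 12
```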

How to check whether a dataset contains categorical features and list their column names:

df_html[2].select_dtypes(include=['object','category'])  # To see values

df_html[2].select_dtypes(include=['object','category']).columns # To see only column name
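Because the calls above depend on a table downloaded from a website, here is a self-contained sketch of the same idea on a small invented frame. The Name column is given dtype object explicitly so the result does not depend on how a particular pandas version stores strings.

```python
import pandas as pd

mixed = pd.DataFrame({
    'Name': pd.Series(['Bob', 'Kelly'], dtype='object'),
    'Age': [12, 15],
    'Grade': pd.Categorical(['A', 'B']),
})

categorical_cols = mixed.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = mixed.select_dtypes(include='number').columns.tolist()

print(categorical_cols)  # ['Name', 'Grade']
print(numeric_cols)      # ['Age']
```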

Data Manipulation:

import pandas as pd
import numpy as np
# Creating a DataFrame for the next examples

df_dm = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-04-Pandas/Pandas Class/train.csv')
# Checking the first 5 rows

df_dm.head()
Output:
	PassengerId	Survived	Pclass	Name	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Sex											
male	1	0	3	Braund, Mr. Owen Harris	22.0	1	0	A/5 21171	7.2500	NaN	S
female	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0	1	0	PC 17599	71.2833	C85	C
female	3	1	3	Heikkinen, Miss. Laina	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
female	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0	1	0	113803	53.1000	C123	S
male	5	0	3	Allen, Mr. William Henry	35.0	0	0	373450	8.0500	NaN	S

I am an enthusiastic advocate for the transformative power of data in the fashion realm. Armed with a strong background in data science, I am committed to revolutionizing the industry by unlocking valuable insights, optimizing processes, and fostering a data-centric culture that propels fashion businesses into a successful and forward-thinking future. - Masud Rana, Certified Data Scientist, IABAC

© Data4Fashion 2023-2024
