FindingData

Important functions of pandas - Data Preview


Nov 22, 2020

Before reading this, I suggest you read the previous post, Intro and data I/O with pandas.


As one of the most powerful Python libraries for data manipulation, pandas is a must-learn library for data I/O, data cleaning, data transformation, data aggregation and a lot more. And yes, pandas is open source.

This post does not cover every function that pandas provides, only the ones we use most often while dealing with data. So what are we waiting for? Let's go through them one by one.

Notation

df <--> dataframe
columns <--> features
rows <--> records
function <--> method

Create DataFrame

DataFrame is a data structure of pandas. If your data is a DataFrame, then you can apply all the functions that pandas provides.

import pandas as pd
df = pd.DataFrame({
"id":[1,2,3,4,5,6,7],
"Name":['sumit', 'lalit', 'bob', 'vineet', 'pankaj', 'mehek', 'kajal']
})
df


Load the dataset

Download Dataset from here

import pandas as pd
df = pd.read_csv("titanic.csv")


Data preview

Before working on a problem, we usually want to know what the problem looks like. Similarly, while working with a dataset, we want a general picture of the data and want to confirm that it was actually loaded into a pandas dataframe. To do that, we need to know a few functions.

head : preview the first n rows (default=5)

head shows the top n records of your dataset. The default is n=5, but you can always specify how many rows to show, for example df.head(10) to show the top 10 records.

df.head()
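To make the n argument concrete, here is a minimal sketch on a toy dataframe. The `toy` name and its values are made up for this example and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy dataframe with 8 records, used only for illustration
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [10, 20, 30, 40, 50, 60, 70, 80],
})

first_five = toy.head()     # no argument: falls back to the default n=5
first_three = toy.head(3)   # explicit n: only the top 3 records

print(first_five)
print(first_three)
```

head returns a new dataframe, so you can store the preview in a variable as done here.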


tail : preview the last n rows (default=5)

tail is similar to head, but it shows the last n records of the dataset. You can specify how many rows to show, for example df.tail(10) to show the last 10 records. If you don't specify anything, it shows the last 5 records by default.

df.tail()
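A small sketch of the same idea for tail, again on a made-up `toy` dataframe. Note that the rows keep their original index labels, which is a quick way to see where in the dataset they came from.

```python
import pandas as pd

# Hypothetical toy dataframe with 8 records (index labels 0..7)
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8]})

last_three = toy.tail(3)   # the last 3 records, in their original order

print(last_three)
print(list(last_three.index))   # original index labels are preserved
```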


info : display basic info of data

info displays the basic information about your dataframe, such as:

  • the data type of each feature
  • the number of rows and columns of the DataFrame
  • the non-null count for each feature
  • memory usage
df.info()
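info prints its report instead of returning it. If you want the report as text, or just one of the pieces listed above, something like the following sketch should work. The `toy` dataframe is made up for the example, and `buf` is the info argument that redirects the output.

```python
import io
import pandas as pd

# Hypothetical toy dataframe with one missing name
toy = pd.DataFrame({"id": [1, 2, 3], "Name": ["sumit", None, "bob"]})

# Capture the printed report into a string buffer instead of stdout
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The non-null count per feature can also be computed directly
non_null_counts = toy.notna().sum()
print(non_null_counts)
```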


sort_values : sort by a specific column

sort_values sorts the dataframe by the values of the given column, in ascending or descending order.

  • If the column holds text, the rows are arranged in alphabetical order; if it holds numbers, they are sorted numerically.
  • Set inplace=True if you want the dataframe itself to be sorted. Otherwise sort_values returns a new sorted dataframe and your original dataframe is not changed.
df.sort_values(by='Name', ascending=True, inplace=True)
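The difference between the two modes is easy to miss, so here is a small sketch on a made-up `toy` dataframe: first the default behaviour, which returns a sorted copy, then inplace=True, which changes the dataframe itself.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"Name": ["sumit", "bob", "lalit"], "id": [1, 3, 2]})

# Default: a new sorted dataframe is returned, toy is left untouched
sorted_copy = toy.sort_values(by="Name")
order_after_copy = list(toy["Name"])

# inplace=True: toy itself is reordered and the call returns None
toy.sort_values(by="id", ascending=False, inplace=True)

print(sorted_copy)
print(toy)
```

Assigning the result back, as in `toy = toy.sort_values(by="id")`, has the same effect as inplace=True and is often easier to read.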


columns : display all the columns

columns is a dataframe property, not a function. It displays all the columns (features) that your dataframe has.

df.columns
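Since columns is not a plain list, it can help to convert it when you need one. A minimal sketch, with a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"id": [1, 2], "Name": ["sumit", "lalit"]})

column_names = list(toy.columns)      # plain Python list of column names
has_name = "Name" in toy.columns      # membership test works directly
has_age = "Age" in toy.columns

print(column_names, has_name, has_age)
```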


dtypes : display datatypes of columns

dtypes is a dataframe property, not a function. It displays the datatype of each feature, which is very helpful when you want to check the datatypes of all the columns (features) at once.

df.dtypes
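The result of dtypes is itself a pandas Series indexed by column name, so you can look up a single feature. A small sketch on a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"Fare": [7.25, 71.28], "Name": ["sumit", "lalit"]})

column_types = toy.dtypes             # Series: column name -> datatype
fare_type = str(column_types["Fare"])  # datatype of a single feature

print(column_types)
print(fare_type)
```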


shape : display the shape of dataframe

shape is also not a function. It is a property of the dataframe that gives the count of rows and columns in (R, C) format, where R is the total number of rows (records) and C is the number of columns (features).

df.shape
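Because shape is a plain tuple, you can unpack it into two variables. The sketch below, on a made-up `toy` dataframe, also shows what happens if you call it like a function by mistake.

```python
import pandas as pd

# Hypothetical toy dataframe: 3 records, 2 features
toy = pd.DataFrame({"id": [1, 2, 3], "Name": ["sumit", "lalit", "bob"]})

n_rows, n_cols = toy.shape   # unpack (R, C)
print(n_rows, n_cols)

# shape is a property, so adding parentheses tries to call a tuple and fails
try:
    toy.shape()
    call_failed = False
except TypeError:
    call_failed = True

print(call_failed)
```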


describe : show basic stats of each column

describe() displays the basic stats for each numerical feature. It gives you a basic picture of the data distribution.

By default it only considers numerical features and ignores categorical and other features.

df.describe()
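If you also want a summary of the non-numerical features, describe accepts an include argument. A minimal sketch on a made-up `toy` dataframe, comparing the default with include="all":

```python
import pandas as pd

# Hypothetical toy dataframe with one numerical and one categorical feature
toy = pd.DataFrame({
    "Age": [22, 38, 26],
    "Sex": ["male", "female", "female"],
})

numeric_summary = toy.describe()            # default: numerical features only
full_summary = toy.describe(include="all")  # every feature, with extra rows

print(numeric_summary)
print(full_summary)
```

With include="all" the summary gains rows such as unique, top and freq for the text feature, and the cells that do not apply to a feature are filled with NaN.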


value_counts : count occurrences of each value in a column

  • value_counts() is most often applied to a pandas series, that is, to a single column, rather than to a whole dataframe.
  • It counts the occurrences of each value in the series, whether numerical or categorical.
df.Sex.value_counts()
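Two arguments of value_counts are worth knowing: normalize=True returns shares instead of counts, and dropna=False also counts missing values. A small sketch on a made-up series (the values are not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical series with one missing value
sex = pd.Series(["male", "female", "male", None, "male"], name="Sex")

counts = sex.value_counts()                       # missing values are skipped
shares = sex.value_counts(normalize=True)         # fractions of non-missing values
counts_with_missing = sex.value_counts(dropna=False)  # missing values get their own row

print(counts)
print(shares)
print(counts_with_missing)
```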


unique : display each category of a column

unique() displays all the unique categories of a single column. It is handy for a categorical feature, to see whether you can handle it by creating a simple dictionary and mapping the values through it (we will discuss it later).

df['Sex'].unique()
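As a preview of the dictionary idea mentioned above, here is a minimal sketch. The `toy` dataframe, the `mapping` dictionary and the `Sex_code` column name are assumptions made for this example, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Step 1: look at the categories (returned in order of first appearance)
categories = toy["Sex"].unique()
print(categories)

# Step 2: with only two categories, a hand-written dictionary is enough
mapping = {"male": 0, "female": 1}   # assumed encoding, chosen arbitrarily
toy["Sex_code"] = toy["Sex"].map(mapping)

print(toy)
```

If unique() had shown dozens of categories, writing the dictionary by hand would stop being practical, which is exactly what this check is for.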


select_dtypes : separate datatypes

Sometimes you need to handle numerical, categorical, and datetime features separately. You can split the features by type with select_dtypes, whose include argument takes a list of datatypes.

num = df.select_dtypes(include=['int64', 'float64']).columns
cat = df.select_dtypes(include=['object']).columns
dt = df.select_dtypes(include=['datetime64[ns]']).columns
print("Numerical Columns :",num)
print("Categorical Columns :",cat)
print("datetime Columns :",dt)
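Listing 'int64' and 'float64' explicitly misses other numerical types such as 'int32'. As a possible alternative, select_dtypes also understands the generic names "number" and "datetime", and it has an exclude argument. A sketch on a made-up `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy dataframe with numerical, text and datetime features
toy = pd.DataFrame({
    "Age": [22, 38],
    "Fare": [7.25, 71.28],
    "Name": ["sumit", "lalit"],
    "Joined": pd.to_datetime(["2020-01-01", "2020-02-01"]),
})

num_cols = toy.select_dtypes(include="number").columns.tolist()
dt_cols = toy.select_dtypes(include="datetime").columns.tolist()
# Everything that is neither numerical nor datetime
other_cols = toy.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print("Numerical Columns :", num_cols)
print("datetime Columns :", dt_cols)
print("Other Columns :", other_cols)
```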


Indeed, pandas has many more functions that make it such a flexible and powerful data analytics tool in Python. In this post, I have only organised the basic ones that I believe are the most useful.
