FindingData

Important functions of pandas - Data Cleansing

Nov 23, 2020

Data Cleansing

Before reading this post, i would suggest you to read Data Preview first.

We have covered about pandas, pandas dataframe, pandas series and most common functions of data preview that pandas provides in out previous posts.

It is quite common that the raw dataset we got is not perfect. So always need to clean our dataset before use. Here are some related functions that pandas have

We will work on same dataset

isna : filter NULL values from the dataframe

isna() returns a series of boolean values representing whether the Column record is NULL or not.

df.isna()

isna().sum() returns sum count of Null values present in each column. So u can get a rought idea about how many values are Null.

df.isna().sum()

as we can see Sex column have 177 NuLL values. let's drop them

We can drop all the rows having NULL values

python

df=df[~df['Sex'].isna()]                        
df.shape  # (714, 12)

~ sign at the beginning reversed the boolean value, since we want to reserve the rows that NOT having NULL values. Finally, the data frame df will be filtered by this boolean series, where the row with “False” boolean value will be discarded.

Remember our dataset shape was 891x12 and now it is 714x12.

dropna : drop null values from entire dataset

if we want to drop null values from entire dataset. we can use dropna() function.
dropna() filter all the rows having at least one null values. we can set the threshold for number of null values.
we can use isna() method but we will need to repeat many times for every column.
It is not a good idea to drop all the null values when there is a lot of null values according to our data.

df = df.dropna()
df.isna().sum()

fillna : fill null values

fillna() fills the null values something that we will provide.
In this example, we will fill null values with Missing string.

df.fillna()

drop_duplicates : drop duplicates rows

Sometimes the raw dataset may have some duplicated rows which we don't want.

python

df = pd.DataFrame({
    "id":[1,2,3,4,5,6,7],
    "Name":['sumit', 'lalit', 'bob', 'vineet', 'pankaj', 'mehek', 'sumit']
})
df.drop_duplicates()

In this example dataframe, sumit occurs two times so drop_duplicates will drop one sumit

drop : drop a list of columns

Suppose we want to some some columns that we dont want like _Id or the column that almost have all the null values, drop() will do it.
we can pass a list of multiple columns if we want to drop more than one column.
This function can be used to drop rows.

python

df = pd.DataFrame({
    "id":[1,2,3,4,5,6,7],
    "Name":['sumit', 'lalit', 'bob', 'vineet', 'pankaj', 'mehek', 'sumit']
})
df.drop(columns=['id'], axis=1) # axis=1 : drop column wise

rename : rename all the column names

rename() helped us to rename the columns headers.
Note : rename takes a dictionary as the parameter to map the columns.where keys are the old headers and values are the new headers.

python

df = pd.DataFrame({
    "id":[1,2,3,4,5,6,7],
    "Name":['sumit', 'lalit', 'bob', 'vineet', 'pankaj', 'mehek', 'sumit']
})
df.rename(columns={'id':'_id', 'Name':'Full_Name'})

reset_index : reset indexes

When we drop some rows that having null values then indexes are also reseted because some rows were deleted in the middle. So to reset the indexes reset_index function is used.

df.reset_index()

Indeed, the Pandas library of Python has a lot more functions that makes it such a flexible and powerful data analytics tool in Python. In this post, I just organised the basic ones for data cleansing that I believe are the most useful.

Edit this page on GitHub

FindingData

Important functions of pandas - Data Cleansing

Important functions of pandas - Data Cleansing

Data Cleansing

We will work on same dataset

isna : filter NULL values from the dataframe

dropna : drop null values from entire dataset

fillna : fill null values

drop_duplicates : drop duplicates rows

drop : drop a list of columns

rename : rename all the column names

reset_index : reset indexes

Recent Posts

On this page