FindingData

Important functions of pandas - Data Aggregation

Important functions of pandas - Data Aggregation

23 Nov, 2020


Before reading this post, i would suggest you to read Data Transformation first.

Data Aggregation

img

Data aggregation plays a very important role in data analytics. Pandas provides many ways to perform data aggregation. Here I orgranised some functions that you must know as basics.

we will understand each function one by one...

concat : concatinate two dataframes

  • concat() function concatinates two different dataframes.
  • dataframes can be concatinated vertically as well as horizontally.
  • If you are not feeling well with indexs after concatinating dataframes, just recall reset_index() function.
python
df1 = pd.DataFrame({'id': [1, 2, 3],
'name': ['Sumit', 'Aalit', 'Bob']})
df2 = pd.DataFrame({'id': [4, 5],
'name': ['AMan', 'Krish']})
pd.concat([df1, df2])

img

merge : merge two dataframes based on a column

  • merge() joins two tables like we do in SQL.
  • we need to specify which column that is used to join on with on='id' and specify how the two dataframes are joined.
python
df1 = pd.DataFrame({'id': [1, 2, 3],
'name': ['Sumit', 'Aalit', 'Bob']})
df2 = pd.DataFrame({'id': [1, 2, 3],
'Sex': ['M', 'M', 'M']})
pd.merge(df1, df2, on='id', how='inner')

img

groupby() AND groupby().agg()

These two functions would be better demonstrated together, as the function df.groupby() cannot produce meaningful results by itself. It has to be used together with other functions that apply to the groups, which I believe df.groupby().agg() is the most common one.

python
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Chris', 'David', 'Ella'],
'language': ['Python', 'Java', 'Python', 'COBOL', 'Java'],
'age': [21, 32, 31, 68, 29]})
df.groupby('language').agg({'name': 'count', 'age': 'mean'})

img

Use_groupby()_with_describe()_for_group_summary_statistics
df.groupby('language').agg({'name': 'count', 'age': 'mean'}).describe()

img

There are many aggregate functions to use :

  • Frequency / Counts size() count()
  • Central Tendency mean() median()
  • Variance std() var
  • min() max() sum() prod() quantile()

pivot_table : groupby().agg() by pivot_table

For the above example, we can also implement it using the pd.pivot_table() function. That is, we need to specify the group keys and the measure values as follows

python
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Chris', 'David', 'Ella'],
'language': ['Python', 'Java', 'Python', 'COBOL', 'Java'],
'age': [21, 32, 31, 68, 29]})
pd.pivot_table(df,
values=['name', 'age'],
index=['language'],
aggfunc={'name': 'count', 'age': ['min', 'max', 'mean']})

img

Again, the Pandas library of Python has a lot more functions that makes it such a flexible and powerful data analytics tool in Python. In this post, I just organised the basic ones for data Aggregation that I believe are the most useful.

Edit this page on GitHub