Flexible Aggregation - Python

This post shows a way to aggregate data in a very flexible way. All you need is to modify to your use-case. Feel free to like and leave comment below. Good luck!

# import libraries
import pandas as pd
import numpy as np

# create sample dataset
data = pd.DataFrame({'name': ['Tom', 'Ali', 'Dammy', 'Dave', 'Ayo', 'Kenny', 'Jamil', 'Eri', 'Navee', 'Dan'],
                     'ranking': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1], 
                     'exercise_score': [12, 34, 56, 78, 90, 98, 76, 54, 32, 18]})

# sort dataframe by rank
data.sort_values('ranking', inplace=True)

# reset index
data.reset_index(drop=True, inplace=True)
print(data)

# aggregation function
def agg_func(df):
    return (
        df.ranking.iat[0],
        df.exercise_score.min(),
        df.exercise_score.max(),
        df.exercise_score.mean(),
        df.exercise_score.median()
    )

# apply aggregation function to condition
output = pd.DataFrame(zip(*data.groupby(['ranking', (data.ranking != data.ranking.shift()).cumsum()]).apply(agg_func))).T

# assign column names
output.columns = ['ranking', 'min_score', 'max_score', 'mean_score', 'median_score']
print(" ")
print(output)

"""
Some notes:
The following can be modified depending on use-case:
1. 'ranking' as used indicates a common class to data points. e.g. stock name, company name, product type etc.
2. The ranking in data.ranking != data.ranking.shift() can be modified to some condition like minute or hour in a datetime column format, true or false, high or low, or some other column conditions.
"""