Python cumulative sum per group with pandas

Python and pandas offers great functions for programmers and data science. Once of this functions is cumsum which can be used with pandas groups in order to find the cumulative sum in a group.

Previous article about pandas and groups: Python and Pandas group by and sum

Video tutorial on the article: Python/Pandas cumulative sum per group

Let's see a simple example of pandas.cumsum() and what is the main idea behind this function explained for beginners:

import pandas as pd
import numpy as np

# define a simple pandas series
s = pd.Series([2, np.nan, 5, -1, 0])


print(s)
# cumulative sum of the previous series
print(s.cumsum())

# cumulative sum of first ten numbers
x = range(0, 10)

s = pd.Series(x).cumsum()
print(s)

result:

0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64
0     0
1     1
2     3
3     6
4    10
5    15
6    21
7    28
8    36
9    45

So the cumulative sum is calculating the sum for each member up to the current one. This is extremely useful for data science and many business areas. The first result is the output of our number series. The second one we calculate the cumulative sum for this series - as you can see np.NaN ( similar to None but optimized for pandas needs) doesn't break the sum. You can change this behavior by:

s.cumsum(skipna=False)

which will result in:

0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

Now lets check a more complicated examples including grouping on pandas dataframes on one and more columns. Having this data:

  • table 1
Country Company Date Sells
0 India Samsung 10/09/18 15
1 India Samsung 10/09/18 81
2 USA Samsung 10/09/18 29
3 France Samsung 10/10/18 33
4 India Samsung 10/10/18 21
5 India LG 10/10/18 42
6 Germany LG 10/09/18 67
7 USA LG 10/09/18 35
8 Brazil LG 10/09/18 2
9 Tony LG 10/10/18 34
10 India Sony 10/10/18 21
11 Germany Sony 10/10/18 50
12 India Sony 10/10/18 10
13 India Sony 10/10/18 26
14 Brazil Sony 10/10/18 53

table 2

Company Date Sells Promo
0 Samsung 08/09/2018 11 54
1 Samsung 10/09/2018 8 56
2 Samsung 09/09/2018 43 54
3 Samsung 04/09/2018 4 4
4 Sony 08/09/2018 45 3
5 Sony 10/09/2018 534 34
6 Sony 09/09/2018 42 32
7 Sony 04/09/2018 32 15

If we want to calculate the cumulative sum of Promotions(Promo) per Company and for every date we can do:

import pandas as pd
df = pd.DataFrame({'Date': ['08/09/2018', '10/09/2018', '09/09/2018', '04/09/2018', '08/09/2018', '10/09/2018',
                            '09/09/2018', '04/09/2018'],
                   'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Sony', 'Sony', 'Sony', 'Sony'],
                   'Promotions': [54, 56, 54, 4, 3, 34, 32, 15]})

print(df)
print (df.groupby(by=['Company','Date']).sum().groupby(level=[0]).cumsum())

result:

Promotions
Company Date
Samsung 4/9/2018 4
8/9/2018 58
9/9/2018 112
10/9/2018 168
Sony 4/9/2018 15
8/9/2018 18
9/9/2018 50
10/9/2018 84

depending on your needs you can change the grouping or the levels. Lets check another example this time with grouping on 3 columns from the pandas dataframe:

import pandas as pd

df = pd.DataFrame({'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Samsung',
                               'LG', 'LG', 'LG', 'LG', 'LG', 'Sony', 'Sony', 'Sony',
                               'Sony', 'Sony'],
                   'Date': ['10/9/2018', '10/9/2018', '10/9/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/9/2018', '10/9/2018',
                            '10/9/2018', '10/10/2018', '10/10/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/10/2018'],
                   'Country': ['India', 'India', 'USA', 'France', 'India', 'India',
                               'Germany', 'USA', 'Brazil', 'Brazil', 'India', 'Germany',
                               'India', 'India', 'Brazil'],
                   'Sells': [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53]})

print(df.groupby(['Country', 'Company', 'Date']).sum().groupby(level=[0]).cumsum())

you can find the result of this execution below:

Sells
Country Company Date
Brazil LG 10/10/2018 34
10/9/2018 36
Sony 10/10/2018 89
France Samsung 10/10/2018 33
Germany LG 10/9/2018 67
Sony 10/10/2018 117
India LG 10/10/2018 42
Samsung 10/10/2018 63
10/9/2018 159
Sony 10/10/2018 216
USA LG 10/9/2018 35
Samsung 10/9/2018 64

as you can see the Date column is not sorted as a date but as a string. Because it's stored as a string.

If you want to convert string dates to normal dates with Python pandas then you can first convert your column from string to date and after that do the same operation:

import pandas as pd

df = pd.DataFrame({'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Samsung',
                               'LG', 'LG', 'LG', 'LG', 'LG', 'Sony', 'Sony', 'Sony',
                               'Sony', 'Sony'],
                   'Date': ['10/9/2018', '10/9/2018', '10/9/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/9/2018', '10/9/2018',
                            '10/9/2018', '10/10/2018', '10/10/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/10/2018'],
                   'Country': ['India', 'India', 'USA', 'France', 'India', 'India',
                               'Germany', 'USA', 'Brazil', 'Brazil', 'India', 'Germany',
                               'India', 'India', 'Brazil'],
                   'Sells': [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53]})

df['Date'] = pd.to_datetime(df['Date'])

print(df.groupby(['Country', 'Company', 'Date']).sum().groupby(level=[0]).cumsum())

result:

Sells
Country Company Date
Brazil LG 10/9/2018 2
10/10/2018 36
Sony 10/10/2018 89
France Samsung 10/10/2018 33
Germany LG 10/9/2018 67
Sony 10/10/2018 117
India LG 10/10/2018 42
Samsung 10/9/2018 138
10/10/2018 159
Sony 10/10/2018 216
USA LG 10/9/2018 35
Samsung 10/9/2018 64

Related Article