Python cumulative sum per group with pandas

Python and pandas offers great functions for programmers and data science. Once of this functions is cumsum which can be used with pandas groups in order to find the cumulative sum in a group.

Previous article about pandas and groups: Python and Pandas group by and sum

Video tutorial on the article: Python/Pandas cumulative sum per group

Let's see a simple example of pandas.cumsum() and what is the main idea behind this function explained for beginners:

import pandas as pd
import numpy as np

# define a simple pandas series
s = pd.Series([2, np.nan, 5, -1, 0])


print(s)
# cumulative sum of the previous series
print(s.cumsum())

# cumulative sum of first ten numbers
x = range(0, 10)

s = pd.Series(x).cumsum()
print(s)

result:

0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64
0     0
1     1
2     3
3     6
4    10
5    15
6    21
7    28
8    36
9    45

So the cumulative sum is calculating the sum for each member up to the current one. This is extremely useful for data science and many business areas. The first result is the output of our number series. The second one we calculate the cumulative sum for this series - as you can see np.NaN ( similar to None but optimized for pandas needs) doesn't break the sum. You can change this behavior by:

s.cumsum(skipna=False)

which will result in:

0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

Now lets check a more complicated examples including grouping on pandas dataframes on one and more columns. Having this data:

table 1

	Country	Company	Date	Sells
0	India	Samsung	10/09/18	15
1	India	Samsung	10/09/18	81
2	USA	Samsung	10/09/18	29
3	France	Samsung	10/10/18	33
4	India	Samsung	10/10/18	21
5	India	LG	10/10/18	42
6	Germany	LG	10/09/18	67
7	USA	LG	10/09/18	35
8	Brazil	LG	10/09/18	2
9	Tony	LG	10/10/18	34
10	India	Sony	10/10/18	21
11	Germany	Sony	10/10/18	50
12	India	Sony	10/10/18	10
13	India	Sony	10/10/18	26
14	Brazil	Sony	10/10/18	53

table 2

	Company	Date	Sells	Promo
0	Samsung	08/09/2018	11	54
1	Samsung	10/09/2018	8	56
2	Samsung	09/09/2018	43	54
3	Samsung	04/09/2018	4	4
4	Sony	08/09/2018	45	3
5	Sony	10/09/2018	534	34
6	Sony	09/09/2018	42	32
7	Sony	04/09/2018	32	15

If we want to calculate the cumulative sum of Promotions(Promo) per Company and for every date we can do:

import pandas as pd
df = pd.DataFrame({'Date': ['08/09/2018', '10/09/2018', '09/09/2018', '04/09/2018', '08/09/2018', '10/09/2018',
                            '09/09/2018', '04/09/2018'],
                   'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Sony', 'Sony', 'Sony', 'Sony'],
                   'Promotions': [54, 56, 54, 4, 3, 34, 32, 15]})

print(df)
print (df.groupby(by=['Company','Date']).sum().groupby(level=[0]).cumsum())

result:

		Promotions
Company	Date
Samsung	4/9/2018	4
	8/9/2018	58
	9/9/2018	112
	10/9/2018	168
Sony	4/9/2018	15
	8/9/2018	18
	9/9/2018	50
	10/9/2018	84

depending on your needs you can change the grouping or the levels. Lets check another example this time with grouping on 3 columns from the pandas dataframe:

import pandas as pd

df = pd.DataFrame({'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Samsung',
                               'LG', 'LG', 'LG', 'LG', 'LG', 'Sony', 'Sony', 'Sony',
                               'Sony', 'Sony'],
                   'Date': ['10/9/2018', '10/9/2018', '10/9/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/9/2018', '10/9/2018',
                            '10/9/2018', '10/10/2018', '10/10/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/10/2018'],
                   'Country': ['India', 'India', 'USA', 'France', 'India', 'India',
                               'Germany', 'USA', 'Brazil', 'Brazil', 'India', 'Germany',
                               'India', 'India', 'Brazil'],
                   'Sells': [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53]})

print(df.groupby(['Country', 'Company', 'Date']).sum().groupby(level=[0]).cumsum())

you can find the result of this execution below:

			Sells
Country	Company	Date
Brazil	LG	10/10/2018	34
		10/9/2018	36
	Sony	10/10/2018	89
France	Samsung	10/10/2018	33
Germany	LG	10/9/2018	67
	Sony	10/10/2018	117
India	LG	10/10/2018	42
	Samsung	10/10/2018	63
		10/9/2018	159
	Sony	10/10/2018	216
USA	LG	10/9/2018	35
	Samsung	10/9/2018	64

as you can see the Date column is not sorted as a date but as a string. Because it's stored as a string.

If you want to convert string dates to normal dates with Python pandas then you can first convert your column from string to date and after that do the same operation:

import pandas as pd

df = pd.DataFrame({'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Samsung',
                               'LG', 'LG', 'LG', 'LG', 'LG', 'Sony', 'Sony', 'Sony',
                               'Sony', 'Sony'],
                   'Date': ['10/9/2018', '10/9/2018', '10/9/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/9/2018', '10/9/2018',
                            '10/9/2018', '10/10/2018', '10/10/2018', '10/10/2018',
                            '10/10/2018', '10/10/2018', '10/10/2018'],
                   'Country': ['India', 'India', 'USA', 'France', 'India', 'India',
                               'Germany', 'USA', 'Brazil', 'Brazil', 'India', 'Germany',
                               'India', 'India', 'Brazil'],
                   'Sells': [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53]})

df['Date'] = pd.to_datetime(df['Date'])

print(df.groupby(['Country', 'Company', 'Date']).sum().groupby(level=[0]).cumsum())

result:

			Sells
Country	Company	Date
Brazil	LG	10/9/2018	2
		10/10/2018	36
	Sony	10/10/2018	89
France	Samsung	10/10/2018	33
Germany	LG	10/9/2018	67
	Sony	10/10/2018	117
India	LG	10/10/2018	42
	Samsung	10/9/2018	138
		10/10/2018	159
	Sony	10/10/2018	216
USA	LG	10/9/2018	35
	Samsung	10/9/2018	64

> Python Basics

> Advanced Tutorials

> Python Errors

> Pandas Advanced

> Pandas Count

> Pandas Column

> Pandas Basics

> Pandas DataFrame

> Pandas Row

> User Interface

> Advanced

> Troubleshoot

> Video & Sound

> Linux Commands

> MySQL

> SQL Basics

> Python

> DB apps

> JupyterLab

> Jupyter Tips

> Jupyter Display

> Regex in Text Editor

> Regex Basics

> Regex Match

> Regex Date

> PyCharm Advanced

> Git and PyCharm

> PyCharm Error

> PyCharm Tips

> Linux Mint Applications

> VIrtual Machine

> Miscellaneous

> Java

> Automation

> Windows

> Office

> Cheat Sheet