Python and pandas offers great functions for programmers and data science. Once of this functions is cumsum which can be used with pandas groups in order to find the cumulative sum in a group.
Previous article about pandas and groups: Python and Pandas group by and sum
Video tutorial on the article: Python/Pandas cumulative sum per group
Let's see a simple example of pandas.cumsum() and what is the main idea behind this function explained for beginners:
import pandas as pd
import numpy as np
# define a simple pandas series
s = pd.Series([2, np.nan, 5, -1, 0])
print(s)
# cumulative sum of the previous series
print(s.cumsum())
# cumulative sum of first ten numbers
x = range(0, 10)
s = pd.Series(x).cumsum()
print(s)
result:
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64
0 2.0
1 NaN
2 7.0
3 6.0
4 6.0
dtype: float64
0 0
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
So the cumulative sum is calculating the sum for each member up to the current one. This is extremely useful for data science and many business areas. The first result is the output of our number series. The second one we calculate the cumulative sum for this series - as you can see np.NaN ( similar to None but optimized for pandas needs) doesn't break the sum. You can change this behavior by:
s.cumsum(skipna=False)
which will result in:
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
Now lets check a more complicated examples including grouping on pandas dataframes on one and more columns. Having this data:
- table 1
Country | Company | Date | Sells | |
---|---|---|---|---|
0 | India | Samsung | 10/09/18 | 15 |
1 | India | Samsung | 10/09/18 | 81 |
2 | USA | Samsung | 10/09/18 | 29 |
3 | France | Samsung | 10/10/18 | 33 |
4 | India | Samsung | 10/10/18 | 21 |
5 | India | LG | 10/10/18 | 42 |
6 | Germany | LG | 10/09/18 | 67 |
7 | USA | LG | 10/09/18 | 35 |
8 | Brazil | LG | 10/09/18 | 2 |
9 | Tony | LG | 10/10/18 | 34 |
10 | India | Sony | 10/10/18 | 21 |
11 | Germany | Sony | 10/10/18 | 50 |
12 | India | Sony | 10/10/18 | 10 |
13 | India | Sony | 10/10/18 | 26 |
14 | Brazil | Sony | 10/10/18 | 53 |
table 2
Company | Date | Sells | Promo | |
---|---|---|---|---|
0 | Samsung | 08/09/2018 | 11 | 54 |
1 | Samsung | 10/09/2018 | 8 | 56 |
2 | Samsung | 09/09/2018 | 43 | 54 |
3 | Samsung | 04/09/2018 | 4 | 4 |
4 | Sony | 08/09/2018 | 45 | 3 |
5 | Sony | 10/09/2018 | 534 | 34 |
6 | Sony | 09/09/2018 | 42 | 32 |
7 | Sony | 04/09/2018 | 32 | 15 |
If we want to calculate the cumulative sum of Promotions(Promo) per Company and for every date we can do:
import pandas as pd
df = pd.DataFrame({'Date': ['08/09/2018', '10/09/2018', '09/09/2018', '04/09/2018', '08/09/2018', '10/09/2018',
'09/09/2018', '04/09/2018'],
'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Sony', 'Sony', 'Sony', 'Sony'],
'Promotions': [54, 56, 54, 4, 3, 34, 32, 15]})
print(df)
print (df.groupby(by=['Company','Date']).sum().groupby(level=[0]).cumsum())
result:
Promotions | ||
---|---|---|
Company | Date | |
Samsung | 4/9/2018 | 4 |
8/9/2018 | 58 | |
9/9/2018 | 112 | |
10/9/2018 | 168 | |
Sony | 4/9/2018 | 15 |
8/9/2018 | 18 | |
9/9/2018 | 50 | |
10/9/2018 | 84 |
depending on your needs you can change the grouping or the levels. Lets check another example this time with grouping on 3 columns from the pandas dataframe:
import pandas as pd
df = pd.DataFrame({'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Samsung',
'LG', 'LG', 'LG', 'LG', 'LG', 'Sony', 'Sony', 'Sony',
'Sony', 'Sony'],
'Date': ['10/9/2018', '10/9/2018', '10/9/2018', '10/10/2018',
'10/10/2018', '10/10/2018', '10/9/2018', '10/9/2018',
'10/9/2018', '10/10/2018', '10/10/2018', '10/10/2018',
'10/10/2018', '10/10/2018', '10/10/2018'],
'Country': ['India', 'India', 'USA', 'France', 'India', 'India',
'Germany', 'USA', 'Brazil', 'Brazil', 'India', 'Germany',
'India', 'India', 'Brazil'],
'Sells': [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53]})
print(df.groupby(['Country', 'Company', 'Date']).sum().groupby(level=[0]).cumsum())
you can find the result of this execution below:
Sells | |||
---|---|---|---|
Country | Company | Date | |
Brazil | LG | 10/10/2018 | 34 |
10/9/2018 | 36 | ||
Sony | 10/10/2018 | 89 | |
France | Samsung | 10/10/2018 | 33 |
Germany | LG | 10/9/2018 | 67 |
Sony | 10/10/2018 | 117 | |
India | LG | 10/10/2018 | 42 |
Samsung | 10/10/2018 | 63 | |
10/9/2018 | 159 | ||
Sony | 10/10/2018 | 216 | |
USA | LG | 10/9/2018 | 35 |
Samsung | 10/9/2018 | 64 |
as you can see the Date column is not sorted as a date but as a string. Because it's stored as a string.
If you want to convert string dates to normal dates with Python pandas then you can first convert your column from string to date and after that do the same operation:
import pandas as pd
df = pd.DataFrame({'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Samsung',
'LG', 'LG', 'LG', 'LG', 'LG', 'Sony', 'Sony', 'Sony',
'Sony', 'Sony'],
'Date': ['10/9/2018', '10/9/2018', '10/9/2018', '10/10/2018',
'10/10/2018', '10/10/2018', '10/9/2018', '10/9/2018',
'10/9/2018', '10/10/2018', '10/10/2018', '10/10/2018',
'10/10/2018', '10/10/2018', '10/10/2018'],
'Country': ['India', 'India', 'USA', 'France', 'India', 'India',
'Germany', 'USA', 'Brazil', 'Brazil', 'India', 'Germany',
'India', 'India', 'Brazil'],
'Sells': [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53]})
df['Date'] = pd.to_datetime(df['Date'])
print(df.groupby(['Country', 'Company', 'Date']).sum().groupby(level=[0]).cumsum())
result:
Sells | |||
---|---|---|---|
Country | Company | Date | |
Brazil | LG | 10/9/2018 | 2 |
10/10/2018 | 36 | ||
Sony | 10/10/2018 | 89 | |
France | Samsung | 10/10/2018 | 33 |
Germany | LG | 10/9/2018 | 67 |
Sony | 10/10/2018 | 117 | |
India | LG | 10/10/2018 | 42 |
Samsung | 10/9/2018 | 138 | |
10/10/2018 | 159 | ||
Sony | 10/10/2018 | 216 | |
USA | LG | 10/9/2018 | 35 |
Samsung | 10/9/2018 | 64 |