Often with Python and Pandas you import data from outside - CSV, JSON etc - and the data format could be different from the one you expect. For example dates and numbers can come as strings. This cause problems when you need to group and sort by this values stored as strings instead of a their correct type. Fortunately pandas offers quick and easy way of converting dataframe columns. In this article we can see how date stored as a string is converted to pandas date.
You can see previous posts about pandas here:
Below is the code example which is used for this conversion:
df['Date'] = pd.to_datetime(df['Date'])
or you can see/use only the converted date of a single column by:
print(pd.to_datetime(df['Date']))
import pandas as pd
df = pd.DataFrame({'Company': ['Samsung', 'Samsung', 'Samsung', 'Samsung', 'Samsung',
'LG', 'LG', 'LG', 'LG', 'LG', 'Sony', 'Sony', 'Sony',
'Sony', 'Sony'],
'Date': ['10/9/2018', '10/9/2018', '10/9/2018', '10/10/2018',
'10/10/2018', '10/10/2018', '10/9/2018', '10/9/2018',
'10/9/2018', '10/10/2018', '10/10/2018', '10/10/2018',
'10/10/2018', '10/10/2018', '10/10/2018'],
'Country': ['India', 'India', 'USA', 'France', 'India', 'India',
'Germany', 'USA', 'Brazil', 'Brazil', 'India', 'Germany',
'India', 'India', 'Brazil'],
'Sells': [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53]})
print(df.groupby(['Country', 'Date']).sum().groupby(level=[0]).cumsum())
print(pd.to_datetime(df['Date']))
df['Date'] = pd.to_datetime(df['Date'])
print(df.groupby(['Country', 'Date']).sum().groupby(level=[0]).cumsum())
print(df.groupby(['Country', 'Date']).sum().groupby(level=[0]).groups)
result:
- before convert of Date column
Sells | ||
---|---|---|
Country | Date | |
Brazil | 10/10/2018 | 87 |
10/9/2018 | 89 | |
France | 10/10/2018 | 33 |
Germany | 10/10/2018 | 50 |
10/9/2018 | 117 | |
India | 10/10/2018 | 120 |
10/9/2018 | 216 | |
USA | 10/9/2018 | 64 |
- after convert of Date column
Country | Date | |
---|---|---|
Brazil | 10/9/2018 | 2 |
10/10/2018 | 89 | |
France | 10/10/2018 | 33 |
Germany | 10/9/2018 | 67 |
10/10/2018 | 117 | |
India | 10/9/2018 | 96 |
10/10/2018 | 216 | |
USA | 10/9/2018 | 64 |
Now the column date is properly formatted and the sum is based on sorted data in ascending order. Seeing the converted dates only:
0 2018-10-09
1 2018-10-09
2 2018-10-09
3 2018-10-10
4 2018-10-10
5 2018-10-10
6 2018-10-09
7 2018-10-09
8 2018-10-09
9 2018-10-10
10 2018-10-10
11 2018-10-10
12 2018-10-10
13 2018-10-10
14 2018-10-10
If you want to see only the groups than you can use the last line of code which product:
{'USA': MultiIndex(levels=[['Brazil', 'France', 'Germany', 'India', 'USA'], [2018-10-09 00:00:00, 2018-10-10 00:00:00]],
labels=[[4], [0]],
names=['Country', 'Date']), 'Brazil': MultiIndex(levels=[['Brazil', 'France', 'Germany', 'India', 'USA'], [2018-10-09 00:00:00, 2018-10-10 00:00:00]],
labels=[[0, 0], [0, 1]],
names=['Country', 'Date']), 'France': MultiIndex(levels=[['Brazil', 'France', 'Germany', 'India', 'USA'], [2018-10-09 00:00:00, 2018-10-10 00:00:00]],
labels=[[1], [1]],
names=['Country', 'Date']), 'Germany': MultiIndex(levels=[['Brazil', 'France', 'Germany', 'India', 'USA'], [2018-10-09 00:00:00, 2018-10-10 00:00:00]],
labels=[[2, 2], [0, 1]],
names=['Country', 'Date']), 'India': MultiIndex(levels=[['Brazil', 'France', 'Germany', 'India', 'USA'], [2018-10-09 00:00:00, 2018-10-10 00:00:00]],
labels=[[3, 3], [0, 1]],
names=['Country', 'Date'])}