Pandas Tutorial : How to split dataframe by string or date
In this article you will find 3 different examples about how to split a dataframe into new dataframes based on a column. The examples are:
- How to split dataframe on a month basis
- How to split dataframe per year
- Split dataframe on a string column
- References
Pandas: How to split dataframe on a month basis
You can see the dataframe on the picture below. Initially the columns: "day", "mm", "year" don't exists. We are going to split the dataframe into several groups depending on the month. For that purpose we are splitting column date into day, month and year. After that we will group on the month column. Finally we are printing the output dataframes:
import pandas as pd
df = pd.DataFrame(
{
"Company": [
"Samsung", "Samsung", "Samsung", "Samsung", "Samsung", "LG", "LG", "LG", "LG", "LG", "Sony", "Sony", "Sony",
"Sony", "Sony",
],
"Date": [
"10/9/2015", "10/9/2015", "10/9/2017", "10/10/2017", "10/10/2017", "10/10/2018", "10/9/2018", "10/9/2018",
"10/9/2018", "10/10/2016", "10/10/2016", "10/10/2016", "10/10/2019", "10/10/2019", "10/10/2019",
],
"Country": [
"India", "India", "USA", "France", "India", "India", "Germany", "USA", "Brazil", "Brazil", "India",
"Germany", "India", "India", "Brazil",
],
"Sells": [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53],
}
)
df[["day", "mm", "year"]] = df["Date"].str.split("/", expand=True)
agg = df.groupby(['mm'])
for group in agg:
print(group)
result:
Company | Date | Country | Sells day | mm | year | ||
---|---|---|---|---|---|---|---|
3 | Samsung | 10/10/2017 | France | 33 | 10 | 10 | 2017 |
4 | Samsung | 10/10/2017 | India | 21 | 10 | 10 | 2017 |
5 | LG | 10/10/2018 | India | 42 | 10 | 10 | 2018 |
9 | LG | 10/10/2016 | Brazil | 34 | 10 | 10 | 2016 |
10 | Sony | 10/10/2016 | India | 21 | 10 | 10 | 2016 |
11 | Sony | 10/10/2016 | Germany | 50 | 10 | 10 | 2016 |
12 | Sony | 10/10/2019 | India | 10 | 10 | 10 | 2019 |
13 | Sony | 10/10/2019 | India | 26 | 10 | 10 | 2019 |
14 | Sony | 10/10/2019 | Brazil | 53 | 10 | 10 | 2019 |
Company | Date | Country | Sells day mm | year | |||
0 | Samsung | 10/9/2015 | India | 15 | 10 | 9 | 2015 |
1 | Samsung | 10/9/2015 | India | 81 | 10 | 9 | 2015 |
2 | Samsung | 10/9/2017 | USA | 29 | 10 | 9 | 2017 |
6 | LG | 10/9/2018 | Germany | 67 | 10 | 9 | 2018 |
7 | LG | 10/9/2018 | USA | 35 | 10 | 9 | 2018 |
8 | LG | 10/9/2018 | Brazil | 2 | 10 | 9 | 2018 |
Pandas: How to split dataframe per year
This time we will use different approach in order to achieve similar behavior. First we will use lambda in order to convert the string into date. Then we are extracting the periods. The final part is to group by the extracted years:
import dateutil
import pandas as pd
df = pd.DataFrame(
{
"Company": [
"Samsung", "Samsung", "Samsung", "Samsung", "Samsung", "LG", "LG", "LG", "LG", "LG", "Sony", "Sony", "Sony",
"Sony", "Sony",
],
"Date": [
"10/9/2015", "10/9/2015", "10/9/2017", "10/10/2017", "10/10/2017", "10/10/2018", "10/9/2018", "10/9/2018",
"10/9/2018", "10/10/2016", "10/10/2016", "10/10/2016", "10/10/2019", "10/10/2019", "10/10/2019",
],
"Country": [
"India", "India", "USA", "France", "India", "India", "Germany", "USA", "Brazil", "Brazil", "India",
"Germany", "India", "India", "Brazil",
],
"Sells": [15, 81, 29, 33, 21, 42, 67, 35, 2, 34, 21, 50, 10, 26, 53],
}
)
df["Date"] = df["Date"].apply(lambda x: dateutil.parser.parse(x))
year = df["Date"].dt.to_period("Y")
agg = df.groupby([year])
for group in agg:
print(group)
result:
Company | Date Country | Sells | day | mm | year | |
---|---|---|---|---|---|---|
0 | Samsung 2015-10-09 | India | 15 | 10 | 9 | 2015 |
1 | Samsung 2015-10-09 | India | 81 | 10 | 9 | 2015 |
Company | Date | Country | Sells day | mm | year | |
9 | LG 2016-10-10 | Brazil | 34 | 10 | 10 | 2016 |
10 | Sony 2016-10-10 | India | 21 | 10 | 10 | 2016 |
11 | Sony 2016-10-10 | Germany | 50 | 10 | 10 | 2016 |
Company | Date Country | Sells day | mm | year | ||
2 | Samsung 2017-10-09 | USA | 29 | 10 | 9 | 2017 |
3 | Samsung 2017-10-10 | France | 33 | 10 | 10 | 2017 |
4 | Samsung 2017-10-10 | India | 21 | 10 | 10 | 2017 |
Company | Date | Country | Sells day | mm | year | |
5 | LG 2018-10-10 | India | 42 | 10 | 10 | 2018 |
6 | LG 2018-10-09 | Germany | 67 | 10 | 9 | 2018 |
7 | LG 2018-10-09 | USA | 35 | 10 | 9 | 2018 |
8 | LG 2018-10-09 | Brazil | 2 | 10 | 9 | 2018 |
Company | Date Country | Sells day | mm | year | ||
12 | Sony 2019-10-10 | India | 10 | 10 | 10 | 2019 |
13 | Sony 2019-10-10 | India | 26 | 10 | 10 | 2019 |
14 | Sony 2019-10-10 | Brazil | 53 | 10 | 10 | 2019 |
Pandas: Split dataframe on a strign column
This time the dataframe is a different one. And we have records for two companies inside. If our goal is to split this data frame into new ones based on the companies then we can do:
import pandas as pd
df = pd.DataFrame(
{
"Date": [
"08/09/2018", "10/09/2017", "09/09/2017", "04/09/2017", "08/09/2018", "10/09/2018", "09/09/2018",
"04/09/2018",
],
"Company": [
"Samsung", "Samsung", "Samsung", "Samsung", "Sony", "Sony", "Sony", "Sony",
],
}
)
group = df["Company"]
# group by that period
agg = df.groupby([group])
for year, group in agg:
print(group)
result:
Date | Company | |
---|---|---|
0 | 08/09/2018 | Samsung |
1 | 10/09/2017 | Samsung |
2 | 09/09/2017 | Samsung |
3 | 04/09/2017 | Samsung |
Date | Company | |
4 | 08/09/2018 | Sony |
5 | 10/09/2018 | Sony |
6 | 09/09/2018 | Sony |
7 | 04/09/2018 | Sony |