Looking to select rows in a CSV file or a DataFrame based on date columns/range with Python/Pandas? If so, you can apply the next steps in order to get the rows between two dates in your DataFrame/CSV file.
The steps will depend on your situation and data. Below is described optimal sequence which should work for any case with small changes.
Video Tutorial
Notebook: Select rows between two dates DataFrame with Pandas
Step 1: Import Pandas and read data/create DataFrame
The first step is to read the CSV file and converted to a Pandas DataFrame. This step is important because impacts data types loaded - sometimes numbers and dates can be considered as objects - which will limit the operation available for them.
import pandas as pd
df = pd.read_csv("./tmp/data.csv")
In order to ensure that date columns are parsed correctly as Datetime you must implicitly add them like:
dateCols = ['datetime_col']
pd.read_csv("./tmp/data.csv", parse_dates=dateCols)
Notice:
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.
Step 2: Pandas: Verify columns containing dates
Next step is to ensure that columns which contain dates are stored with correct type: datetime64
. This verification can be done by:
df.dtypes
Expected result is something like:
pages int64
title object
is_loaded bool
loading_datetime datetime64[ns, UTC]
datetime_col object
if the column for date is stored as object then it should be converted to datetime.
Another possible way to verify the data is by:
df.datetime_col
You can see what is stored inside and data type:
0 2019-10-29 9:06:03
1 2019-10-31 11:16:43
2 2019-10-30 21:15:23
3 2019-10-30 20:26:35
Name: datetime_col, dtype: object
Step 3: Convert string to datetime in DataFrame
In order to convert a column stored as a object/string into a DataFrame you can try the next:
df.datetime_col=pd.to_datetime(df.datetime_col)
Now after a check you can expect to have type datetime64
Note: In order to avoid errors related to different timestamp formats you can use this parameter:
df.datetime_col=pd.to_datetime(df.datetime_col, utc= True)
which will:
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
Step 4: Select rows between two dates
1. Select rows based on dates with loc
If all the previous steps are done then you can apply the selection based on dates. One possible way to do this is by next:
start_date = pd.to_datetime('6/11/2018 9:32', utc= True)
end_date = pd.to_datetime('4/19/2020 11:42', utc= True)
df.loc[(df['datetime_col'] > start_date) & (df['datetime_col'] < end_date)]
this will filter all results between this two dates.
2. Use Series function between
A Pandas Series function between
can be used by giving the start and end date as Datetime. This is my preferred method to select rows based on dates.:
df[df.datetime_col.between(start_date, end_date)]
3. Select rows between two times
Sometimes you may need to filter the rows of a DataFrame based only on time. In this case you can use function: pandas.DataFrame.between_time
df.between_time('23:26', '23:50')
In order this selection to work you need to have index which is DatetimeIndex. This can be achieved by:
df = df.set_index(['datetime_col'])
4. Select rows based on dates without loc
Another possible way to achieve similar result is by:
df[(df['datetime_col'] > '2018-12-02') & (df['datetime_col'] <= '2018-12-03 23:26:10+00:00')]
Be careful because this option will work even if you try to use non Datetime columns and the result might be unexpected.
5. Use mask to mark the records
Final option is combination of several previous methods:
mask = (df['datetime_col'] > start_date) & (df['datetime_col'] <= end_date)
df.loc[mask]
This will filter the rows based on the mask - the mask can be reused later for different logselection and the DataFrame is not changed.
6. Select records from last month/30 days
Sometimes you will need to work with data from the last month/week/days. This can be done by:
df[df["datetime_col"] >= (pd.to_datetime('8/24/2019', utc=True) - pd.Timedelta(days=30))]
There are two things to be considered in this example:
pd.to_datetime('8/24/2019', utc=True)
- defines a date from which you are going to calculatepd.Timedelta(days=30)
- how many days you like to shift back.
Possible errors
-
If you try to convert column which is not a date by:
df.name=pd.to_datetime(df.name)
you will get the following error:ValueError: ('Unknown string format:', 'Pandas')
-
If you try to use pandas:
df.between_time(start_date, end_date)
with index which is not DatetimeIndex:TypeError: Index must be DatetimeIndex
-
In case of comparison between Datetime objects with different format like:
-
2015-05-13 08:41:00
-
2018-12-02 23:26:10+00:00
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects