Looking to select rows in a CSV file or a DataFrame based on date columns/range with Python/Pandas? If so, you can apply the next steps in order to get the rows between two dates in your DataFrame/CSV file.
The steps will depend on your situation and data. Below is described optimal sequence which should work for any case with small changes.
Notebook: Select rows between two dates DataFrame with Pandas
Step 1: Import Pandas and read data/create DataFrame
The first step is to read the CSV file and converted to a Pandas DataFrame. This step is important because impacts data types loaded - sometimes numbers and dates can be considered as objects - which will limit the operation available for them.
import pandas as pd df = pd.read_csv("./tmp/data.csv")
In order to ensure that date columns are parsed correctly as Datetime you must implicitly add them like:
dateCols = ['datetime_col'] pd.read_csv("./tmp/data.csv", parse_dates=dateCols)
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.
Step 2: Pandas: Verify columns containing dates
Next step is to ensure that columns which contain dates are stored with correct type:
datetime64. This verification can be done by:
Expected result is something like:
pages int64 title object is_loaded bool loading_datetime datetime64[ns, UTC] datetime_col object
if the column for date is stored as object then it should be converted to datetime.
Another possible way to verify the data is by:
You can see what is stored inside and data type:
0 2019-10-29 9:06:03 1 2019-10-31 11:16:43 2 2019-10-30 21:15:23 3 2019-10-30 20:26:35 Name: datetime_col, dtype: object
Step 3: Convert string to datetime in DataFrame
In order to convert a column stored as a object/string into a DataFrame you can try the next:
Now after a check you can expect to have type
Note: In order to avoid errors related to different timestamp formats you can use this parameter:
df.datetime_col=pd.to_datetime(df.datetime_col, utc= True)
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
Step 4: Select rows between two dates
1. Select rows based on dates with loc
If all the previous steps are done then you can apply the selection based on dates. One possible way to do this is by next:
start_date = pd.to_datetime('6/11/2018 9:32', utc= True) end_date = pd.to_datetime('4/19/2020 11:42', utc= True) df.loc[(df['datetime_col'] > start_date) & (df['datetime_col'] < end_date)]
this will filter all results between this two dates.
2. Use Series function between
A Pandas Series function
between can be used by giving the start and end date as Datetime. This is my preferred method to select rows based on dates.:
3. Select rows between two times
Sometimes you may need to filter the rows of a DataFrame based only on time. In this case you can use function: pandas.DataFrame.between_time
In order this selection to work you need to have index which is DatetimeIndex. This can be achieved by:
df = df.set_index(['datetime_col'])
4. Select rows based on dates without loc
Another possible way to achieve similar result is by:
df[(df['datetime_col'] > '2018-12-02') & (df['datetime_col'] <= '2018-12-03 23:26:10+00:00')]
Be careful because this option will work even if you try to use non Datetime columns and the result might be unexpected.
5. Use mask to mark the records
Final option is combination of several previous methods:
mask = (df['datetime_col'] > start_date) & (df['datetime_col'] <= end_date) df.loc[mask]
This will filter the rows based on the mask - the mask can be reused later for different logselection and the DataFrame is not changed.
6. Select records from last month/30 days
Sometimes you will need to work with data from the last month/week/days. This can be done by:
df[df["datetime_col"] >= (pd.to_datetime('8/24/2019', utc=True) - pd.Timedelta(days=30))]
There are two things to be considered in this example:
pd.to_datetime('8/24/2019', utc=True)- defines a date from which you are going to calculate
pd.Timedelta(days=30)- how many days you like to shift back.
If you try to convert column which is not a date by:
df.name=pd.to_datetime(df.name)you will get the following error:
ValueError: ('Unknown string format:', 'Pandas')
If you try to use pandas:
df.between_time(start_date, end_date)with index which is not DatetimeIndex:
TypeError: Index must be DatetimeIndex
In case of comparison between Datetime objects with different format like:
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects