Python Pandas extract URL or date by regex
In Pandas, extraction of string patterns is done with methods like str.extract or str.extractall, which support regular expression matching. The extract method supports capture and non-capture groups.
Note: The difference between the string methods extract and extractall is that the first matches and extracts only the first occurrence, while the second extracts every match!
This is how the method is described in the official Pandas documentation:
Extract capture groups in the regex pat as columns in a DataFrame.
For each subject string in the Series, extract groups from the first match of regular expression pat.
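To see the difference in a tiny standalone example (a quick sketch, separate from the data used below):

import pandas as pd

s = pd.Series(['a1 a2', 'b1', 'c1 c2 c3'])

# extract: only the first match per string - one row per element
print(s.str.extract(r'([a-c]\d)'))

# extractall: every match - result indexed by (element, match)
print(s.str.extractall(r'([a-c]\d)'))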
Examples and data can be found in my GitHub repository (you can find many different examples there): Pandas extract url and date from column
Data for the examples is stored in a CSV file which is read with pandas:
|   | log |
|---|-----|
| 0 | "2019-10-28 19:56:03 DEMO <GET https://www.wikipedia.org/> (The Free Encyclopedia) 2019-10-29 9:06:03" |
| 1 | "2019-10-29 19:56:03 DEMO <GET https://en.wikipedia.org/wiki/Main_Page> (5,962,233 articles in English) 2019-10-31 11:16:43" |
| 2 | "2019-10-29 19:56:03 DEMO <GET https://it.wikipedia.org/wiki/Pagina_principale> (1 561 730 voci in italiano) 2019-10-30 21:15:23" |
| 3 | "2019-10-30 19:56:03 DEMO <GET https://pt.wikipedia.org/wiki/Wikipédia:Página_principal> (1 014 783 artigos em português) 2019-10-30 20:26:35" |
More resources from me:
- Pandas dataframe search for string in all columns filter Regex
- Python Regex match date
- Tag Regex - many different examples for regex
How to extract URL(s) from pandas DataFrame
In the first example, extraction is done for a specific netloc/domain, which is the easier problem to solve - all URLs share a specific domain and there is only one domain in the column.
import pandas as pd

result = pd.read_csv('../csv/url_single.csv')
result['url'] = result.log.str.extract(r'(https.*)(?:>)', expand=False)
result.head()
resulting in:
url
https://www.wikipedia.org/
https://en.wikipedia.org/wiki/Main_Page
...
If there are many different domains and you need to filter them, you can apply the filter to the extracted data:
result[result['url'].str.contains('it.wikipedia.org')]
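A complete sketch for the multi-domain case - note na=False, which treats rows where nothing was extracted as non-matching instead of propagating NaN:

# extract every URL, then keep only the Italian Wikipedia rows
result['url'] = result.log.str.extract(r'(https.*)(?:>)', expand=False)
italian = result[result['url'].str.contains('it.wikipedia.org', na=False)]
print(italian['url'])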
Python Regex templates for URLs
- simple one - with one protocol and a well-defined ending (the > character):
(https.*)(?:>)
- a pattern which matches different protocols and URL templates
(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})
- Separate extraction for protocol, subdomain, netloc, parameters
r'(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?'
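To show how such templates plug into str.extract, named groups turn each part into its own column. The pattern below is a simplified variant written for this sketch, not one of the templates above:

# named groups become column names in the resulting DataFrame
parts = result.log.str.extract(r'(?P<protocol>https?)://(?P<host>[^/\s>]+)(?P<path>[^\s>]*)')
print(parts.head())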
How to extract Date(s) from pandas DataFrame
Let's use the same data for extraction. This time our goal is to extract all dates from the column log. The example below demonstrates how to do this:
result['date'] = result.log.str.extract(r'(\d{4}-\d{2}-\d{2})')
output:
2019-10-28
2019-10-29
2019-10-29
2019-10-30
Extract multiple dates from a single column
As you may have noticed already, we have two dates in the column but only one of them is extracted. In order to extract all date matches from this column by regex, the method extractall should be used:
result.log.str.extractall(r'(\d{4}-\d{2}-\d{2})')
But this time the result comes with a MultiIndex:
- the first levels come from the index of the subject Series
- the last level is named 'match' and indexes the matches in each item of the Series
|   | match | 0 |
|---|-------|---|
| 0 | 0 | 2019-10-28 |
|   | 1 | 2019-10-29 |
| 1 | 0 | 2019-10-29 |
|   | 1 | 2019-10-31 |
| 2 | 0 | 2019-10-29 |
|   | 1 | 2019-10-30 |
| 3 | 0 | 2019-10-30 |
|   | 1 | 2019-10-30 |
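If you only need one of the matches, you can select it from the 'match' level directly (a sketch - here 1 picks the second date in each row):

# second date per row, indexed like the original DataFrame again
all_dates = result.log.str.extractall(r'(\d{4}-\d{2}-\d{2})')
second_date = all_dates.xs(1, level='match')
print(second_date)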
If the format of the output doesn't satisfy your needs, you can use the method unstack to reshape the result:
result.log.str.extractall(r'(\d{4}-\d{2}-\d{2})').unstack()
output:
| match | 0 | 1 |
|-------|---|---|
| 0 | 2019-10-28 | 2019-10-29 |
| 1 | 2019-10-29 | 2019-10-31 |
| 2 | 2019-10-29 | 2019-10-30 |
| 3 | 2019-10-30 | 2019-10-30 |
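To get friendlier column names and attach the result back to the original DataFrame, you can rename the unstacked columns and join. The names start_date and end_date are just an assumption for this sketch:

dates = result.log.str.extractall(r'(\d{4}-\d{2}-\d{2})').unstack()
dates.columns = ['start_date', 'end_date']  # assumed descriptive names
result = result.join(dates)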
Extract datetime from a column with Pandas
As a bonus, if you'd like to extract a datetime you can apply the following regular expression pattern:
# extract datetime
result['datetime'] = result.log.str.extract(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')
result:
0 2019-10-28 19:56:03
1 2019-10-29 19:56:03
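The extracted values are still plain strings. If you need real datetime objects (for sorting or date arithmetic), you can convert the column with pd.to_datetime:

# convert the extracted strings to datetime64 dtype
result['datetime'] = pd.to_datetime(result['datetime'], format='%Y-%m-%d %H:%M:%S')
print(result['datetime'].dtype)  # datetime64[ns]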
Match datetime, extract only date
Sometimes you need to match a bigger pattern but extract only part of it. In this case a non-capture group should be used, which is marked by ?:. Non-capture groups are matched but not extracted!
# match the full datetime, extract only the date
result['date'] = result.log.str.extract(r'(\d{4}-\d{2}-\d{2}) (?:\d{2}:\d{2}:\d{2})')
result:
0 2019-10-28
1 2019-10-29
...
Match datetime, extract separate date and time
In case you need to match a pattern like a datetime but want the data separated into different columns - date and time - then you can work with capturing groups. The syntax for this kind of group is parentheses: ()
# match datetime, extract date and time into separate columns
result[['date', 'time']] = result.log.str.extract(r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})')
result:
| | date | time |
|---|------------|----------|
| 0 | 2019-10-28 | 19:56:03 |
| 1 | 2019-10-29 | 19:56:03 |
| 2 | 2019-10-29 | 19:56:03 |
| 3 | 2019-10-30 | 19:56:03 |
Alternative method for extraction (without using regex)
Sometimes regular expressions are too complicated or take a lot of time to compute. In some cases a simpler and faster alternative is another string method, str.split, which is suitable for well-structured data (as in the example above):
result['url_split'] = 'https' + result.log.str.split('https', expand=True)[1].str.split('>', expand=True)[0]
which will result in:
0 https://www.wikipedia.org/
1 https://en.wikipedia.org/wiki/Main_Page
2 https://it.wikipedia.org/wiki/Pagina_principale
3 https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal
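The same split-based idea also works for the leading date and time, since every log line starts with them (a sketch that relies on this fixed position):

# the first token of each log line is the date, the second is the time
result['date_split'] = result.log.str.split(' ', expand=True)[0]
result['time_split'] = result.log.str.split(' ', expand=True)[1]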