How to Match Text Between Two Strings with Regex in Python

To match any text between two strings/patterns with a regular expression in Python you can use:

re.search(r'pattern1(.*?)pattern2', s).group(1)

In the next sections, you’ll see how to apply the above using a simple example.

In this example we are using a Kaggle dataset. If you would like to learn more about how to read a Kaggle dataset into a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame

Step 1: Match Text Between Two Strings

To start with a simple example, let’s take the following text:

step 1 some text step 2 more text step 3 then more text

and we would like to extract everything between step 1 and step 2. To do so, we are going to use a capture group:

import re
s = 'step 1 some text step 2 more text step 3 then more text'
re.search(r'step 1(.*?)step 2', s).group(1)

result:

' some text '

How it works:

step 1 - matches the characters step 1 literally (case sensitive)
(.*?) - matches any character except a newline, zero or more times, as few times as possible (lazy)
step 2 - matches the characters step 2 literally (case sensitive)
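
Note that re.search returns None when nothing matches, so calling .group(1) on the result directly raises AttributeError. Below is a minimal sketch of a safer wrapper. The helper name text_between is hypothetical (it is not part of the re module), and the sketch assumes the delimiters are fixed strings, which is why they are passed through re.escape:

```python
import re

def text_between(s, start, end):
    """Return the text between the first `start` and the next `end`, or None.

    `text_between` is a hypothetical helper name used for illustration.
    """
    # re.escape makes the delimiters literal, so characters such as '.' or '|'
    # inside them are not interpreted as regex syntax.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # re.DOTALL lets '.' match newlines, so the captured span may cross lines.
    match = re.search(pattern, s, flags=re.DOTALL)
    return match.group(1) if match else None

s = 'step 1 some text step 2 more text step 3 then more text'
print(repr(text_between(s, 'step 1', 'step 2')))  # ' some text '
print(repr(text_between(s, 'step 1', 'step 9')))  # None
```

If the delimiters are themselves regex patterns rather than fixed text, drop the re.escape calls.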

Non-lazy search

The previous example stops at the first step 2 that follows step 1, because the match is lazy. If you would like to extract:

some text step 2 more text

Then you need to change the search to:

re.findall(r'step \d (.*) step \d', s)
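
To see the difference side by side, here is a small sketch that runs the lazy and the greedy variant of the same pattern on the example string. The variable names lazy and greedy are only for illustration:

```python
import re

s = 'step 1 some text step 2 more text step 3 then more text'

# Lazy: (.*?) stops at the first ' step <digit>' after the opening delimiter.
lazy = re.findall(r'step \d (.*?) step \d', s)

# Greedy: (.*) runs as far as possible, up to the last ' step <digit>'.
greedy = re.findall(r'step \d (.*) step \d', s)

print(lazy)    # ['some text']
print(greedy)  # ['some text step 2 more text']
```

Because the spaces around the capture group are part of the pattern, the captured text has no leading or trailing space here.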

Step 2: Match Text Between Two Patterns

Now let's say that you would like to match a pattern and not fixed text. In this example we will see how to extract the text that follows step and a digit, up to the end of the line:

import re
s = 'step 1 some text\nstep 2 more text\nstep 3 then more text\nconclusion'
re.findall(r'(?:step \d)(.*?)(?:\n)', s)

So given the following text:

step 1 some text
step 2 more text
step 3 then more text
conclusion

We will extract:

[' some text', ' more text', ' then more text']

How does it work?

(?:step \d) - non-capturing group, marked by ?: - it is matched but not returned in the result
- step \d - matches the characters step literally (case sensitive) followed by a single digit (equivalent to [0-9] for ASCII text)
(.*?) - 1st capturing group - captures any characters except a newline, as few as possible (lazy)
(?:\n) - non-capturing group
- \n matches a newline character
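
One consequence of ending the pattern with \n is that a step on the last line of the text is skipped when no newline follows it. The sketch below shows this and one possible alternative using re.MULTILINE. The alternative is a suggestion, not the pattern used in this article, and the shortened input string is an assumption made for the demonstration:

```python
import re

# Same text as above but without the trailing 'conclusion' line,
# so the last step is not followed by a newline.
s = 'step 1 some text\nstep 2 more text\nstep 3 then more text'

# The pattern from this step needs '\n' after each capture,
# so it misses the last line.
with_newline = re.findall(r'(?:step \d)(.*?)(?:\n)', s)
print(with_newline)  # [' some text', ' more text']

# With re.MULTILINE, '^' and '$' match at the start and end of every line,
# including the end of the whole string.
per_line = re.findall(r'^step \d(.*)$', s, flags=re.MULTILINE)
print(per_line)      # [' some text', ' more text', ' then more text']
```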

Step 3: Match Text Between Two Patterns Lazy vs Greedy

In this step we will explain the lazy vs greedy match in more detail. The difference can be summarized as:

@(.*?)@ - Lazy
@(.*)@ - Greedy

So let's say that we have a list of emails like:

|[email protected]; |[email protected];|[email protected];

If we do a greedy extraction between two strings we will get:

s = "|[email protected]; |[email protected];|[email protected];"
re.findall(r'\|(.*);', s)

The result will be only 1 match, from the first | to the last ;:

['[email protected]; |[email protected];|[email protected]']

While if we do a lazy search for text between two substrings, as in Step 2, then we will get:

s = 'step 1 some text\nstep 2 more text\nstep 3 then more text\nconclusion'
re.findall(r'(?:step \d)(.*?)(?:\n)', s)

3 separate strings as:

[' some text', ' more text', ' then more text']
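
For a direct comparison on the same kind of input, the sketch below applies both the greedy and the lazy pattern to a |...; delimited list. The addresses are hypothetical placeholders chosen for illustration:

```python
import re

# Hypothetical addresses used for illustration only.
s = '|a@example.com; |b@example.com;|c@example.com;'

# Greedy: one match from the first '|' to the last ';'.
greedy = re.findall(r'\|(.*);', s)
print(greedy)  # ['a@example.com; |b@example.com;|c@example.com']

# Lazy: each match ends at the nearest ';'.
lazy = re.findall(r'\|(.*?);', s)
print(lazy)    # ['a@example.com', 'b@example.com', 'c@example.com']
```

A negated character class such as \|([^;]*); returns the same three items for this input without relying on laziness.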
