Looking for SQL like operator in Pandas?

If so, let's check several examples of Pandas text matching simulating Like operator.

To start, here is a sample DataFrame which will be used in the next examples:

data = {'num_legs': [4, 2, 0, 4, 2, 2],
        'num_wings': [0, 2, 0, 0, 2, 2],
         'class':['mammal', 'bird', 'fish', 'mammal', None,'mammal']}
df = pd.DataFrame(data,
                  index=['dog', 'hawk', 'shark', 'cat', 'crow', 'human'])

data:

num_legs num_wings class
dog 4 0 mammal
hawk 2 2 bird
shark 0 0 fish
cat 4 0 mammal
crow 2 2
human 2 2 mammal

Example 1: Pandas find rows which contain string

The first example is about filtering rows in DataFrame which is based on cell content - if the cell contains a given pattern extract it otherwise skip the row. Let's get all rows for which column class contains letter i:

df['class'].str.contains('i', na=False)

this will result in Series of True and False:

dog False
hawk True
shark True
cat False
crow False
human False

If you like to get the the whole row then you can use: df[df['class'].str.contains('i', na=False)]

num_legs num_wings class
hawk 2 2 bird
shark 0 0 fish

Note: na=False will skip rows with None values. If you need them - use na=True. In case that parameter na is not specified then error will be raised:

ValueError: Cannot mask with non-boolean array containing NA / NaN values

Example 2: Pandas simulate Like operator and regex

Second example will demonstrate the usage of Pandas contains plus regex. Activating regex matching is done by regex=True. The pipe operator 'sh|rd' is used as or:

df[df['class'].str.contains('sh|rd', regex=True, na=True)]

The code above will search for all rows which contains:

  • sh
  • rd
  • None

so the output is:

num_legs num_wings class
hawk 2 2 bird
shark 0 0 fish
crow 2 2

More regex examples:

  • match rows which digits - df['class'].str.contains('\d', regex=True)
  • match rows case insensitive - df['class'].str.contains('bird', flags=re.IGNORECASE, regex=True)

Note: Usage of regular expression might slow down the operation in magnitude for bigger DataFrames

Example 3: Pandas match rows starting with text

Let's find all rows with index starting by letter h by using function str.startswith:

df[df.index.str.startswith('h', na=False)]

output:

num_legs num_wings class
hawk 2 2 bird
human 2 2 mammal

Example 4: Pandas match rows ending with text

The same logic can be applied with function: .str.endswith in order to rows which values ends with a given string:

df[df.index.str.endswith('k', na=False)]

output:

num_legs num_wings class
hawk 2 2 bird
shark 0 0 fish

Example 5: Pandas Like operator with Query

Pandas queries can simulate Like operator as well. Let's find a simple example of it. Here is the moment to point out two points:

  • naming columns with reserved words like class is dangerous and might cause errors
  • the other culprit for errors are None values.

So in order to use query plus str.contains we need to rename column class to classd and fill the None values.

Then we can do:

df.query('classd.str.contains("i")', engine='python')

which will result in:

num_legs num_wings class
hawk 2 2 bird
shark 0 0 fish

or combination with other conditions:

df.query('classd.str.contains("i") and classd.str.endswith("d") ', engine='python')

the output is:

num_legs num_wings classd
hawk 2 2 bird

Step 6: Pandas Like operator match numbers only

For this example we are going to use numeric Series like:

s = pd.Series(['20.03', '11', '23.0', '65', '60', 'a', None])

Return all rows with numbers:

s.str.contains('\d', regex=True)

output:

[True, True, True, True, True, False, None]

Return decimal numbers:

s.str.contains('\d.\d', regex=True)

result:

[True, False, True, False, False, False, None]

How to filter for decimal numbers which have 0 after the point like 20.03, 23.0: Is pattern .0 good enough?

s.str.contains('.0', regex=True)

No - because 60 is matched too:

[True, False, True, False, True, False, None]

The reason is that pattern .0 matches any character followed by a 0. Searching for floating numbers with dot followed by 0 is done by:

s.str.contains('\.0', regex=True)

result:

[True, False, True, False, False, False, None]

Step 7: Pandas SQL Like operator

There is a python module: pandasql which allows SQL syntax for Pandas. It can be installed by:

pip install -U pandasql

Basic example:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Like operator:

sqldf("select * from df where classd like 'h%';", locals())

Resources