Filtering a pandas dataframe by a list of values is a common operation in data science. There are two main ways of selecting the data:

  • select pandas rows by exact match from a list
  • filter pandas rows by partial match from a list

pandas offers a wide variety of options for solving these problems. I recommend using vectorized operations whenever possible because they are much faster:

Vectorization is the process of executing operations on entire arrays.
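As a rough illustration of the difference, here is a minimal sketch that runs the same substring check twice, once with a Python loop and once with a single vectorized call. The three sample values are made up for the example and are not the survey data used in this post:

```python
import pandas as pd

# Hypothetical sample values, only for illustration
majors = pd.Series(['Mathematics or statistics',
                    'A natural science (ex. biology, chemistry, physics)',
                    'Web development or web design'])

# Element-by-element check written as a Python loop
looped = pd.Series(['biology' in value for value in majors])

# Vectorized check: one call applied to the whole column
# (regex=False treats 'biology' as a plain substring)
vectorized = majors.str.contains('biology', regex=False)

# Both approaches should flag the same rows
print(vectorized.tolist() == looped.tolist())
```

On a column this small the speed difference is not noticeable; the vectorized form tends to pay off on large columns and is also shorter to write.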

Let's say that we have this data (value counts for a given column):

Value                                                                  Count
Another engineering discipline (ex. civil, electrical, mechanical)     6945
Information systems, information technology, or system administration  6507
A natural science (ex. biology, chemistry, physics)                    3050
Mathematics or statistics                                              2818
Web development or web design                                          2418

Our goal is to check whether each of the following values appears in the column and to create a boolean series for each one:

area_list = ['biology', 'physics', 'Computer', 'enginnering']

to get output like:

biology physics Computer enginnering
0 False False False False
1 True True False False
2 False False True False
3 False False True False
4 False False True False

and total count:

biology physics
False 73294 85904
True 18804 6194

This can be done by using this code:

import pandas as pd

# One boolean column per search term; str.contains does a partial match
area_df = pd.DataFrame({area: df.UndergradMajor.str.contains(area)
                        for area in area_list})

where:

  • we create a new dataframe to hold the result
  • we use the vectorized method str.contains to check whether each value occurs in the column
  • we build a dictionary that maps each value to its result
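The steps above can be tried on a small dataframe. This is a minimal sketch under a few assumptions: the four rows are invented for the example, and the na=False and regex=False arguments are my additions (they are not in the code above) so that missing values become False and the search terms are treated as plain text rather than regular expressions:

```python
import pandas as pd

# Hypothetical sample rows; only the column name follows this post
df = pd.DataFrame({'UndergradMajor': [
    'Mathematics or statistics',
    'A natural science (ex. biology, chemistry, physics)',
    'Computer science, computer engineering, or software engineering',
    None,  # a missing answer
]})

area_list = ['biology', 'physics', 'Computer']

# One boolean column per search term.
# na=False: missing values give False instead of NaN
# regex=False: each term is matched as a literal substring
area_df = pd.DataFrame({
    area: df['UndergradMajor'].str.contains(area, na=False, regex=False)
    for area in area_list
})

# Keep only the rows that match at least one of the terms
matched = df[area_df.any(axis=1)]
print(area_df)
print(matched)
```

Note that str.contains is case sensitive by default, so 'Computer' would not match 'computer'; passing case=False is one way to relax that.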

This example shows a partial match. If you want an exact match, then you can use another vectorized pandas method, isin (it is a method of the series itself, not of the str accessor). This is how to filter rows by exact match against the values of a list:

df[df['UndergradMajor'].isin(['Mathematics or statistics', 
                              'Web development or web design'])]

This keeps only the rows of the dataframe whose value is exactly equal to one of the values in the list.
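Here is a small self-contained sketch of exact matching with isin. The three rows are made up for the example, and the variable names wanted, exact, others and partial are only illustrative:

```python
import pandas as pd

# Hypothetical sample rows; only the column name follows this post
df = pd.DataFrame({'UndergradMajor': [
    'Mathematics or statistics',
    'Web development or web design',
    'A natural science (ex. biology, chemistry, physics)',
]})

wanted = ['Mathematics or statistics', 'Web development or web design']

# Rows whose value is exactly one of the listed values
exact = df[df['UndergradMajor'].isin(wanted)]

# The ~ operator negates the mask: rows NOT in the list
others = df[~df['UndergradMajor'].isin(wanted)]

# A partial value does not count as a match for isin
partial = df[df['UndergradMajor'].isin(['Mathematics'])]

print(exact)
print(others)
print(len(partial))
```

The last filter returns no rows because isin compares whole values, which is the difference from the str.contains approach shown earlier.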

The bonus tip for today is how to apply value_counts to the whole dataframe or to several columns. For the whole dataframe this can be done by:

df.apply(pd.Series.value_counts)

The result will be:

Mobile Data QA
False 73294 70209 85904
True 18804 21889 6194

To perform value counts for only some of the columns:

df[['Mobile','QA']].apply(pd.Series.value_counts)

The result will be:

Mobile QA
False 73294 85904
True 18804 6194
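To try the bonus tip without the original data, here is a minimal sketch on a tiny boolean dataframe. The values are invented (the column names are borrowed from the output above), and the fillna step is my addition: a column that never contains one of the values gets NaN for it, which can be replaced with 0:

```python
import pandas as pd

# Hypothetical boolean flags; the values are made up
flags = pd.DataFrame({
    'Mobile': [True, False, False, True],
    'Data':   [False, False, True, True],
    'QA':     [False, False, False, False],
})

# value_counts is applied to each column separately
counts = flags.apply(pd.Series.value_counts)

# 'QA' has no True values, so its True count is NaN;
# replace it with 0 and go back to integers
counts = counts.fillna(0).astype(int)

# The same idea restricted to a subset of the columns
subset_counts = flags[['Mobile', 'QA']].apply(pd.Series.value_counts)

print(counts)
print(subset_counts)
```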