Filtering pandas dataframe by list of a values is a common operation in data science world. You have two main ways of selecting data:
- select pandas rows by exact match from a list
- filter pandas rows by partial match from a list
Related resources:
Also pandas offers big variety of options to solve those problems. I'll recommend to use vectorized operations when it's possible because it's much faster:
Vectorization is the process of executing operations on entire arrays.
So let say that we have this data(Value count for a given column):
Value | Count |
---|---|
Another engineering discipline (ex. civil, electrical, mechanical) | 6945 |
Information systems, information technology, or system administration | 6507 |
A natural science (ex. biology, chemistry, physics) | 3050 |
Mathematics or statistics | 2818 |
Web development or web design | 2418 |
and our goal is to find are this values part of the column and create a series with it:
area_list = ['biology', 'physics', 'Computer', 'enginnering']
to get output like:
biology | physics | Computer | enginnering | |
---|---|---|---|---|
0 | False | False | False | False |
1 | True | True | False | False |
2 | False | False | True | False |
3 | False | False | True | False |
4 | False | False | True | False |
and total count:
biology | physics | |
---|---|---|
False | 73294 | 85904 |
True | 18804 | 6194 |
This can be done by using this code:
import re
area_df = pd.DataFrame(dict((area, df.UndergradMajor.str.contains(area))
for area in area_list))
where:
- we create a new dataframe for the result
- use vectorized function
str.contains
in order to verify if the value is part of the column - create a dictionary for the result of the all values
This example show a partial match. If you want to use a full match than you can use another vectorized method from pandas which is str.isin
. This is how to filter rows by exact match for the values of a list:
df[df['UndergradMajor'].isin(['Mathematics or statistics',
'Web development or web design'])]
This will filter the rows of the dataframe which contains exactly the values from the list.
The bonus tip for today is how to apply value_counts
for the whole dataframe or several columns. This can be done by:
df.apply(pd.Series.value_counts)
the result will be:
Mobile | Data | QA | |
---|---|---|---|
False | 73294 | 70209 | 85904 |
True | 18804 | 21889 | 6194 |
And perform value counts for several columns:
df[['Mobile','QA']].apply(pd.Series.value_counts)
the result will be:
Mobile | QA | |
---|---|---|
False | 73294 | 85904 |
True | 18804 | 6194 |