Filtering pandas dataframe by list of a values is a common operation in data science world. You have two main ways of selecting data:
- select pandas rows by exact match from a list
- filter pandas rows by partial match from a list
Related resources:
Also pandas offers big variety of options to solve those problems. I'll recommend to use vectorized operations when it's possible because it's much faster:
Vectorization is the process of executing operations on entire arrays.
So let say that we have this data(Value count for a given column):
| Value | Count |
|---|---|
| Another engineering discipline (ex. civil, electrical, mechanical) | 6945 |
| Information systems, information technology, or system administration | 6507 |
| A natural science (ex. biology, chemistry, physics) | 3050 |
| Mathematics or statistics | 2818 |
| Web development or web design | 2418 |
and our goal is to find are this values part of the column and create a series with it:
area_list = ['biology', 'physics', 'Computer', 'enginnering']
to get output like:
| biology | physics | Computer | enginnering | |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | True | True | False | False |
| 2 | False | False | True | False |
| 3 | False | False | True | False |
| 4 | False | False | True | False |
and total count:
| biology | physics | |
|---|---|---|
| False | 73294 | 85904 |
| True | 18804 | 6194 |
This can be done by using this code:
import re
area_df = pd.DataFrame(dict((area, df.UndergradMajor.str.contains(area))
for area in area_list))
where:
- we create a new dataframe for the result
- use vectorized function
str.containsin order to verify if the value is part of the column - create a dictionary for the result of the all values
This example show a partial match. If you want to use a full match than you can use another vectorized method from pandas which is str.isin. This is how to filter rows by exact match for the values of a list:
df[df['UndergradMajor'].isin(['Mathematics or statistics',
'Web development or web design'])]
This will filter the rows of the dataframe which contains exactly the values from the list.
The bonus tip for today is how to apply value_counts for the whole dataframe or several columns. This can be done by:
df.apply(pd.Series.value_counts)
the result will be:
| Mobile | Data | QA | |
|---|---|---|---|
| False | 73294 | 70209 | 85904 |
| True | 18804 | 21889 | 6194 |
And perform value counts for several columns:
df[['Mobile','QA']].apply(pd.Series.value_counts)
the result will be:
| Mobile | QA | |
|---|---|---|
| False | 73294 | 85904 |
| True | 18804 | 6194 |