Pandas: How to filter results of value_counts?
This post will show you two ways to filter value_counts results with Pandas or how to get top 10 results.
From the article you can find also how the value_counts works, how to filter results with isin
and groupby/lambda.
Suppose that you have a Pandas DataFrame that contains columns with limited number of entries. Some values are also listed few times while others more often. Best way to get the counts for the values of this column is to use value_counts()
.
Now let say that you would like to filter it so that it only shows items that are present exactly/at least/at most n times.
Notebook: 22.pandas-how-to-filter-results-of-value_counts.ipynb
Video Tutorial
Step #1: How value_counts works
How value_counts works? Understanding of this question will help you understanding the next steps. value_counts it's a Pandas Series method which returns most frequently-occurring elements in descending order. So in other words:
df[col].value_counts()
outputs:
English 4704
French 73
Spanish 40
Hindi 28
....
Name: language, dtype: int64
now if we apply dtypes
we will get(the same is visible at the bottom of value_counts():
df[col].value_counts().dtypes
result:
dtype('int64')
You can get all the counts/frequency from the method by using:
df[col].value_counts().values
output:
array([4704, 73, 40, 28, 26, ...
Or you can get the items which are counted by:
df[col].value_counts().index
result:
Index(['English', 'French', 'Spanish', 'Hindi',...
In order to get top 10 results with Pandas/Python you can use:
df['language'].value_counts().head(10)
result:
English 4704
French 73
Spanish 40
Hindi 28
Mandarin 26
German 19
Japanese 18
Russian 11
Italian 11
Cantonese 11
Name: language, dtype: int64
Step #2: Filter value_counts with isin
In this step we will see how to get top/bottom results of value count and how to filter rows base on it. Knowing a bit more about value_counts
we will use it in order to filter the items which are present exactly 3 times in a given column:
df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==3].index)].language
This will result in next:
2388 Chinese
2740 Thai
3022 Chinese
3311 Thai
3427 Chinese
3659 Thai
Note that we get all rows which are part of the selection but return only the language column. If you like to get only the unique values you can add method unique
:
df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts() == 3].index)].language.unique()
output:
array(['Chinese', 'Thai'], dtype=object)
In order to understand how the code above works we will deconstruct it:
First step is to use value_counts in direct comparison like:
df['language'].value_counts()==3
this will produce:
English False
...
Dutch False
Chinese True
Thai True
Icelandic False
and use the index of value_counts plus isin
df[col].isin(df[col].value_counts().index)
result in:
0 True
1 True
2 True
3 True
4 False
Where all values which are listed in value_counts results have True and False is for NaN values. So if this filtering is applied we will get all rows part of the value_counts.
Next step is to create mapping for value_counts, apply filtering and get value. To the above code we will add isin
:
df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==1].index)
result in:
Chinese 3
Thai 3
And this is how the code above works. Now you can changed in order to get the values which have count at least 10 times:
df['language'].value_counts()[df['language'].value_counts()> 10]
result in:
English 4704
French 73
Spanish 40
Hindi 28
Mandarin 26
German 19
Japanese 18
Russian 11
Cantonese 11
Italian 11
or get the rows for those values by:
df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts() > 10].index)].language
0 English
1 English
2 English
3 English
5 English
...
5038 English
5039 English
Step #3: Use group by and lambda to simulate filter on value_counts()
The same result can be achieved even without using value_counts()
. We are going to use groubpy
and filter:
df.groupby('language').filter(lambda x: len(x) == 3).language
This will produce all rows which for column language
have values present exactly 3 times.
2388 Chinese
2740 Thai
3022 Chinese
3311 Thai
3427 Chinese
3659 Thai
if you want to get the whole rows you can use:
df.groupby('language').filter(lambda x: len(x) == 3)
result in:
2388 Chinese
2740 Thai
3022 Chinese
3311 Thai
3427 Chinese
3659 Thai
or add unique for the values only:
df.groupby('language').filter(lambda x: len(x) == 3)['language'].unique()
result:
array(['Chinese', 'Thai'], dtype=object)
Step #4: Bonus: Which is faster?
If you want to understand which one you should use - then you need to consider which is faster. We will measure timings by using timeit
which for Jupyter Notebook has this syntax:
%timeit df.groupby('language').filter(lambda x: len(x) == 3)['language']
result:
100 loops, best of 3: 10.9 ms per loop
%timeit df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==3].index)]['language']
result:
100 loops, best of 3: 3.19 ms per loop
So it seems that for this case value_counts
and isin
is 3 times faster than simulation of groupby
. But on the other hand the groupby
example looks a bit easier to understand and change. As always Pandas and Python give us more than one way to accomplish one task and get results in several different ways.