Pandas: How to filter results of value_counts?

Pandas: How to filter results of value_counts?

This post will show you two ways to filter value_counts results with Pandas or how to get top 10 results.
From the article you can find also how the value_counts works, how to filter results with isin and groupby/lambda.

Suppose that you have a Pandas DataFrame that contains columns with limited number of entries. Some values are also listed few times while others more often. Best way to get the counts for the values of this column is to use value_counts().
Now let say that you would like to filter it so that it only shows items that are present exactly/at least/at most n times.

Notebook: 22.pandas-how-to-filter-results-of-value_counts.ipynb
Video Tutorial

Step #1: How value_counts works

How value_counts works? Understanding of this question will help you understanding the next steps. value_counts it's a Pandas Series method which returns most frequently-occurring elements in descending order. So in other words:

df[col].value_counts()

outputs:

English       4704
French          73
Spanish         40
Hindi           28
....
Name: language, dtype: int64

now if we apply dtypes we will get(the same is visible at the bottom of value_counts():

df[col].value_counts().dtypes

result:

dtype('int64')

You can get all the counts/frequency from the method by using:

df[col].value_counts().values

output:

array([4704,   73,   40,   28,   26, ...

Or you can get the items which are counted by:

df[col].value_counts().index

result:

Index(['English', 'French', 'Spanish', 'Hindi',...

In order to get top 10 results with Pandas/Python you can use:

df['language'].value_counts().head(10)

result:

English      4704
French         73
Spanish        40
Hindi          28
Mandarin       26
German         19
Japanese       18
Russian        11
Italian        11
Cantonese      11
Name: language, dtype: int64

Step #2: Filter value_counts with isin

In this step we will see how to get top/bottom results of value count and how to filter rows base on it. Knowing a bit more about value_counts we will use it in order to filter the items which are present exactly 3 times in a given column:

df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==3].index)].language

This will result in next:

2388    Chinese
2740       Thai
3022    Chinese
3311       Thai
3427    Chinese
3659       Thai

Note that we get all rows which are part of the selection but return only the language column. If you like to get only the unique values you can add method unique:

df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts() == 3].index)].language.unique()

output:

array(['Chinese', 'Thai'], dtype=object)

In order to understand how the code above works we will deconstruct it:

First step is to use value_counts in direct comparison like:

df['language'].value_counts()==3

this will produce:

English       False
...
Dutch         False
Chinese        True
Thai           True
Icelandic     False

and use the index of value_counts plus isin

df[col].isin(df[col].value_counts().index)

result in:

0        True
1        True
2        True
3        True
4       False

Where all values which are listed in value_counts results have True and False is for NaN values. So if this filtering is applied we will get all rows part of the value_counts.

Next step is to create mapping for value_counts, apply filtering and get value. To the above code we will add isin :

df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==1].index)

result in:

Chinese    3
Thai       3

And this is how the code above works. Now you can changed in order to get the values which have count at least 10 times:

df['language'].value_counts()[df['language'].value_counts()> 10]

result in:

English      4704
French         73
Spanish        40
Hindi          28
Mandarin       26
German         19
Japanese       18
Russian        11
Cantonese      11
Italian        11

or get the rows for those values by:

df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts() > 10].index)].language
0       English
1       English
2       English
3       English
5       English
         ...   
5038    English
5039    English

Step #3: Use group by and lambda to simulate filter on value_counts()

The same result can be achieved even without using value_counts(). We are going to use groubpy and filter:

df.groupby('language').filter(lambda x: len(x) == 3).language

This will produce all rows which for column language have values present exactly 3 times.

2388    Chinese
2740       Thai
3022    Chinese
3311       Thai
3427    Chinese
3659       Thai

if you want to get the whole rows you can use:

df.groupby('language').filter(lambda x: len(x) == 3)

result in:

2388    Chinese
2740       Thai
3022    Chinese
3311       Thai
3427    Chinese
3659       Thai

or add unique for the values only:

df.groupby('language').filter(lambda x: len(x) == 3)['language'].unique()

result:

array(['Chinese', 'Thai'], dtype=object)

Step #4: Bonus: Which is faster?

If you want to understand which one you should use - then you need to consider which is faster. We will measure timings by using timeit which for Jupyter Notebook has this syntax:

%timeit df.groupby('language').filter(lambda x: len(x) == 3)['language']

result:

100 loops, best of 3: 10.9 ms per loop
%timeit df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==3].index)]['language']

result:

100 loops, best of 3: 3.19 ms per loop

So it seems that for this case value_counts and isin is 3 times faster than simulation of groupby. But on the other hand the groupby example looks a bit easier to understand and change. As always Pandas and Python give us more than one way to accomplish one task and get results in several different ways.

Share Tweet Send
0 Comments
Loading...