Pandas - value_counts - multiple columns, all columns and bad data

How do you apply Pandas value_counts to multiple columns, or to all columns, of a DataFrame at once? And how can you use value_counts on bad data, like typos and case-sensitive user input? The next three examples answer these questions. These solutions can easily be used for analysis of CSV or Excel files.

If you are interested in value_counts with percentages and nicer output, you can check this article: Pandas count and percentage by value for a column

Note: The examples in this article were inspired by my participation in the Kaggle competition 2019 Kaggle ML & DS Survey, and this is my notebook on the topic: A Short History of the Data-science

Notebook with the examples: 0.Pandas-value_counts-_multiple_columns%2C_all_columns_and_bad_data.ipynb

Pandas apply value_counts on multiple columns at once

The first example shows how to apply the Pandas method value_counts to multiple columns of a DataFrame at once by using pandas.DataFrame.apply. This solution works well for small to medium-sized DataFrames. The syntax is simple; the first form covers the whole DataFrame:

df_movie.apply(pd.Series.value_counts).head()

while the second form applies it to only a few columns of a given DataFrame:

df_movie[['color', 'content_rating']].apply(pd.Series.value_counts)

This produces a summary of the listed columns:

                  color  content_rating
Black and White   209.0             NaN
Approved            NaN            55.0
Color            4815.0             NaN
G                   NaN           112.0
GP                  NaN             6.0
M                   NaN             5.0
NC-17               NaN             7.0
Not Rated           NaN           116.0
PG                  NaN           701.0
PG-13               NaN          1461.0
Passed              NaN             9.0
R                   NaN          2118.0
TV-14               NaN            30.0
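A minimal, self-contained sketch of the same technique on a tiny made-up DataFrame (the data below is invented; only the column names match the movie dataset):

```python
import pandas as pd

# Toy stand-in for the movie data (values are invented)
df = pd.DataFrame({
    "color": ["Color", "Color", "Black and White", "Color"],
    "content_rating": ["PG", "R", "PG", "PG-13"],
})

# One value_counts per column; values that never appear
# in a column show up as NaN in that column
summary = df.apply(pd.Series.value_counts)
print(summary)
```

Because each column gets its own value_counts, the result index is the union of all observed values, which is why the real output above is mostly NaN.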

Pandas apply value_counts on all columns

Another solution, better suited to bigger DataFrames, helps me quickly explore the stored data and spot potential problems: getting the top values for each column. This is done with a simple loop that applies value_counts and prints the results:

for col in df_resp.columns:
    print('-' * 40 + col + '-' * 40, end=' - ')
    # display() is available in Jupyter notebooks; use print() in plain scripts
    display(df_resp[col].value_counts().head(10))

Example result:

----------------------------------------Q14_OTHER_TEXT---------------------------------------- - 
Python    89
python    45
None      36
Matlab    28
none      22
Name: Q14_OTHER_TEXT, dtype: int64
----------------------------------------Q14_Part_1_TEXT---------------------------------------- - 
Excel              865
Microsoft Excel    392
excel              263
MS Excel            67
Google Sheets       61
Name: Q14_Part_1_TEXT, dtype: int64

As you can see, some of the results are quite similar, which indicates bad data. This technique works very well on big DataFrames I'm not yet familiar with. To clean up this data, move on to the next section.
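The same loop can be run outside a notebook with print. A self-contained sketch on a made-up survey-style DataFrame (column names and values are invented):

```python
import pandas as pd

# Invented survey-style data with near-duplicate answers
df = pd.DataFrame({
    "tool": ["Excel", "excel", "Excel", "Google Sheets"],
    "language": ["Python", "python", "Python", "R"],
})

# Print the top values of every column
for col in df.columns:
    print("-" * 40 + col + "-" * 40)
    print(df[col].value_counts().head(10))
```

Near-duplicates like "Excel" and "excel" are counted separately, which is exactly the problem the next section addresses.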

Pandas apply value_counts on column with bad data

As we saw in the previous section, some of the answers contain duplicated data stored in several different variants, like:

  • 'Tableau',
  • 'Power BI',
  • 'tableau',
  • 'PowerBI',
  • 'Salesforce',
  • 'Tableau ',
  • 'Qlik',
  • 'Power Bi',
  • 'Power bi'

One way to improve the situation is to use a library like difflib, which can recognize similar words and group them, e.g.:

  • 'Power BI', 'PowerBI', 'Power Bi', 'Power bi'

This is how difflib.get_close_matches works:

difflib.get_close_matches('Power BI', ['Power BI', 'tableau', 'PowerBI', 'Power Bi','Salesforce', 'Tableau ', 'Qlik',  'Power bi'], n=3, cutoff=0.6)

We are providing:

  • keyword - 'Power BI'
  • list of keywords - 'Power BI', 'tableau' ...
  • how many results we expect - 3
  • and finally the similarity cutoff - a ratio between 0 and 1

and get the closest matches to our keyword.
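Run on its own, the call above returns the matches ranked by similarity, best first:

```python
import difflib

# Closest matches to 'Power BI' among the observed answer variants
matches = difflib.get_close_matches(
    "Power BI",
    ["Power BI", "tableau", "PowerBI", "Power Bi",
     "Salesforce", "Tableau ", "Qlik", "Power bi"],
    n=3,          # return at most three results
    cutoff=0.6,   # minimum similarity ratio
)
print(matches)  # ['Power BI', 'PowerBI', 'Power Bi']
```

Note that the matching is case-sensitive, so 'Power bi' scores lower than 'PowerBI' and falls outside the top three here.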

After that we create a new column with the cleaned data and apply value_counts to it. The next code snippet shows this:

import difflib

correct_values = {}
# Least frequent answers first, so the most frequent close spelling
# overwrites its variants in the mapping
words = df_resp.Q14_Part_3_TEXT.value_counts(ascending=True).index

for keyword in words:
    similar = difflib.get_close_matches(keyword, words, n=20, cutoff=0.6)
    for x in similar:
        correct_values[x] = keyword

df_resp["corr"] = df_resp["Q14_Part_3_TEXT"].map(correct_values)

You can find the two outputs below.

Before cleaning:

Tableau                    260
Power BI                    71
tableau                     51
PowerBI                     23
Salesforce                  19

After cleaning:

Tableau                                   345
Power BI                                  137
Salesforce                                 43
Qlik                                       27
Spotfire                                   17

The data is still not perfect, but it's much better than its initial state, and we can analyze it more reliably.
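The whole cleaning step can be reproduced end to end on a small synthetic Series (the answers below are invented for illustration):

```python
import difflib
import pandas as pd

# Invented answers with the kinds of typos seen above
s = pd.Series(["Tableau", "tableau", "Tableau", "Power BI",
               "Power BI", "PowerBI", "Power bi", "Qlik"])

# Least frequent first, so the most frequent close spelling
# overwrites its variants in the mapping
words = s.value_counts(ascending=True).index
correct_values = {}
for keyword in words:
    for similar in difflib.get_close_matches(keyword, words, n=20, cutoff=0.6):
        correct_values[similar] = keyword

# Map every raw answer to its canonical spelling and recount
cleaned = s.map(correct_values)
print(cleaned.value_counts())
```

Here 'tableau' collapses into 'Tableau', and 'PowerBI' and 'Power bi' collapse into 'Power BI', because those are the most frequent spellings in their groups.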
