Pandas - value_counts - multiple columns, all columns and bad data
How to apply Pandas value_counts on multiple columns or all columns of a DataFrame at Once? How to use value_counts in case of bad data - like typos and case sensitive user inputs. You can find from the next 3 examples. These solutions can be easily used for analyses on CSV files or excel files.
If you are interested in value_counts - use percentage and better output you can check this article: Pandas count and percentage by value for a column
Note: The examples in this article were inspired from my participation in Kaggle competition: 2019 Kaggle ML & DS Survey and this is my notebook on the topic: A Short History of the Data-science
Datasets:
Notebook with the examples: 0.Pandas-value_counts-_multiple_columns%2C_all_columns_and_bad_data.ipynb
Pandas apply value_counts on multiple columns at once
The first example show how to apply Pandas method value_counts on multiple columns of a Dataframe ot once by using pandas.DataFrame.apply. This solution is working well for small to medium sized DataFrames. The syntax is simple - the first one is for the whole DataFrame:
df_movie.apply(pd.Series.value_counts).head()
While the second example is only for a few columns from a given DataFrame:
df_movie[['color', 'content_rating']].apply(pd.Series.value_counts)
This results in summary for the listed columns as:
color | content_rating | |
---|---|---|
Black and White | 209.0 | NaN |
Approved | NaN | 55.0 |
Color | 4815.0 | NaN |
G | NaN | 112.0 |
GP | NaN | 6.0 |
M | NaN | 5.0 |
NC-17 | NaN | 7.0 |
Not Rated | NaN | 116.0 |
PG | NaN | 701.0 |
PG-13 | NaN | 1461.0 |
Passed | NaN | 9.0 |
R | NaN | 2118.0 |
TV-14 | NaN | 30.0 |
Pandas apply value_counts on all columns
Another solution for a bigger DataFrames which helps me to quickly explore stored data and possibly problems with data is by getting top values for each column. This is done with simple loop and applying value_counts and printing the results:
for col in df_resp.columns:
print('-' * 40 + col + '-' * 40 , end=' - ')
display(df_resp[col].value_counts().head(10))
Example result:
----------------------------------------Q14_OTHER_TEXT---------------------------------------- -
Python 89
python 45
None 36
Matlab 28
none 22
Name: Q14_OTHER_TEXT, dtype: int64
----------------------------------------Q14_Part_1_TEXT---------------------------------------- -
Excel 865
Microsoft Excel 392
excel 263
MS Excel 67
Google Sheets 61
Name: Q14_Part_1_TEXT, dtype: int64
As you can see some of the results are quite similar which shows bad data. This technique works very well with big DataFrames which are not known for me. In order to clean this data you can move to next section.
Pandas apply value_counts on column with bad data
As we saw already in the previous section some of the answers contains - duplicated data stored in several different variants like:
- 'Tableau',
- 'Power BI',
- 'tableau',
- 'PowerBI',
- 'Salesforce',
- 'Tableau ',
- 'Qlik',
- 'Power Bi',
- 'Power bi'
One way to improve the situation is by using library like: difflib which can recognize similar words and group them like:
- 'Power BI', 'PowerBI', 'Power Bi', 'Power bi'
This is how difflib.get_close_matches works:
difflib.get_close_matches('Power BI', ['Power BI', 'tableau', 'PowerBI', 'Power Bi','Salesforce', 'Tableau ', 'Qlik', 'Power bi'], n=3, cutoff=0.6)
We are providing:
- keyword - 'Power BI'
- list of keywords - 'Power BI', 'tableau' ...
- how many results we expect - 3
- and finally the probability cutoff
and get the closest matches to our keyword.
After that we are going to create new column with this data and apply value_counts on cleaned data. The next code snippet shows this:
import difflib
correct_values = {}
words = df_resp.Q14_Part_3_TEXT.value_counts(ascending=True).index
for keyword in words:
similar = difflib.get_close_matches(keyword, words, n=20, cutoff=0.6)
for x in similar:
correct_values[x] = keyword
df_resp["corr"] = df_resp["Q14_Part_3_TEXT"].map(correct_values)
You can find the two outputs below:
after cleaning:
Tableau 345
Power BI 137
Salesforce 43
Qlik 27
Spotfire 17
before cleaning:
Tableau 260
Power BI 71
tableau 51
PowerBI 23
Salesforce 19
The data is still not perfect but it's much better than initial state of the data. And we can do better analysis of it.