Pandas : Crosstab - cross tabulation of two (or more) factors - usage and examples
In this article you can learn:
-
What is crosstab and how to use it?
-
How to show percentage and totals
-
Several example for advanced usage
-
Business use
What is Pandas crosstab?
Pandas crosstab can be considered as pivot table equivalent ( from Excel or LibreOffice Calc). It shows summary as tabular representation based on several factors. The information can be presented as counts, percentage, sum, average or other statistical methods.
The official Pandas Documentation describe it as:
Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
In this example we are going to work with movies dataset: IMDB 5000 movie dataset. By using cross-tabulation we will try to answer to several question related to this dataset:
- How many movies does actor have per country?
- What is the percentage per actor and per country?
- What is the total per country and per actor?
- Average IMDB rating per actor and per country
Steps to use Pandas crosstab
Steps 1: Import Pandas and read data
The first steps in order to use pandas cross-tabulation method is to read your data and create DataFrame object. The next lines of code show how to create DataFrame from CSV file:
import pandas as pd
df = pd.read_csv("../csv/movie_metadata.csv")
The input for cross-tabulation is categorical data. In the event that there aren't overlapping indexes an empty DataFrame will be returned.
Step 2: Select data for the crosstab
Next you need to analyze your data and select the values which best represent your problem or question. Typical benefits of using crosstab or pivot tables are:
- Data summary
- Data aggregation
- Grouping
- Quick Reports
- Data patterns
Usually I investigate Pandas DataFrame by getting several records from it:
df.head().T
or checking column names:
df.columns
Step 3: Create cross-tabulation table
The last step is to build your crosstab. The cross-tabulation includes many different options and parameters which make it really powerful tool for data analysis.
The most simple usage of Pandas crosstab is:
pd.crosstab(df2['director_name'], df2['country']
This will result in summary table like:
country | Australia | Canada | New Zealand | UK | USA |
---|---|---|---|---|---|
director_name | |||||
Baz Luhrmann | 1 | 0 | 0 | 0 | 0 |
Brett Ratner | 0 | 1 | 0 | 0 | 0 |
David Yates | 0 | 0 | 0 | 1 | 0 |
Gore Verbinski | 0 | 0 | 0 | 0 | 2 |
Jon Favreau | 0 | 0 | 0 | 1 | 0 |
Marc Forster | 0 | 0 | 0 | 1 | 0 |
Peter Jackson | 0 | 0 | 2 | 0 | 1 |
Sam Mendes | 0 | 0 | 0 | 2 | 0 |
Pandas crosstabe can be used also for multiple columns for rows or columns. The syntax for multiple rows can be seen below:
pd.crosstab([df2['director_name'], df2['genres']], df2['country'])
This will add one more level or will do MultiIndex for your cross-tabulation.
country | Australia | Canada | New Zealand | UK | USA | |
---|---|---|---|---|---|---|
director_name | genres | |||||
Baz Luhrmann | Drama;Romance | 1 | 0 | 0 | 0 | 0 |
Brett Ratner | Action;Adventure;Fantasy;Sci-Fi;Thriller | 0 | 1 | 0 | 0 | 0 |
David Yates | Adventure;Family;Fantasy;Mystery | 0 | 0 | 0 | 1 | 0 |
Gore Verbinski | Action;Adventure;Fantasy | 0 | 0 | 0 | 0 | 1 |
Action;Adventure;Western | 0 | 0 | 0 | 0 | 1 | |
Jon Favreau | Adventure;Drama;Family;Fantasy | 0 | 0 | 0 | 1 | 0 |
Marc Forster | Action;Adventure | 0 | 0 | 0 | 1 | 0 |
Peter Jackson | Action;Adventure;Drama;Romance | 0 | 0 | 1 | 0 | 0 |
Adventure;Fantasy | 0 | 0 | 1 | 0 | 1 | |
Sam Mendes | Action;Adventure;Thriller | 0 | 0 | 0 | 2 | 0 |
Step 4: Use percentage and totals
If the basic usage doesn't satisfy your needs you can go further by using percentage and/or add totals for rows and columns. In order to achieve this you can use two parameters:
- margins : bool, default False - Add row/column margins (subtotals)
- normalize : bool, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False - Normalize by dividing all values by the sum of values.
Example usage of it would be:
pd.crosstab(df2['director_name'], df2['country'], margins=True, normalize='index')
Which will result in:
country | Australia | Canada | New Zealand | UK | USA |
---|---|---|---|---|---|
director_name | |||||
Baz Luhrmann | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Brett Ratner | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
David Yates | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
Gore Verbinski | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
Jon Favreau | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
Marc Forster | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
Peter Jackson | 0.000000 | 0.000000 | 0.666667 | 0.000000 | 0.333333 |
Sam Mendes | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
All | 0.083333 | 0.083333 | 0.166667 | 0.416667 | 0.250000 |
Note: that combination of both will result in totals only for rows!
Step 5: Use values from another column and aggregation function
The final step is to use values from a different column and aggregation function like sum or average. How to achieve this is visible from the example below:
import numpy as np
pd.crosstab(df2['director_name'], df2['country'], values=df2.imdb_score, aggfunc=np.sum)
This will result in:
country | Australia | Canada | New Zealand | UK | USA |
---|---|---|---|---|---|
director_name | |||||
Baz Luhrmann | 7.3 | NaN | NaN | NaN | NaN |
Brett Ratner | NaN | 6.8 | NaN | NaN | NaN |
David Yates | NaN | NaN | NaN | 7.5 | NaN |
Gore Verbinski | NaN | NaN | NaN | NaN | 6.9 |
Jon Favreau | NaN | NaN | NaN | 7.8 | NaN |
Marc Forster | NaN | NaN | NaN | 6.7 | NaN |
Peter Jackson | NaN | NaN | 7.35 | NaN | 7.9 |
Sam Mendes | NaN | NaN | NaN | 7.3 | NaN |
Crosstab can be simulated with groupby
If you don't like the idea of using crosstab then you can use combination of groupby and count (or other functions) to achieve similar result. The example below show how to simulate the basic usage:
cols = ['director_name', 'country']
df2.groupby(cols)[cols].count()
Personally I prefer to use crosstab because it's easier to work with and give much more options. Also show my intention what is the final goal of my code.
I recommend experimenting and trying different examples with crosstab and finally when this powerful technique is mastered to be used.