In this article you can learn:

  • What is crosstab and how to use it?

  • How to show percentage and totals

  • Several example for advanced usage

  • Business use

  • Video Tutorial

  • Notebook

What is Pandas crosstab?

Pandas crosstab can be considered as pivot table equivalent ( from Excel or LibreOffice Calc). It shows summary as tabular representation based on several factors. The information can be presented as counts, percentage, sum, average or other statistical methods.

The official Pandas Documentation describe it as:

Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.

In this example we are going to work with movies dataset: IMDB 5000 movie dataset. By using cross-tabulation we will try to answer to several question related to this dataset:

  • How many movies does actor have per country?
  • What is the percentage per actor and per country?
  • What is the total per country and per actor?
  • Average IMDB rating per actor and per country

Steps to use Pandas crosstab

Steps 1: Import Pandas and read data

The first steps in order to use pandas cross-tabulation method is to read your data and create DataFrame object. The next lines of code show how to create DataFrame from CSV file:

import pandas as pd
df = pd.read_csv("../csv/movie_metadata.csv")

The input for cross-tabulation is categorical data. In the event that there aren't overlapping indexes an empty DataFrame will be returned.

Step 2: Select data for the crosstab

Next you need to analyze your data and select the values which best represent your problem or question. Typical benefits of using crosstab or pivot tables are:

  • Data summary
  • Data aggregation
  • Grouping
  • Quick Reports
  • Data patterns

Usually I investigate Pandas DataFrame by getting several records from it:

df.head().T

or checking column names:

df.columns

Step 3: Create cross-tabulation table

The last step is to build your crosstab. The cross-tabulation includes many different options and parameters which make it really powerful tool for data analysis.

The most simple usage of Pandas crosstab is:

pd.crosstab(df2['director_name'], df2['country']

This will result in summary table like:

country Australia Canada New Zealand UK USA
director_name
Baz Luhrmann 1 0 0 0 0
Brett Ratner 0 1 0 0 0
David Yates 0 0 0 1 0
Gore Verbinski 0 0 0 0 2
Jon Favreau 0 0 0 1 0
Marc Forster 0 0 0 1 0
Peter Jackson 0 0 2 0 1
Sam Mendes 0 0 0 2 0

Pandas crosstabe can be used also for multiple columns for rows or columns. The syntax for multiple rows can be seen below:

pd.crosstab([df2['director_name'], df2['genres']], df2['country'])

This will add one more level or will do MultiIndex for your cross-tabulation.

country Australia Canada New Zealand UK USA
director_name genres
Baz Luhrmann Drama;Romance 1 0 0 0 0
Brett Ratner Action;Adventure;Fantasy;Sci-Fi;Thriller 0 1 0 0 0
David Yates Adventure;Family;Fantasy;Mystery 0 0 0 1 0
Gore Verbinski Action;Adventure;Fantasy 0 0 0 0 1
Action;Adventure;Western 0 0 0 0 1
Jon Favreau Adventure;Drama;Family;Fantasy 0 0 0 1 0
Marc Forster Action;Adventure 0 0 0 1 0
Peter Jackson Action;Adventure;Drama;Romance 0 0 1 0 0
Adventure;Fantasy 0 0 1 0 1
Sam Mendes Action;Adventure;Thriller 0 0 0 2 0

Step 4: Use percentage and totals

If the basic usage doesn't satisfy your needs you can go further by using percentage and/or add totals for rows and columns. In order to achieve this you can use two parameters:

  • margins : bool, default False - Add row/column margins (subtotals)
  • normalize : bool, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False - Normalize by dividing all values by the sum of values.

Example usage of it would be:

pd.crosstab(df2['director_name'], df2['country'], margins=True,  normalize='index')

Which will result in:

country Australia Canada New Zealand UK USA
director_name
Baz Luhrmann 1.000000 0.000000 0.000000 0.000000 0.000000
Brett Ratner 0.000000 1.000000 0.000000 0.000000 0.000000
David Yates 0.000000 0.000000 0.000000 1.000000 0.000000
Gore Verbinski 0.000000 0.000000 0.000000 0.000000 1.000000
Jon Favreau 0.000000 0.000000 0.000000 1.000000 0.000000
Marc Forster 0.000000 0.000000 0.000000 1.000000 0.000000
Peter Jackson 0.000000 0.000000 0.666667 0.000000 0.333333
Sam Mendes 0.000000 0.000000 0.000000 1.000000 0.000000
All 0.083333 0.083333 0.166667 0.416667 0.250000

Note: that combination of both will result in totals only for rows!

Step 5: Use values from another column and aggregation function

The final step is to use values from a different column and aggregation function like sum or average. How to achieve this is visible from the example below:

import numpy as np
pd.crosstab(df2['director_name'], df2['country'], values=df2.imdb_score, aggfunc=np.sum)

This will result in:

country Australia Canada New Zealand UK USA
director_name
Baz Luhrmann 7.3 NaN NaN NaN NaN
Brett Ratner NaN 6.8 NaN NaN NaN
David Yates NaN NaN NaN 7.5 NaN
Gore Verbinski NaN NaN NaN NaN 6.9
Jon Favreau NaN NaN NaN 7.8 NaN
Marc Forster NaN NaN NaN 6.7 NaN
Peter Jackson NaN NaN 7.35 NaN 7.9
Sam Mendes NaN NaN NaN 7.3 NaN

Crosstab can be simulated with groupby

If you don't like the idea of using crosstab then you can use combination of groupby and count (or other functions) to achieve similar result. The example below show how to simulate the basic usage:

cols = ['director_name', 'country']
df2.groupby(cols)[cols].count()

Personally I prefer to use crosstab because it's easier to work with and give much more options. Also show my intention what is the final goal of my code.

I recommend experimenting and trying different examples with crosstab and finally when this powerful technique is mastered to be used.

Resources