Pandas compare columns in two DataFrames
This is a short article on comparing one or more columns of two DataFrames with Pandas and Python. It covers 7 different examples and one typical error, showing a range of problems and their solutions.
A Jupyter Notebook with all examples can be found here: Pandas_compare_columns_in_two_Dataframes.ipynb
1. DataFrames data
Let's use these two small tables as our data. They are stored as CSV files but separated with spaces (data that needs checking often arrives in a strange or inconvenient format):
file1
name | type | value |
---|---|---|
Mike | a+ | 98 |
Jery | a- | 144 |
Tomy | b | 108 |
file2
type | low | high |
---|---|---|
a+ | 78 | 97 |
a- | 108 | 143 |
b | 108 | 150 |
0 | 143 | 108 |
Note: in some of the examples file2 will have 3 rows and in others 4, in order to demonstrate an important difference. We will start with 3 rows and later on append the fourth row (shown last above) to the DataFrame.
2. Read data into DataFrames
Usually this is the easiest step when working with Pandas. In this example the data is read from two text files separated by spaces, which is the reason for using sep=r"\s+" (for comma-separated files you can simply omit the separator):
import pandas as pd
# the raw string r"\s+" avoids the invalid escape sequence warning for \s
df1 = pd.read_csv('~/file1.csv', sep=r"\s+")
df2 = pd.read_csv('~/file2.csv', sep=r"\s+")
Now the data is loaded into two separate DataFrames which we are going to compare. The read_csv method has many options, but its default behavior is to use the first row as the column names and to create an automatic numeric index. This matters because if the indexes differ between the DataFrames, the comparison is not possible and raises the error shown below.
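If you want to verify that the indexes really line up before comparing, a quick check like the one below can help. This is a minimal sketch that builds the two example tables inline (instead of reading the files), so it can be run on its own:
import pandas as pd
# the same data as file1 and file2, created inline
df1 = pd.DataFrame({'name': ['Mike', 'Jery', 'Tomy'],
                    'type': ['a+', 'a-', 'b'],
                    'value': [98, 144, 108]})
df2 = pd.DataFrame({'type': ['a+', 'a-', 'b'],
                    'low': [78, 108, 108],
                    'high': [97, 143, 150]})
# column vs column comparison only works when the indexes are identical
print(df1.index.equals(df2.index))  # True - both have the default RangeIndex 0, 1, 2
print(len(df1) == len(df2))         # True - same number of rows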
3. Compare DataFrames with the same number of rows
Pandas offers several different ways to compare DataFrames, and the right one depends heavily on the data being compared. The next examples demonstrate how to do this.
In this section we assume that the DataFrames have the same number of rows and that their indexes match (they are identical and sorted). Otherwise the following error is raised:
ValueError: Can only compare identically-labeled Series objects
This is a very frequent error when working with Pandas and it can be confusing in some situations.
Possible solutions for this error (applied to one or both DataFrames), with a short sketch reproducing and fixing it right after the list:
- df1.reset_index(inplace=True) - recreate the index
- df1.sort_index(inplace=True) - sort the index if it was changed
- df1.set_index([pd.Index([0, 1, 2])], inplace=True) - set a completely new index
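The sketch below is a minimal illustration (not part of the original notebook): it deliberately misaligns the index of a small copy of the example data to provoke the error, then repairs it with reset_index.
import pandas as pd
df1 = pd.DataFrame({'type': ['a+', 'a-', 'b'], 'value': [98, 144, 108]})
df2 = pd.DataFrame({'type': ['a+', 'a-', 'b'], 'low': [78, 108, 108]})
# give df2 a different index on purpose to provoke the error
df2.index = [10, 11, 12]
try:
    df1.type == df2.type
except ValueError as err:
    print(err)  # Can only compare identically-labeled Series objects
# fix: recreate a clean 0..n-1 index, after which the comparison works
df2.reset_index(drop=True, inplace=True)
print(df1.type == df2.type)  # True for every row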
Check if two string columns from different DataFrames are equal
If the DataFrames have exactly the same index, they can be compared by using np.where. This checks whether the values of a column from the first DataFrame exactly match the values of a column in the second:
import numpy as np
df1['low_value'] = np.where(df1.type == df2.type, 'True', 'False')
result:
| | name | type | value | low_value |
|------|------|-------|-----|-------|
| 0 | Mike | a+ | 98 | True |
| 1 | Jery | a- | 144 | True |
| 2 | Tomy | b | 108 | True |
Compare two numeric columns from different DataFrames
Similar behavior applies to numeric columns. In this case we can use more operators: greater than, greater than or equal, less than, etc. (they can also be used with strings, but may behave unexpectedly):
import numpy as np
df1['low_value'] = np.where(df1.value <= df2.low, 'True', 'False')
result:
| | name | type | value | low_value |
|---|---|---|---|---|
| 0 | Mike | a+ | 98 | False |
| 1 | Jery | a- | 144 | False |
| 2 | Tomy | b | 108 | True |
In this case we can see that only the last row matches. Here is how the np.where comparison works in Pandas (the sketch after this list shows the intermediate result):
1. Go row by row - for example row 0
2. Evaluate the condition for the selected values: df1.value <= df2.low - e.g. 98 <= 97
3. Return the result as a Series of Boolean values
4. For this data the result is False, False, True
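To see the intermediate Boolean Series that np.where receives, you can evaluate the condition on its own. A minimal sketch with the example values (built inline for convenience):
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'name': ['Mike', 'Jery', 'Tomy'], 'value': [98, 144, 108]})
df2 = pd.DataFrame({'low': [78, 108, 108], 'high': [97, 143, 150]})
# the condition alone is a Boolean Series - one element per row
mask = df1.value <= df2.low
print(mask.tolist())  # [False, False, True]
# np.where then maps the Booleans to the two provided labels
print(np.where(mask, 'True', 'False'))  # ['False' 'False' 'True']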
Compare one column from the first DataFrame against two from the second
Comparing more than one column is a frequent operation, and NumPy/Pandas make it easy and intuitive. All you need to remember is the syntax for combining conditions - (condition1) & (condition2) | (condition3):
import numpy as np
df1['low_high_value'] = np.where((df1.value >= df2.low) & (df1.value <= df2.high), 'True', 'False')
result:
| | name | type | value | low_high_value |
|---|---|---|---|---|
| 0 | Mike | a+ | 98 | False |
| 1 | Jery | a- | 144 | False |
| 2 | Tomy | b | 108 | True |
The code above simulates the Pandas method between, which is demonstrated below.
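For reference, here is a hedged one-liner with Series.between that should produce the same labels as the np.where version above (assuming df1 and df2 loaded as before; this is a sketch, not the notebook's original code):
import numpy as np
# between(low, high) includes both ends by default, matching (value >= low) & (value <= high)
df1['low_high_value'] = np.where(df1.value.between(df2.low, df2.high), 'True', 'False')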
Another example of using complex conditions:
df1['enh2'] = pd.Series((df2.type.isin(df1.type)) & (df1.value != df2.low) | (df1.value + 1 == df2.high))
output:
True, True, False
Compare two columns from the first DataFrame against two from the second
This example shows how to compare 2 vs 2 columns. In other words, we want to ensure that two columns have identical values and only then compare the 3rd and 4th columns (again, the indexes should match!):
import numpy as np
df1['low_high_value'] = np.where((df1.type == df2.type) & (df1.value <= df2.high), 'True', 'False')
result:
| | name | type | value | low_high_value |
|---|---|---|---|---|
| 0 | Mike | a+ | 98 | False |
| 1 | Jery | a- | 144 | False |
| 2 | Tomy | b | 108 | True |
Compare columns of 2 DataFrames without np.where
So far we have demonstrated examples of using NumPy's where method. Pandas offers other ways of doing the comparison. For example, say you want to match rows on df1.columnA vs df2.columnB but compare df1.columnC against df2.columnD. Using only Pandas this can be done in two ways - the first one is to compute the result as a separate object and join it to the original DataFrame:
# wrap the Boolean result in a DataFrame so it can be joined (the combined Series has no name)
# inclusive='both' is the current spelling of the older inclusive=True
df3 = pd.DataFrame((df2.type.isin(df1.type)) & (df1.value.between(df2.low, df2.high, inclusive='both')))
df1.join(df3)
The output is shown below:
| | name | type | value | 0 |
|---|---|---|---|---|
| 0 | Mike | a+ | 98 | False |
| 1 | Jery | a- | 144 | False |
| 2 | Tomy | b | 108 | True |
Compare columns of two DataFrames and create Pandas Series
It's also possible to assign directly to the original DataFrame and create a new column - named 'enh1' in this case. For this purpose the result of the conditions is passed to the pd.Series constructor:
df1['enh1'] = pd.Series((df2.type.isin(df1.type)) & (df1.value >= df2.low) & (df1.value <= df2.high))
result:
| | name | type | value | enh1 |
|---|---|---|---|---|
| 0 | Mike | a+ | 98 | False |
| 1 | Jery | a- | 144 | False |
| 2 | Tomy | b | 108 | True |
And we get exactly the same result as before.
4. Compare DataFrames with a different number of rows
If you try the above examples on DataFrames with a different number of rows (not equally sized), you will get an error:
ValueError: Can only compare identically-labeled Series objects
This is one of the typical Pandas errors. It can have several different causes; keep in mind that one of them is the comparison of objects that are not equally sized, as the short sketch below demonstrates.
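A tiny sketch of that failure mode, using plain Series of different lengths (the values are taken from the example columns):
import pandas as pd
s1 = pd.Series([98, 144, 108])       # 3 rows, like df1.value
s2 = pd.Series([78, 108, 108, 143])  # 4 rows, like df2.low after adding the extra row
try:
    s1 <= s2
except ValueError as err:
    print(err)  # Can only compare identically-labeled Series objects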
So first let's add one more row to the second DataFrame:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to add the extra row
df2 = pd.concat([df2, pd.DataFrame([{'type': '0', 'low': 143, 'high': 108}])], ignore_index=True)
Now let's merge the DataFrames into a single one based on the column type.
Note 1: The merge can also be done on the index or on differently named columns.
Note 2: The merge can be inner - returning only matching rows - or outer - returning all rows, even those without a match; where there is no match, NaN values are returned. (A sketch of both variants is shown at the end of this section.)
merged = df1.merge(df2,how='outer',left_on=['type'],right_on=["type"])
The resulting DataFrame is:
| | name | type | value | low | high |
|---|---|---|---|---|---|
| 0 | Mike | a+ | 98.0 | 78 | 97 |
| 1 | Jery | a- | 144.0 | 108 | 143 |
| 2 | Tomy | b | 108.0 | 108 | 150 |
| 3 | NaN | 0 | NaN | 143 | 108 |
Finally, let's compare the data:
merged[(merged.value >= merged.low) & (merged.value <= merged.high)]
result:
| | name | type | value | low | high |
|---|---|---|---|---|---|
| 2 | Tomy | b | 108.0 | 108 | 150 |
Again the same result, but this time the comparison works fine for a different number of rows!
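To round off Note 1 and Note 2 from above: the merge also works when the key columns are named differently, and an inner merge simply drops the non-matching row. A minimal sketch (the renamed columns type_a and type_b are hypothetical, assuming df1 and df2 as above):
# Note 1 variant: the key column has a different name in each frame
merged_diff = df1.rename(columns={'type': 'type_a'}).merge(
    df2.rename(columns={'type': 'type_b'}),
    how='outer', left_on='type_a', right_on='type_b')
# Note 2 variant: an inner merge keeps only the matching rows, so no NaN row for type '0'
merged_inner = df1.merge(df2, how='inner', on='type')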
Tips for comparing DataFrames
- be sure that the data types of the compared columns match - str vs int will not work as expected even if the values look identical (see the sketch after this list)
- clean the data whenever possible
- for huge DataFrames - split them into smaller chunks and finally concatenate them with
result = pd.concat([df1, df2, df3])
- test a smaller subset of your data before working with the whole set, in order to ensure valid results
- in case of inconsistent results, try a different approach or isolate the non-working data and examine it
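Regarding the first tip, a minimal sketch of a dtype mismatch and the astype fix (the values are only illustrative):
import pandas as pd
a = pd.Series(['98', '144', '108'])  # strings that only look like numbers
b = pd.Series([97, 143, 150])
# comparing str vs int does not work as expected, so align the dtypes first
print((a.astype(int) <= b).tolist())  # [False, False, True]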