Python Pandas Compare Two CSV files based on a Column
This article shows the python / pandas equivalent of SQL join. You can find how to compare two CSV files based on columns and output the difference using python and pandas. The advantage of pandas is the speed, the efficiency and that most of the work will be done for you by pandas:
- reading the CSV files(or any other)
- parsing the information into tabular form
- comparing the columns
- output the final result
Previous article about pandas: Pandas how to concatenate columns
So lets have this scenario - two CSV files like:
column1
test
test1
test2
test3
test4
test6
and
column1,column2,column3
test,person,file
test1,person1,file1
test2,person2,file2
test3,person3,file3
test4,person4,file4
test5,person5,file5
Our goals is to find all rows without a match from the first file in the second based on a given column.
import pandas as pd
f1 = pd.read_csv('C:\\user\\file1.csv)
f2 = pd.read_csv('C:\\user\\file2.csv')
print(f2[~f2.column1.isin(f1.column1)])
The result of this code will be:
column1 column2 column3
5 test5 person5 file5
If you want to compare the other way around you can use:
print(f1[~f1.column1.isin(f2.column1)])
and the result will be:
column1
5 test6
Lets make some notes about the code:
Depending on your CSV file you can need to change this line. More info about read_csv:
f2 = pd.read_csv('C:\\user\\file2.csv', sep=';')
f2 = pd.read_csv('C:\\user\\file2.csv')
By default the separator for method read_csv should be ',' so if you have anything different from it like ';' then you need to specify it. Otherwise your columns will be wrongly matched.
If you receive and error like:
Data-frame Object has no Attribute
Then you can check your columns for the dataframe by:
print(f2.columns)
result:
Index(['column1', 'column2', 'column3'], dtype='object')
or checking the whole dataframes by:
print(f2)
result:
column1 column2 column3
0 test person file
1 test1 person1 file1
2 test2 person2 file2
3 test3 person3 file3
4 test4 person4 file4
5 test5 person5 file5
And finally the explanation for the final line which is doing the comparison:
print(f1[~f1.column1.isin(f2.column1)])
Some info about the functions and operators:
- pandas.isin - Return boolean DataFrame showing whether each element in the DataFrame is contained in values.
- python tilde ~ - Bitwise not, inversing boolean - False to True and True to False
- finally we return only the rows without a match.
If you want to simulate SQL join with pandas then you can try this code:
import pandas as pd
f1 = pd.read_csv('C:\\user\\file1.csv)
f2 = pd.read_csv('C:\\user\\file2.csv')
f3 = pd.concat([f2, f1[~f1.column1.isin(f2.column1)]])
print(f3)
the result will be:
column1 column2 column3
0 test person file
1 test1 person1 file1
2 test2 person2 file2
3 test3 person3 file3
4 test4 person4 file4
5 test5 person5 file5
5 test6 NaN NaN
everything from the first file plus the new ones with NaNs for the non matching columns.