Python Pandas Compare Two CSV files based on a Column

This article shows the python / pandas equivalent of SQL join. You can find how to compare two CSV files based on columns and output the difference using python and pandas. The advantage of pandas is the speed, the efficiency and that most of the work will be done for you by pandas:

  • reading the CSV files(or any other)
  • parsing the information into tabular form
  • comparing the columns
  • output the final result

Previous article about pandas: Pandas how to concatenate columns

So lets have this scenario - two CSV files like:

column1
test
test1
test2
test3
test4
test6

and

column1,column2,column3
test,person,file
test1,person1,file1
test2,person2,file2
test3,person3,file3
test4,person4,file4
test5,person5,file5

Our goals is to find all rows without a match from the first file in the second based on a given column.

import pandas as pd

f1 = pd.read_csv('C:\\user\\file1.csv)
f2 = pd.read_csv('C:\\user\\file2.csv')

print(f2[~f2.column1.isin(f1.column1)])

The result of this code will be:

  column1  column2 column3
5   test5  person5   file5

If you want to compare the other way around you can use:

print(f1[~f1.column1.isin(f2.column1)])

and the result will be:

  column1
5   test6

Lets make some notes about the code:

Depending on your CSV file you can need to change this line. More info about read_csv:

f2 = pd.read_csv('C:\\user\\file2.csv', sep=';')
f2 = pd.read_csv('C:\\user\\file2.csv')

By default the separator for method read_csv should be ',' so if you have anything different from it like ';' then you need to specify it. Otherwise your columns will be wrongly matched.

If you receive and error like:

Data-frame Object has no Attribute

Then you can check your columns for the dataframe by:

print(f2.columns)

result:

Index(['column1', 'column2', 'column3'], dtype='object')

or checking the whole dataframes by:

print(f2)

result:

  column1  column2 column3
0    test   person    file
1   test1  person1   file1
2   test2  person2   file2
3   test3  person3   file3
4   test4  person4   file4
5   test5  person5   file5

And finally the explanation for the final line which is doing the comparison:

print(f1[~f1.column1.isin(f2.column1)])

Some info about the functions and operators:

  • pandas.isin - Return boolean DataFrame showing whether each element in the DataFrame is contained in values.
  • python tilde ~ - Bitwise not, inversing boolean - False to True and True to False
  • finally we return only the rows without a match.

If you want to simulate SQL join with pandas then you can try this code:

import pandas as pd

f1 = pd.read_csv('C:\\user\\file1.csv)
f2 = pd.read_csv('C:\\user\\file2.csv')

f3 = pd.concat([f2, f1[~f1.column1.isin(f2.column1)]])
print(f3)

the result will be:

  column1  column2 column3
0    test   person    file
1   test1  person1   file1
2   test2  person2   file2
3   test3  person3   file3
4   test4  person4   file4
5   test5  person5   file5
5   test6      NaN     NaN

everything from the first file plus the new ones with NaNs for the non matching columns.

Related Article