Pandas compare columns in two DataFrames

This is a short article for comparison between one or more columns of two DataFrames with Pandas and Python. Article covers 7 different examples and one typical error - trying to show many different problems and their solutions.

A Jupyter Notebook with all examples can be found: Pandas_compare_columns_in_two_Dataframes.ipynb

1. DataFrames data

Lets have this two small tables which represents our data. They are stored as csv files but separated with space ( often data that we need to check come in strange or bad format):

file1

name type value
Mike a+ 98
Jery a- 144
Tomy b 108

file2

type low high
a+ 78 97
a- 108 143
b 108 150
------ ----- ------
0 143 108

Note: in some of the examples file2 will have 3 rows and in some will be with 4 in order to demonstrate important difference. First we will start with 3 rows and later one we will append one row to the DataFrame.

1. Read data into DataFrames

Usually this is the easiest step when you are working with Pandas. In this example data is read from two text files separated with spaces( this is the reason for using - sep="\s+" ; in case of commas you can remove the separator):

import pandas as pd
df1 = pd.read_csv('~/file1.csv',sep="\s+")
df2 = pd.read_csv('~/file2.csv',sep="\s+")

Now data is loaded into two separate DataFrames which we are going to compare. Method read_csv has many options but default behavior is use first row as DataFrame column name and create automatic numeric index. This is important because if the index differ between the DataFrames comparison is not possible due to error shown below.

2. Compare the DataFrames with same number of rows

Pandas offers several different ways for comparison of DataFrames which highly depends on data which will be compared. In next examples we will demonstrate how to do this.

In this section we will assume that our DataFrames are with equal size of rows and also that indexes matched each other (they are the same and sorted). Otherwise error will be raised:

ValueError: Can only compare identically-labeled Series objects

This is very frequent Python error which can be confusing in some situations.

Possible solutions for this errors are(for one or both DataFrames):

  • df1.reset_index(inplace=True) - recreate the index again
  • df1.sort_index(inplace=True) - sort the index if it was changed
  • df1.set_index([pd.Index([0, 1, 2])], inplace=True) - set completely new index

Check are two string columns equal from different DataFrames

If DataFrames have exactly the same index then they can be compared by using np.where. This will check whether values from a column from the first DataFrame match exactly value in the column of the second:

import numpy as np
df1['low_value'] = np.where(df1.type == df2.type, 'True', 'False')

result:

|      | name | type | value | low_value    |
|------|------|-------|-----|-------|
| 0    | Mike | a+    | 98  | True |
| 1    | Jery | a-    | 144 | True |
| 2    | Tomy | b     | 108 | True  |

Compare two numeric columns from different DataFrames

Similar behavior can be applied for numeric columns. In this case we can use more operators like: greater, greater and equal, lesser etc (they can be used with strings but might have strange behavior sometimes):

import numpy as np
df1['low_value'] = np.where(df1.value <= df2.low, 'True', 'False')

result:

name type value low_value
0 Mike a+ 98 False
1 Jery a- 144 False
2 Tomy b 108 True

In this case we can see that only last row match completely. How np.where compare work in Pandas:

  1. Go row by row
    • for example row 0
  2. Check selected values: df1.value <= df2.low
    • check 98 <= 97
  3. Return the result as Series of Boolean values
    4. False, False, True

Compare one column from first against two from second DataFrame

Comparing more than one column is frequent operation and Numpy/Pandas make this very easy and intuitive operation. All you need to remember is the syntax for such situation - (condition1) & (condition2) | (condition3):

import numpy as np
df1['low_high_value'] = np.where((df1.value >= df2.low) & (df1.value <= df2.high), 'True', 'False')

result:

name type value low_high_value
0 Mike a+ 98 False
1 Jery a- 144 False
2 Tomy b 108 True

the code above simulate method between of Pandas which is demonstrated below.

Another example of using complex conditions:

df1['enh2'] = pd.Series((df2.type.isin(df1.type)) & (df1.value != df2.low)  | (df1.value + 1 == df2.high))

output:

True, True, False

Compare two columns from first against two from second

In this example is shown how to compare 2 vs 2 columns. In other words - we want to ensure that two columns has identical values and only then to compare 3rd and 4th column(in this case index should match again!):

import numpy as np
df1['low_high_value'] = np.where((df1.type == df2.type) & (df1.value <= df2.high), 'True', 'False')

result:

name type value low_high_value
0 Mike a+ 98 False
1 Jery a- 144 False
2 Tomy b 108 True

Compare columns of 2 DataFrames without np.where

So far we demonstrated examples of using Numpy where method. Pandas offers other ways of doing comparison. For example let say that you want to compare rows which match on df1.columnA to df2.columnB but compare df1.columnC against df2.columnD. Using only Pandas this can be done in two ways - first one is by getting data into Series and later join it to the original one:

df3 = [(df2.type.isin(df1.type)) & (df1.value.between(df2.low,df2.high,inclusive=True))]
df1.join(df3)

the output of which is shown below:

name type value 0
0 Mike a+ 98 False
1 Jery a- 144 False
2 Tomy b 108 True

Compare columns of two DataFrames and create Pandas Series

It's also possible to use direct assign operation to the original DataFrame and create new column - named 'enh1' in this case. For this purpose the result of the conditions should be passed to pd.Series constructor.

df1['enh1'] = pd.Series((df2.type.isin(df1.type)) & (df1.value >= df2.low) & (df1.value <= df2.high))

result:

name type value 0
0 Mike a+ 98 False
1 Jery a- 144 False
2 Tomy b 108 True

And we get exactly the same result as before.

3. Compare the DataFrames with different number of rows

If you try the above examples for DataFrames with different number of rows (not equally sized in rows) then you will get an error:

ValueError: Can only compare identically-labeled Series objects

This error is one of typical Pandas errors. For this error there might be different reasons so have in mind that one of the reasons is comparison of not equally sized objects.

So first let add one more row for the second DataFrame:

df2 = df2.append({'type':'0', 'low':143, 'high':108}, ignore_index=True)

Now lets merge the DataFrames into a single one based on column type.

Note 1: Merge can be done also on index or on differently named columns
Note 2: Merge can be inner - return only matching rows or outer - return all rows even those without match. In case of no match NaN values are returned.

merged = df1.merge(df2,how='outer',left_on=['type'],right_on=["type"])

resulted dataframe is:

name type value low high
0 Mike a+ 98.0 78 97
1 Jery a- 144.0 108 143
2 Tomy b 108.0 108 150
3 NaN 0 NaN 143 108

Finally lets compare the data:

merged[(merged.value >= merged.low) & (merged.value <= merged.high)]

result:

name type value low high
2 Tomy b 108.0 108 150

Again the same result but this time comparison works fine for different number of rows!

Tips for comparing DataFrames

  • be sure that data match between the columns - type( string vs int will not work as expected even if visually is OK)
  • clean data always when possible
  • for huge DataFrames - split them into smaller chunks and finally concatenate them by result = pd.concat([df1, df2, df3])
  • test smaller subset of your data prior working with the whole in order to ensure valid results
  • in case of inconsistent results try different way or isolate non working data and examine it

Related Article