Pandas: Check If Value of Column Is Contained in Another Column in the Same Row

In this guide, I'll show you how to find if value in one string or list column is contained in another string column in the same row. In the article are present 3 different ways to achieve the same result.

These examples can be used to find a relationship between two columns in a DataFrame.

Dataset: IMDB 5000 Movie Dataset

Step 1: Check If String Column Contains Substring of Another with Function

The first solution is the easiest one to understand and work it. It is easy for customization and maintenance.

To start, we will define a function which will be used to perform the check. Then the function will be invoked by using apply:

def find_value_column(row):
    return row.country in row.movie_title

df[df.apply(find_value_column, axis=1)][['movie_title', 'country']]

which will result in:

movie_title country
196 Australia Australia
2504 McFarland, USA USA

What will happen if there are NaN values in one of the columns?

def find_value_column(row):
    return row.movie_title.lower().strip() in row.plot_keywords

df[df.apply(find_value_column, axis=1)][['movie_title', 'plot_keywords']].head(10)

Then error will be raised:

TypeError: ("argument of type 'float' is not iterable", 'occurred at index 4')

There is easy solution for this error - convert the column NaN values to empty list values thus:

for row in df.loc[df.plot_keywords.isnull(), 'plot_keywords'].index:
    df.at[row, 'plot_keywords'] = []

now we get proper results like:

python_online_editor_2020

Step 2: Check If Column Contains Another Column with Lambda

The second solution is similar to the first - in terms of performance and how it is working - one but this time we are going to use lambda. The advantage of this way is - shortness:

df[df.apply(lambda x: x.country in x.movie_title, axis=1)][['movie_title', 'country']]
movie_title country
196 Australia Australia
2504 McFarland, USA USA

A possible disadvantage of this method is the need to know how apply and lambda works and how to deal with errors if any.

For example this piece of code similar but will result in error like:

df.apply(lambda row: df.country in df.movie_title, axis=1)

output:

TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')

It may be obvious for some people but a novice will have hard time to understand what is going on.

Step 3: Fastest Way to Check If One Column Contains Another

This solution is the fastest one. It is short and easy to understand. It includes zip on the selected data. In this case data can be used from two different DataFrames.

df[[x[0] in x[1] for x in zip(df['country'], df['movie_title'])]][['movie_title', 'country']]
movie_title country
196 Australia Australia
2504 McFarland, USA USA

again if the column contains NaN values they should be filled with default values like:

df['country'].fillna('Uknown', inplace=True)

Step 4: For Loop and df.iterrows() Version

The final solution is the most simple one and it's suitable for beginners. Iterates over the rows one by one and perform the check. This solution is the slowest one:

for i, row in df.iterrows():
    if row.country in row.movie_title:
        print(row.country, row.movie_title)

result:

Australia Australia 
USA McFarland, USA 

Bonus Step: Check If List Column Contains Substring of Another with Function

Now lets assume that we would like to check if any value from column plot_keywords:

  • avatar|future|marine|native|paraplegic
  • bomb|espionage|sequel|spy|terrorist

is part of the movie title.

In this case we can:

  1. split the string column into a list
  2. convert NaN values to empty list
  3. perform search for each word in the list against the title
df['keywords'] = df.plot_keywords.str.split('|')

Now we have two different options:

Skip the conversion of NaN but check them in the function:

def find_value_column(row):
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return keyword in row.movie_title.lower()
    else:
        return False

df[df.apply(find_value_column, axis=1)][['movie_title', 'keywords']].head()

or

Convert all the NaN values:

df['keywords'] = df['keywords'].apply(lambda d: d if isinstance(d, list) else [])

def find_value_column(row):
    for keyword in row['keywords']:
        return keyword in row.movie_title.lower()
    return False

df[df.apply(find_value_column, axis=1)][['movie_title', 'keywords']].head()

results in:

pandas_search_list_column_in_another

Performance check

Below you can find results of all solutions and compare their speed:

  • 10 loops, best of 3: 152 ms per loop
  • 10 loops, best of 3: 153 ms per loop
  • 1000 loops, best of 3: 1.69 ms per loop
  • 1 loop, best of 3: 580 ms per loop

So the one in step 3 - zip one - is the fastest and outperform the others by magnitude. In my everyday work I prefer to use 2 and 3(for high volume data) in most cases and only in some case 1 - when there is complex logic to be implemented.

Share Tweet Send
0 Comments
Loading...