
Searching One Python Dataframe / Dictionary For Fuzzy Matches In Another Dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns): df1: PRODUCT_ID PRODUCT_DESCRIPT

Solution 1:

Using fuzz.ratio as my similarity metric (higher means a closer match), calculate the score matrix like this:

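import pandas as pd
from fuzzywuzzy import fuzz

# Score matrix: one row per row of df, one column per row of df2
df3 = pd.DataFrame(index=df.index, columns=df2.index, dtype=float)

for i in df3.index:
    for j in df3.columns:
        # .at replaces get_value/set_value, which were removed in pandas 1.0
        vi = df.at[i, 'PRODUCT_DESCRIPTION']
        vj = df2.at[j, 'PROD_DESCRIPTION']
        df3.at[i, j] = fuzz.ratio(vi, vj)

print(df3)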

(Output: a table of fuzz.ratio scores, one row per row of df and one column per row of df2.)

Set a threshold for an acceptable score; I set 50. Then find, for every row, the index value (in df2) that has the maximum score.

threshold = df3.max(axis=1) > 50
idxmax = df3.idxmax(axis=1)

Make assignments

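import numpy as np

# Take the best match from df2 where the score clears the threshold, NaN otherwise
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df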
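To see the whole approach run end to end, here is a minimal sketch on made-up data. The two small frames and their values are hypothetical, and `ratio` is a standard-library stand-in for fuzz.ratio (built on difflib.SequenceMatcher) so the snippet runs without fuzzywuzzy installed; swap in fuzz.ratio for real use.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): a 0-100 similarity score.
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical toy data; the real frames come from the question.
df = pd.DataFrame({
    'PRODUCT_ID': [1, 2, 3],
    'PRODUCT_DESCRIPTION': ['red apple', 'green pear', 'zzzz'],
})
df2 = pd.DataFrame({
    'PROD_ID': ['A', 'B'],
    'PROD_DESCRIPTION': ['red apples', 'green pears'],
})

# Score every description in df against every description in df2.
scores = pd.DataFrame(
    [[ratio(a, b) for b in df2['PROD_DESCRIPTION']]
     for a in df['PRODUCT_DESCRIPTION']],
    index=df.index, columns=df2.index, dtype=float,
)

above = scores.max(axis=1) > 50   # rows whose best score clears the threshold
best = scores.idxmax(axis=1)      # df2 index label of the best match per row

# Copy the best match across, or NaN when nothing scored above 50.
df['PROD_ID'] = np.where(above, df2.loc[best, 'PROD_ID'].to_numpy(), np.nan)
df['PROD_DESCRIPTION'] = np.where(
    above, df2.loc[best, 'PROD_DESCRIPTION'].to_numpy(), np.nan)
print(df)
```

The third row has no character in common with either candidate, so it falls below the threshold and is left unmatched. Note that the nested scoring is still one comparison per pair of rows, so on 50,000 rows it will be slow.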


Solution 2:

You should be able to iterate over both dataframes and populate either a dict or a third dataframe with your desired information:

d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        # append to each list; plain assignment would overwrite it
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...
df3 = pd.DataFrame.from_dict(d)
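One way the elided part of that loop might be filled in is sketched below. The toy frames are hypothetical, `ratio` is a difflib-based stand-in for fuzz.ratio so the snippet runs without fuzzywuzzy, and the final step that keeps the best candidate per df1 row is an addition for illustration, not part of the answer above.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): a 0-100 similarity score.
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical toy data.
df1 = pd.DataFrame({
    'PRODUCT_ID': [1, 2],
    'PRODUCT_DESCRIPTION': ['red apple', 'green pear'],
})
df2 = pd.DataFrame({
    'PROD_ID': ['A', 'B'],
    'PROD_DESCRIPTION': ['red apples', 'green pears'],
})

d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': [],
}
# One output row per (df1 row, df2 row) pair.
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        d['df1_prod_desc'].append(df1_row['PRODUCT_DESCRIPTION'])
        d['df2_id'].append(df2_row['PROD_ID'])
        d['df2_prod_desc'].append(df2_row['PROD_DESCRIPTION'])
        d['fuzzywuzzy_sim'].append(
            ratio(df1_row['PRODUCT_DESCRIPTION'], df2_row['PROD_DESCRIPTION']))

df3 = pd.DataFrame.from_dict(d)

# Keep only the highest-scoring candidate for each df1 row.
best = df3.loc[df3.groupby('df1_id')['fuzzywuzzy_sim'].idxmax()]
print(best)
```

Every list in `d` gets exactly one append per pair, which keeps the columns the same length when the dict is turned into a dataframe.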

Solution 3:

I don't have enough reputation to comment on the answer from @piRSquared, hence this answer. Running the code from that answer, I hit three errors:

  • Defining 'vi' and 'vj' failed with an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). It worked once I added a leading underscore, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION')
  • 'set_value' failed the same way, and the same fix worked there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj))
  • Generating idxmax raised another error (TypeError: reduction operation 'argmax' not allowed for this dtype) because the contents of df3 (the fuzzy ratios) were of type 'object'. I converted all of them to numeric just before defining threshold and it worked, e.g. df3 = df3.apply(pd.to_numeric)
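The dtype fix in the last point can be checked in isolation. This is a small sketch with hypothetical scores, stored as generic Python objects to mimic a df3 that was created empty and filled cell by cell.

```python
import pandas as pd

# Hypothetical scores held in object-dtype columns.
df3 = pd.DataFrame([[63, 15, 24], [26, 84, 19]], dtype=object)
print(df3.dtypes)

# Convert every column to a numeric dtype so reductions like idxmax work.
df3 = df3.apply(pd.to_numeric)
print(df3.dtypes)

idxmax = df3.idxmax(axis=1)   # column label of the highest score in each row
print(idxmax)
```

Creating df3 with a numeric dtype from the start avoids the conversion step altogether.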

A million thanks to @piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.
