Pandas Affects Results Of Rapidfuzz Match?

I am hitting a wall with this. Rapidfuzz delivers different results for string score similarity if I run it within a pandas dataframe and if I run it by itself? Why the results for

Solution 1:

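The discrepancy comes from passing the entire column to fuzz. Wrapping a whole column in str() gives the text representation of the entire column (index, values, and dtype), not the address stored in one row, so the score is computed on different strings. If you apply fuzz to an individual row instead, you get the same result: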

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.at[0,'Processed Client Address']), str(test_anui.at[0,'Processed Aruvio Address']))

print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

Alternatively, using .loc:

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.loc[0,'Processed Client Address']), str(test_anui.loc[0,'Processed Aruvio Address']))

print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

The output in the dataframe is:

    Processed Client Name         Processed Aruvio Name  \
0  anhui jinhan clothing co ltd  anhui jinhan clothing co ltd   

                            Processed Client Address  \
0  high new technology development zones huainan ...   

        Processed Aruvio Address  Name Similarity  Address Similarity  \
0  industrial park of funan city        89.285714                 NaN   

   Address Similarity 2  
0             28.099174  

and the output of fuzz.token_sort_ratio(address_a, address_b) is 28.099173553719012, so the two values agree.

In other words, you need to specify which row you intend to extract the strings from. I suppose your dataframe consists of several rows, which means you'll have to do this for each row and write the score back to that same row:

for i in test_anui.index:
    # Score this row only and store the result in this row's cell
    test_anui.loc[i, 'Address Similarity 2'] = fuzz.token_sort_ratio(
        str(test_anui.loc[i, 'Processed Client Address']),
        str(test_anui.loc[i, 'Processed Aruvio Address']))
