Pandas Affects Results Of Rapidfuzz Match?
I am hitting a wall with this. Rapidfuzz gives a different string similarity score when I run it inside a pandas DataFrame than when I run it on its own. Why the results for
Solution 1:
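The difference comes from passing the entire column to fuzz. Wrapping a column in str() produces the text representation of the whole Series (index and dtype line included), not the string stored in the cell. If you instead apply fuzz to the values of an individual row, you get the same result: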
test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.at[0,'Processed Client Address']), str(test_anui.at[0,'Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))
Alternatively, using .loc:
test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.loc[0,'Processed Client Address']), str(test_anui.loc[0,'Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))
The output in the dataframe is:
          Processed Client Name         Processed Aruvio Name  \
0  anhui jinhan clothing co ltd  anhui jinhan clothing co ltd

                            Processed Client Address  \
0  high new technology development zones huainan ...

        Processed Aruvio Address  Name Similarity Address Similarity  \
0  industrial park of funan city        89.285714                NaN

   Address Similarity 2
0             28.099174
and the output of fuzz.token_sort_ratio(address_a, address_b) is 28.099173553719012.
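To see why passing a whole column changes the score, here is a minimal sketch comparing str() on a column with str() on a single cell. The frame `df` and the column name `addr` are hypothetical and only serve as an illustration; they are not taken from the question.

```python
import pandas as pd

# Hypothetical one-row frame, used only for illustration.
df = pd.DataFrame({"addr": ["industrial park of funan city"]})

# str() on a column returns the printed form of the whole Series,
# which includes the index label and a trailing Name/dtype line.
as_column = str(df["addr"])

# str() on a single cell returns just the address string.
as_cell = str(df.at[0, "addr"])

print(repr(as_column))
print(repr(as_cell))
```

Any similarity function fed `as_column` compares that longer printed text rather than the address itself, which is why the score differs from the standalone call.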
In other words, you need to specify which row you want to extract the strings from. I assume your DataFrame has several rows, which means you'll have to do this for each row:
# Iterate over the index labels and write each score to its own row
for i in test_anui.index:
    test_anui.loc[i, 'Address Similarity 2'] = fuzz.token_sort_ratio(
        str(test_anui.loc[i, 'Processed Client Address']),
        str(test_anui.loc[i, 'Processed Aruvio Address']))