Understanding Why drop_duplicates() Is Not Working
Solution 1:
In this case the problem was that one of your columns held mixed types. A common way to investigate your data is to export it, e.g. with to_dict():
df.to_dict()
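As a minimal sketch of what that export can reveal, assume a small hypothetical frame (not your actual data) where column a holds both a string and an integer. In the printed dictionary the quotes make the difference visible:

```python
import pandas as pd

# Hypothetical frame for illustration: column "a" mixes a str and an int.
df = pd.DataFrame({"a": ["3", 3], "b": ["d382", "d382"]})

exported = df.to_dict()
print(exported)
# In the output, '3' (quoted) is a str while 3 (unquoted) is an int.
```

The two values look identical when the frame is printed normally, which is why the mix is easy to miss.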
Also consider this example:
import pandas as pd
df1 = pd.DataFrame({
    'a': [3, 3],
    'b': ["d382", "d382"]
})
df2 = pd.DataFrame({
    'a': ['3', 3],
    'b': ["d382", "d382"]
})
df3 = pd.DataFrame({
    'a': ['3', '3'],
    'b': ["d382", "d382"]
})
print(df1.dtypes) # <-- Use dtypes to reveal what data types your columns hold
print(df2.dtypes)
print(df3.dtypes)
Returns:
df1:
a     int64
b    object
dtype: object
df2:
a    object
b    object
dtype: object
df3:
a    object
b    object
dtype: object
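To connect this to drop_duplicates(), here is a small sketch that rebuilds the three frames and counts how many rows survive deduplication. The rows_left name is only for illustration. The point is that the string '3' and the integer 3 are not equal, so the two rows of df2 are not treated as duplicates:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [3, 3], "b": ["d382", "d382"]})
df2 = pd.DataFrame({"a": ["3", 3], "b": ["d382", "d382"]})
df3 = pd.DataFrame({"a": ["3", "3"], "b": ["d382", "d382"]})

# Count the rows that survive drop_duplicates() for each frame.
rows_left = {
    name: len(frame.drop_duplicates())
    for name, frame in [("df1", df1), ("df2", df2), ("df3", df3)]
}
print(rows_left)  # {'df1': 1, 'df2': 2, 'df3': 1}
```

Only df2 keeps both rows, even though it prints exactly like df3.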
Further exploring: In pandas, the object dtype can hold values of different Python types. That can create a tricky situation where a single column mixes integers, strings, lists, class instances... you name it.
Let us now select only the object columns and use applymap(type) to find out the type in each cell. Looking at the above examples, the erroneous dataframe is df2, whose column a holds objects of different types.
print(df1.select_dtypes(include=['object']).applymap(type))
print(df2.select_dtypes(include=['object']).applymap(type))
print(df3.select_dtypes(include=['object']).applymap(type))
               b
0  <class 'str'>
1  <class 'str'>
               a              b
0  <class 'str'>  <class 'str'>  # <--- look at column a
1  <class 'int'>  <class 'str'>  # <--- it has mixed types
               a              b
0  <class 'str'>  <class 'str'>
1  <class 'str'>  <class 'str'>
And finally, let us create a function that goes through all object columns and checks whether each one is consistent. The check is the size of the set of types found in each column: in a "correct" column all elements are of the same type, so the set has exactly one member.
def check_obj_columns(dfx):
    tdf = dfx.select_dtypes(include=['object']).applymap(type)
    for col in tdf:
        if len(set(tdf[col].values)) > 1:
            print("Column {} has mixed object types.".format(col))
check_obj_columns(df1) # Returns nothing
check_obj_columns(df2) # Returns: Column a has mixed object types.
check_obj_columns(df3) # Returns nothing
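As far as I know, recent pandas releases (2.1 and later) deprecate DataFrame.applymap in favor of DataFrame.map, so the function above may emit a warning there. A possible variant, sketched here under the hypothetical name mixed_type_columns, uses Series.map(type) per column instead and returns the column names rather than printing them:

```python
import pandas as pd

def mixed_type_columns(frame):
    """Return names of object columns whose cells hold more than one Python type."""
    mixed = []
    for col in frame.select_dtypes(include=["object"]).columns:
        # Series.map(type) yields the Python type of every cell in the column.
        if frame[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

df2 = pd.DataFrame({"a": ["3", 3], "b": ["d382", "d382"]})
print(mixed_type_columns(df2))  # ['a']
```

Keep in mind that a missing value (NaN) is a float, so a string column with gaps would also be reported as mixed by this check.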
This means that df2 has an object column, a, with mixed types.
In your case:
TransID     object
rev        float64
offer       object
qs          object  # <-- this column shows up as object if it holds mixed types
lt          object
chan        object
dtype: object
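Once the mixed column is identified, one possible fix is to convert it to a single type before deduplicating. The sketch below uses df2 from the example rather than your real data, and it assumes every value in the column is meant to be numeric. If the values are really labels, converting with astype(str) would be the analogous step:

```python
import pandas as pd

df2 = pd.DataFrame({"a": ["3", 3], "b": ["d382", "d382"]})

# Assumption: every value in "a" is meant to be a number.
# errors="coerce" turns anything unparseable into NaN instead of raising.
fixed = df2.assign(a=pd.to_numeric(df2["a"], errors="coerce"))

deduped = fixed.drop_duplicates()
print(fixed.dtypes)
print(len(deduped))  # 1
```

After the conversion both rows hold the same integer, so drop_duplicates() collapses them into one.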