
Understanding Why drop_duplicates() Is Not Working

Suppose I have a 2-row pandas dataframe that I acquired by subsetting a larger dataframe. TransID rev offer qs lt chan 212 RTSO118981094

Solution 1:

In this case the problem was mixed types within a column. A common way to investigate your data is to export it to plain Python objects, e.g. with to_dict():

df.to_dict()
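As a minimal sketch of why this helps, consider a hypothetical two-row frame (not the asker's data) whose column a mixes a string with an integer. Printed as a table, both rows look the same; the dictionary keeps the quotes, so the difference shows:

```python
import pandas as pd

# Hypothetical frame: column 'a' mixes the string '3' with the integer 3.
df = pd.DataFrame({'a': ['3', 3], 'b': ['d382', 'd382']})

# to_dict() returns the underlying Python objects, so '3' and 3 are distinguishable.
as_dict = df.to_dict()
print(as_dict)  # {'a': {0: '3', 1: 3}, 'b': {0: 'd382', 1: 'd382'}}
```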

Also consider this example:

import pandas as pd

df1 = pd.DataFrame({
    'a': [3,3],
    'b': ["d382","d382"]
})

df2 = pd.DataFrame({
    'a': ['3',3],
    'b': ["d382","d382"]
})

df3 = pd.DataFrame({
    'a': ['3','3'],
    'b': ["d382","d382"]
})

print(df1.dtypes) # <-- Use dtypes to reveal what data types your columns hold
print(df2.dtypes)
print(df3.dtypes)

Returns:

df1              df2              df3
a     int64      a    object      a    object
b    object      b    object      b    object
dtype: object    dtype: object    dtype: object

Exploring further: in pandas, an object column can hold values of different Python types. That creates a tricky situation where a single column mixes integers, strings, lists, class instances... you name it.

Let us now select only the object columns and use applymap(type) to find the type of each cell. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.) Of the examples above, the erroneous dataframe is df2, whose column a holds objects of different types.

print(df1.select_dtypes(include=['object']).applymap(type))
print(df2.select_dtypes(include=['object']).applymap(type))
print(df3.select_dtypes(include=['object']).applymap(type))

               b
0  <class 'str'>
1  <class 'str'>
               a              b
0  <class 'str'>  <class 'str'>       # <--- look at column a
1  <class 'int'>  <class 'str'>       # <--- it has mixed types
               a              b
0  <class 'str'>  <class 'str'>
1  <class 'str'>  <class 'str'>

Finally, let us create a function that goes through all object columns and checks whether each one is consistent. The test is the size of the set of types found in a column: in a "correct" column all elements are of the same type, so the set has exactly one member:

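def check_obj_columns(dfx):
    # Replace every cell in the object columns with its Python type
    tdf = dfx.select_dtypes(include=['object']).applymap(type)
    for col in tdf:
        # More than one distinct type means the column is mixed
        if len(set(tdf[col].values)) > 1:
            print("Column {} has mixed object types.".format(col))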

check_obj_columns(df1) # Returns nothing
check_obj_columns(df2) # Returns: Column a has mixed object types.
check_obj_columns(df3) # Returns nothing

This means that df2 has an object column, a, with mixed types.
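To connect this back to drop_duplicates(), here is a small sketch using the same df2 values. Casting the column to str is only one possible fix and is an assumption here; depending on the data, converting to a numeric type may be more appropriate:

```python
import pandas as pd

df2 = pd.DataFrame({'a': ['3', 3], 'b': ['d382', 'd382']})

# '3' (str) and 3 (int) are different values, so no row counts as a duplicate.
before = df2.drop_duplicates()
print(len(before))  # 2

# After casting column 'a' to a single type, the rows compare equal.
fixed = df2.assign(a=df2['a'].astype(str))
after = fixed.drop_duplicates()
print(len(after))  # 1
```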


In your case:

TransID     object
rev        float64
offer       object
qs          object   # <-- an object column like this one may hold mixed types
lt          object
chan        object
dtype: object
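To see which types a suspect column actually contains, a helper along the following lines could be used. It is a sketch, not part of the original answer: the name object_column_types and the sample values are hypothetical, and only the column names rev and qs are borrowed from the question. It uses Series.map rather than applymap, so it does not depend on the deprecated method:

```python
import pandas as pd

def object_column_types(df):
    """Map each object column to the set of Python type names found in it."""
    result = {}
    for col in df.select_dtypes(include=['object']).columns:
        result[col] = set(df[col].map(lambda v: type(v).__name__))
    return result

# Hypothetical values; 'qs' mixes an integer with a string.
sample = pd.DataFrame({'rev': [1.0, 1.0], 'qs': [1, '1']})
types = object_column_types(sample)
print(types)  # 'qs' maps to a set containing both 'int' and 'str'
```

Any column whose set has more than one member is a candidate for a cast before calling drop_duplicates().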
