
Remove Duplicated Seq Name Pandas

I have a dataframe; here is an example:

   cluster seq_sp1 seq_sp2
         1   seq20   seq56
         1   seq56   seq20
         2    seq3    seq5
         3

Solution 1:

I think you need numpy.sort with drop_duplicates. This returns the rows with the two names sorted:

df[['seq_sp1','seq_sp2']] = np.sort(df[['seq_sp1','seq_sp2']], axis=1)
df = df.drop_duplicates(subset=['seq_sp1','seq_sp2'])
print (df)
   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq5    seq9
4        3    seq4    seq7
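
For anyone who wants to run the snippet above as is, here is a minimal self-contained sketch. The sample frame is reconstructed from the outputs printed in this answer, so treat it as an assumption about the original data; the names pair_cols and deduped are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the printed outputs (assumed data).
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4"],
})

pair_cols = ["seq_sp1", "seq_sp2"]

# Sort the two names within each row so (a, b) and (b, a) become identical.
df[pair_cols] = np.sort(df[pair_cols].to_numpy(), axis=1)

# Keep the first occurrence of every unordered pair.
deduped = df.drop_duplicates(subset=pair_cols)
print(deduped)
```

Note that this variant overwrites the two columns with their sorted values, which is why seq9/seq5 appears as seq5/seq9 in the result.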

Or build a mask with DataFrame.duplicated, invert it with ~ and filter by boolean indexing. The output keeps the original, unsorted values:

mask = pd.DataFrame(np.sort(df[['seq_sp1','seq_sp2']], axis=1), index=df.index).duplicated()
df = df[~mask]

print (df)
   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq9    seq5
4        3    seq7    seq4
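
The mask approach can be wrapped in a small helper so the original frame is never modified. This is only a sketch: the function name drop_unordered_duplicates is hypothetical, and the sample frame is reconstructed from the outputs shown in this answer.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(df, cols):
    """Drop rows whose values in `cols` repeat an earlier row in any order."""
    # Sort within each row, only to build a comparison key.
    key = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)
    # duplicated() marks every repeat after the first occurrence as True.
    return df[~key.duplicated()]


# Sample frame reconstructed from the printed outputs (assumed data).
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4"],
})

result = drop_unordered_duplicates(df, ["seq_sp1", "seq_sp2"])
print(result)
```

Because the sorted values live only in the temporary key, the surviving rows keep their original column order.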

EDIT:

I tested it with new data:

df = df[['qseqid','sseqid']]
print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1
19  EOG090X00GO_0042_0035_1  EOG090X00GO_0035_0035_1
20  EOG090X00GO_0042_0035_1  EOG090X00GO_0035_0042_1
22  EOG090X00GO_0042_0042_1  EOG090X00GO_0035_0035_1
23  EOG090X00GO_0042_0042_1  EOG090X00GO_0035_0042_1

df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])

print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1

mask = pd.DataFrame(np.sort(df[['qseqid','sseqid']], axis=1), index=df.index).duplicated()
print (~mask)
13     True
14     True
16     True
17     True
19    False
20    False
22    False
23    False
dtype: bool

df = df[~mask]
print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1
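
A quick way to confirm that no reversed pair survived is to compare the kept rows as unordered sets. The sketch below rebuilds the eight test rows shown above; the short variable names a, b, c, d and the check itself are illustrative additions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# The four IDs from the test data, abbreviated for readability.
a, b = "EOG090X00GO_0035_0035_1", "EOG090X00GO_0035_0042_1"
c, d = "EOG090X00GO_0042_0035_1", "EOG090X00GO_0042_0042_1"

df = pd.DataFrame(
    {"qseqid": [a, a, b, b, c, c, d, d],
     "sseqid": [c, d, c, d, a, b, a, b]},
    index=[13, 14, 16, 17, 19, 20, 22, 23],
)

# Same mask as above: sort within each row, then flag repeats.
key = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)
kept = df[~key.duplicated()]

# Each unordered pair should now appear exactly once.
pairs = [frozenset(row) for row in kept.to_numpy()]
all_unique = len(pairs) == len(set(pairs))
print(kept.index.tolist(), all_unique)
```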

Solution 2:

You can try this:

#sorting rows and joining as string
df["seq_sorted"] = df.apply(lambda row: ",".join(sorted((row.seq_sp1, row.seq_sp2))), axis=1)

#dropping duplicates
df = df.drop_duplicates(subset="seq_sorted").drop(["seq_sorted"], axis=1)
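
One caveat with a joined-string key: it assumes the separator never occurs inside a sequence name. A minimal sketch that sidesteps this by using a sorted tuple as the key is shown below; the toy data and the name pair_key are hypothetical and chosen only to trigger the collision.

```python
import pandas as pd

# Toy data (hypothetical): names that contain the "," separator.
df = pd.DataFrame({
    "seq_sp1": ["a,b", "a", "c"],
    "seq_sp2": ["c", "b,c", "a,b"],
})

# Joined strings collide: all three rows would produce "a,b,c".
joined = df.apply(lambda row: ",".join(sorted((row.seq_sp1, row.seq_sp2))), axis=1)

# A sorted tuple is hashable and keeps the two names separate.
pair_key = df.apply(lambda row: tuple(sorted((row.seq_sp1, row.seq_sp2))), axis=1)

deduped = df[~pair_key.duplicated()]
print(deduped)
```

Here only the third row is a true reversed duplicate of the first, so the tuple key keeps two rows where the joined string would keep one.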

Solution 3:

For example:

df_set = df.apply(lambda x: str(sorted(set(x))), 1)

In: df[~df_set.duplicated()]
Out: 
        seq_sp1 seq_sp2
cluster                
1         seq20   seq56
2          seq3    seq5
3          seq9    seq5
3          seq7    seq4
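
The output above has cluster as the index, so the lambda only sees the two sequence columns. A self-contained sketch under that assumption follows; the sample frame is reconstructed from the outputs in this thread.

```python
import pandas as pd

# Sample frame reconstructed from the printed outputs (assumed data),
# with cluster moved to the index so apply() only sees the name columns.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4"],
}).set_index("cluster")

# Build an order-independent string key for each row.
df_set = df.apply(lambda x: str(sorted(set(x))), axis=1)

# Keep only the first row for each key.
result = df[~df_set.duplicated()]
print(result)
```

Since set() also collapses repeated values within a single row, a row holding the same name twice gets a one-element key.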

Solution 4:

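You can use pd.DataFrame.apply to sort each row (axis=1), then use pd.Series.duplicated to find the duplicates. Returning a tuple rather than a list keeps each key hashable, which duplicated requires.

# build a sorted, hashable key per row, then flag repeated keys
dups = df[['seq_sp1', 'seq_sp2']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()
res = df[~dups]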

print(res)

   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq9    seq5
4        3    seq7    seq4
