Remove Duplicated Seq Name Pandas
I have a dataframe; here is an example:
cluster seq_sp1 seq_sp2
1 seq20 seq56
1 seq56 seq20
2 seq3 seq5
3
Solution 1:
I think you need numpy.sort with drop_duplicates. This returns the rows with sorted values:
df[['seq_sp1','seq_sp2']] = np.sort(df[['seq_sp1','seq_sp2']], axis=1)
df = df.drop_duplicates(subset=['seq_sp1','seq_sp2'])
print (df)
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq5 seq9
4 3 seq4 seq7
Or use DataFrame.duplicated to build a mask, invert it with ~, and filter by boolean indexing. This keeps the original, unsorted values in the output:
mask = pd.DataFrame(np.sort(df[['seq_sp1','seq_sp2']], axis=1), index=df.index).duplicated()
df = df[~mask]
print (df)
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq9 seq5
4 3 seq7 seq4
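Both variants above can be put side by side in a self-contained sketch. The input rows here are an assumption reconstructed from the printed outputs, since the example in the question is cut off; the names pairs, sorted_result and original_result are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows reconstructed from the printed outputs (an assumption:
# the original input in the question is truncated).
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4"],
})
cols = ["seq_sp1", "seq_sp2"]

# Sort the two names within each row so (a, b) and (b, a) share one key.
pairs = pd.DataFrame(
    np.sort(df[cols].to_numpy(), axis=1), index=df.index, columns=cols
)

# Variant 1: the result holds the sorted values.
sorted_result = df.assign(**{c: pairs[c] for c in cols}).drop_duplicates(subset=cols)

# Variant 2: the result keeps the original, unsorted values.
original_result = df[~pairs.duplicated()]

print(sorted_result)
print(original_result)
```

Neither variant modifies df here, so both can be run on the same frame.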
EDIT:
I tested it with the new data:
df = df[['qseqid','sseqid']]
print (df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
19 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0035_1
20 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0042_1
22 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0035_1
23 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0042_1
df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])
print (df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
mask = pd.DataFrame(np.sort(df[['qseqid','sseqid']], axis=1), index=df.index).duplicated()
print (~mask)
13 True
14 True
16 True
17 True
19 False
20 False
22 False
23 False
dtype: bool
df = df[~mask]
print (df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
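Because the first variant reassigns df, the mask variant has to be run on the original eight-row frame to give the mask printed above. One way to avoid that pitfall is a small helper that never modifies its input. This is only a sketch: the function name drop_unordered_duplicates, its keep parameter, and the variable names hits and unique_hits are hypothetical, and the data is the frame printed above.

```python
import numpy as np
import pandas as pd

def drop_unordered_duplicates(frame, cols, keep="first"):
    """Return the rows of `frame` whose values in `cols` are unique
    when the order of the values within a row is ignored."""
    # Sort within each row to build an order-independent key.
    keys = pd.DataFrame(np.sort(frame[cols].to_numpy(), axis=1), index=frame.index)
    # Filter with the inverted mask; `frame` itself is left untouched.
    return frame[~keys.duplicated(keep=keep)]

p = "EOG090X00GO_"
hits = pd.DataFrame(
    {
        "qseqid": [p + s for s in ["0035_0035_1", "0035_0035_1", "0035_0042_1", "0035_0042_1",
                                   "0042_0035_1", "0042_0035_1", "0042_0042_1", "0042_0042_1"]],
        "sseqid": [p + s for s in ["0042_0035_1", "0042_0042_1", "0042_0035_1", "0042_0042_1",
                                   "0035_0035_1", "0035_0042_1", "0035_0035_1", "0035_0042_1"]],
    },
    index=[13, 14, 16, 17, 19, 20, 22, 23],
)

unique_hits = drop_unordered_duplicates(hits, ["qseqid", "sseqid"])
print(unique_hits)
```

Passing keep="last" would keep the later row of each reciprocal pair instead.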
Solution 2:
You can try this:
# sort the two values in each row and join them into one string
df["seq_sorted"] = df.apply(lambda row: ",".join(x for x in sorted((row.seq_sp1, row.seq_sp2))), axis=1)
# drop duplicates, then remove the helper column
df = df.drop_duplicates(subset="seq_sorted").drop(["seq_sorted"], axis=1)
Solution 3:
For example, with cluster set as the index:
df_set = df.apply(lambda x: str(sorted(set(x))), 1)
In: df[~df_set.duplicated()]
Out:
seq_sp1 seq_sp2
cluster
1 seq20 seq56
2 seq3 seq5
3 seq9 seq5
3 seq7 seq4
Solution 4:
You can use pd.DataFrame.apply to apply sorted on axis=1, then use pd.Series.duplicated to drop duplicates.
dups = df[['seq_sp1', 'seq_sp2']].apply(sorted, axis=1).duplicated()
res = df[~dups]
print(res)
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq9 seq5
4 3 seq7 seq4
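Depending on the pandas version, apply(sorted, axis=1) may return a Series of lists, and duplicated can then fail because lists are not hashable. A hedged variant of the same idea wraps each sorted row in a tuple so every row has a hashable key. The sample rows are an assumption reconstructed from the output above.

```python
import pandas as pd

# Sample rows reconstructed from the printed output (an assumption).
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4"],
})

# sorted() returns a list; converting it to a tuple makes the key hashable.
keys = df[["seq_sp1", "seq_sp2"]].apply(lambda row: tuple(sorted(row)), axis=1)

# Keep the first row for each unordered pair, with the original values.
res = df[~keys.duplicated()]
print(res)
```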