Pandas Extract Regex Allowing Mismatches
Pandas has a fast and convenient string method, .str.extract(). It works perfectly with a strict regex such as this one: strict_pattern = r'^(?P<pre_spacer>ACGAG)(?P<UMI>.{
Solution 1:
Until pandas is built on the regex library rather than Python's standard re module, you can't use fuzzy-matching features such as {s<=1} in .str.extract().
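As a quick illustration of why the fuzzy syntax does not carry over, the sketch below uses only the standard re module (the engine pandas string methods rely on). It suggests that re does not reject {s<=1} but reads it as literal characters, so the pattern silently stops matching ordinary reads. The shortened pattern and the two input strings are made up for this check.

```python
import re

# The standard-library engine has no fuzzy syntax, so "{s<=1}" is taken
# as six literal characters that must appear in the input.
pattern = re.compile(r"^(ACGAG){s<=1}")

exact_read = pattern.search("ACGAGTTTT")          # no literal "{s<=1}" in the read
literal_text = pattern.search("ACGAG{s<=1}TTTT")  # literal characters present

print(exact_read)             # None
print(literal_text.group(0))  # ACGAG{s<=1}
```

Because no error is raised, passing the lax pattern to .str.extract() would most likely just yield missing values rather than a visible failure.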
You will probably have to rely on .apply with a custom method:
import regex  # third-party package; supports fuzzy matching such as {s<=1}
import pandas as pd

test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})
# {s<=1} after a group allows at most one substitution inside that group
lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")
# Returned for rows where the pattern does not match
empty_val = pd.Series(["", "", ""], index=['pre_spacer', 'UMI', 'post_spacer'])

def extract_regex(seq):
    m = lax_pattern.search(seq)
    if m:
        # groupdict() keeps the named groups in pattern order
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer', 'UMI', 'post_spacer'])
    else:
        return empty_val

test_df[["pre_spacer", "UMI", "post_spacer"]] = test_df['R1'].apply(extract_regex)
Output:
>>> test_df
                               R1  pre_spacer           UMI  post_spacer
0  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG       ACGAG  TTTTCGTATTTT     TGGAGTCT
1                        AAAAGGGA
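The test data above only contains an exact match, so it does not exercise the mismatch allowance itself. The following is a minimal sketch, assuming the third-party regex package is installed, that runs the same lax pattern on two hypothetical reads derived from the first test sequence: one with a single substitution in the pre-spacer and one with two.

```python
import regex  # third-party package (pip install regex)

lax_pattern = regex.compile(
    r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
)

# Hypothetical reads, not part of the original test data
one_mismatch = "ACGTGTTTTCGTATTTTTGGAGTCTTGTGG"    # pre-spacer ACGAG -> ACGTG
two_mismatches = "ACTTGTTTTCGTATTTTTGGAGTCTTGTGG"  # pre-spacer ACGAG -> ACTTG

m = lax_pattern.search(one_mismatch)
print(m.groupdict())
# {'pre_spacer': 'ACGTG', 'UMI': 'TTTTCGTATTTT', 'post_spacer': 'TGGAGTCT'}

print(lax_pattern.search(two_mismatches))
# None: two substitutions exceed the s<=1 budget for the pre-spacer
```

Note that the captured pre_spacer is the text actually found in the read (ACGTG), not the reference sequence, which matters if the extracted columns are later compared against the expected spacers.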