
Pandas Extract Regex Allowing Mismatches

Pandas has a fast and convenient string method, extract(). This method works perfectly with a regex such as this one: strict_pattern = r'^(?P<pre_spacer>ACGAG)(?P<UMI>.{

Solution 1:

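Pandas string methods such as .extract are built on Python's standard re module, which does not support fuzzy matching syntax like {s<=1}. That syntax belongs to the third-party regex library, so until pandas can use regex as its engine, you can't use these features in .extract.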
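To see why the fuzzy syntax cannot simply be handed to the standard re module (the engine behind pandas string methods), here is a minimal sketch. It suggests that re does not reject {s<=1}: because the braces do not form a valid repetition count, re reads them as literal characters, so a read with one mismatch quietly fails to match. The two sample strings are made up for illustration.

```python
import re

# The standard-library engine has no fuzzy matching.
# "{s<=1}" is not a valid repetition count, so re treats it as literal text.
pattern = re.compile(r"^(ACGAG){s<=1}")

one_mismatch = "ACGATTTTT"        # hypothetical read, one substitution in the spacer
literal_text = "ACGAG{s<=1}TTTT"  # contains the characters "{s<=1}" verbatim

mismatch_found = pattern.search(one_mismatch) is not None
literal_found = pattern.search(literal_text) is not None

print(mismatch_found)  # False
print(literal_found)   # True
```

In other words, the pattern compiles without complaint but never allows a mismatch, which is easy to overlook.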

You will probably have to rely on .apply with a custom method:

import regex
import pandas as pd

test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})

lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")

empty_val = pd.Series(["","",""], index=['pre_spacer','UMI','post_spacer'])

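def extract_regex(seq):
    # Search with the fuzzy pattern; each spacer tolerates one substitution.
    m = lax_pattern.search(seq)
    if m:
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer','UMI','post_spacer'])
    else:
        # No match: return empty strings for all three columns.
        return empty_val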


test_df[["pre_spacer","UMI","post_spacer"]] = test_df['R1'].apply(extract_regex)

Output:

>>> test_df
                               R1 pre_spacer           UMI post_spacer
0  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG      ACGAG  TTTTCGTATTTT    TGGAGTCT
1                        AAAAGGGA                                     
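To make concrete what {s<=1} permits (at most one substitution, no insertions or deletions), here is a rough standard-library sketch of the same extraction. It is not how the regex module works internally and may pick a different match in ambiguous cases. The names hamming and extract_lax, and the third sample read, are hypothetical and used only for illustration.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def extract_lax(seq, pre="ACGAG", post="TGGAGTCT",
                min_umi=9, max_umi=13, max_subs=1):
    """Return (pre_spacer, UMI, post_spacer), or None if nothing fits."""
    head = seq[:len(pre)]
    # The read must start with the pre-spacer, allowing max_subs substitutions.
    if len(head) < len(pre) or hamming(head, pre) > max_subs:
        return None
    # Try the longest UMI first, mirroring the greedy .{9,13} quantifier.
    for n in range(max_umi, min_umi - 1, -1):
        start = len(pre) + n
        tail = seq[start:start + len(post)]
        if len(tail) == len(post) and hamming(tail, post) <= max_subs:
            return head, seq[len(pre):start], tail
    return None


exact = extract_lax("ACGAGTTTTCGTATTTTTGGAGTCTTGTGG")
no_match = extract_lax("AAAAGGGA")
# Hypothetical read: same as the first one, with G -> T in the pre-spacer.
one_sub = extract_lax("ACGATTTTTCGTATTTTTGGAGTCTTGTGG")

print(exact)     # ('ACGAG', 'TTTTCGTATTTT', 'TGGAGTCT')
print(no_match)  # None
print(one_sub)   # ('ACGAT', 'TTTTCGTATTTT', 'TGGAGTCT')
```

For the two reads in test_df, this agrees with the output shown above. The regex-based version remains the more general tool, since it also handles insertions, deletions and variable-length patterns.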
