Python Regex: Alternation For Sets Of Words

June 27, 2023

We know that \ba\b|\bthe\b will match either the word 'a' or the word 'the'. I want to build a regular expression that matches a pattern like a/the/one reason/reasons for/of. Which means I want to match a

Solution 1:

You need to use groups so that the OR operators (|) of the different word sets do not mix:

(\ba\b|\bthe\b|\bone\b) (\breason\b|reasons\b) (\bfor\b|\bof\b)

A more elegant approach is to put the word boundaries around the groups instead of around every word. Also note that where a literal space in the regex separates two words, no word boundary is needed at that position. For reason and reasons, you can make the final s optional with ?. And if you don't need each word as a separate captured group, you can turn the groups into non-capturing groups with ?:.

\b(?:a|the|one) reasons? (?:for|of)\b

Or use capturing groups if you want each word in its own group:

\b(a|the|one) (reasons?) (for|of)\b
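
To make the difference between the two variants concrete, here is a minimal sketch using Python's re module. The sample sentence and the variable names are only illustrative assumptions.

```python
import re

# Same pattern twice: once with non-capturing groups, once with capturing groups.
non_capturing = re.compile(r'\b(?:a|the|one) reasons? (?:for|of)\b')
capturing = re.compile(r'\b(a|the|one) (reasons?) (for|of)\b')

text = 'She gave one reason for the delay.'  # illustrative sample text

m1 = non_capturing.search(text)
m2 = capturing.search(text)

print(m1.group(0))  # one reason for
print(m1.groups())  # () because nothing is captured
print(m2.groups())  # ('one', 'reason', 'for')
```

Both patterns match the same text; they differ only in whether the individual words are available afterwards through groups().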

Solution 2:

The alternation operator A|B means "if either A or B matches, then the whole thing matches". Because | has the lowest precedence of all regex operators, it splits the entire pattern. So in your case, the resulting regular expression matches wherever any of the following 5 regular expressions matches:

\ba\b
\bthe\b
\bone\b \breason\b
reasons\b \bfor\b
\bof\b

To limit the extent to which | applies, use non-capturing grouping, that is (?:something|something else). Also, to allow an optional s at the end of reason you do not need alternation; reason|reasons is exactly equivalent to reasons?.

Thus we get the regular expression \b(?:a|the|one) reasons? (?:for|of)\b.

Note that you do not need the word boundary operator \b inside the regular expression, only at the beginning and end (without those two, it would match inside something like everyone reasons forever).
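
The following sketch illustrates both points. The ungrouped pattern is an assumption: it is simply the five alternatives listed above joined with |. The test strings are illustrative.

```python
import re

# Assumed ungrouped pattern: the five alternatives from above joined with |.
ungrouped = re.compile(r'\ba\b|\bthe\b|\bone\b \breason\b|reasons\b \bfor\b|\bof\b')
# Grouped pattern with word boundaries only at the start and end.
grouped = re.compile(r'\b(?:a|the|one) reasons? (?:for|of)\b')
# Same grouped pattern, but without any word boundaries.
unanchored = re.compile(r'(?:a|the|one) reasons? (?:for|of)')

# Without grouping, every alternative matches on its own.
print(ungrouped.findall('a cup of tea'))  # ['a', 'of']
print(grouped.search('a cup of tea'))     # None

# Without the outer \b, the pattern matches inside longer words.
print(unanchored.search('everyone reasons forever').group(0))  # one reasons for
print(grouped.search('everyone reasons forever'))              # None
```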

Solution 3:

An interesting feature of the regex module is the named list. With it, you don't have to write out several alternatives separated by | in a non-capturing group. You only need to define the list beforehand and refer to it by name in the pattern. Example:

import regex

words = [ ['a', 'the', 'one'], ['reason', 'reasons'], ['for', 'of'] ]

pattern = r'\m \L<word1> \s+ \L<word2> \s+ \L<word3> \M'
p = regex.compile(pattern, regex.X, word1=words[0], word2=words[1], word3=words[2])

s = 'the reasons for'
print(p.search(s))

Even if this feature isn't essential, it improves readability.

You can achieve something similar with the re module if you join items with | before:

import re

words = [ ['a', 'the', 'one'], ['reason', 'reasons'], ['for', 'of'] ]

words = ['|'.join(x) for x in words]

pattern = r'\b ({}) \s+ ({}) \s+ ({}) \b'.format(*words)

p = re.compile(pattern, re.X)
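
As a possible variation on the re approach, the sketch below wraps the joining step in a function and escapes each word with re.escape, so that words containing regex metacharacters would be treated literally. The helper name build_pattern and the sample text are hypothetical, and the sketch assumes every word starts and ends with a word character so that \b behaves as expected.

```python
import re

def build_pattern(word_sets):
    """Hypothetical helper: one non-capturing group per word set, joined by \\s+."""
    groups = [
        '(?:{})'.format('|'.join(re.escape(word) for word in word_set))
        for word_set in word_sets
    ]
    # Word boundaries only at the outer ends; \s+ separates the words.
    return re.compile(r'\b' + r'\s+'.join(groups) + r'\b')

words = [['a', 'the', 'one'], ['reason', 'reasons'], ['for', 'of']]
p = build_pattern(words)

m = p.search('Give me the   reasons of that choice')
print(repr(m.group(0)))  # 'the   reasons of'
```

Building the pattern without re.X avoids having to think about which spaces in the pattern are ignored, at the cost of a less readable pattern string.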

Solution 4:

Use parentheses for grouping:

r'\b(a|the|one) reason(|s) (for|of)\b'

I left the sentence-internal \b's out since the spaces imply them: A space following a letter is always a word boundary. In general you should put the \b outside the alternatives; it's shorter and more readable.

If it matters, you can use "non-capturing groups" in all modern regexp engines: Use (?:stuff) instead of (stuff). But if it doesn't matter for your uses, or if you need to know which of the word alternatives are actually present, then go with simple parens.
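
One Python-specific detail is worth checking when writing this pattern as a string literal: in an ordinary (non-raw) string, '\b' is a backspace character, so the pattern should be a raw string. A minimal sketch, with illustrative variable names:

```python
import re

plain = '\b(a|the|one) reason(|s) (for|of)\b'   # '\b' becomes a backspace here
raw = r'\b(a|the|one) reason(|s) (for|of)\b'    # '\b' stays a word boundary

print(plain[0] == '\x08')                    # True
print(re.search(plain, 'the reasons for'))   # None

# With simple parentheses, the groups show which alternatives were present.
print(re.search(raw, 'the reasons for').groups())  # ('the', 's', 'for')
print(re.search(raw, 'a reason of').groups())      # ('a', '', 'of')
```

The second group is either 's' or the empty string, which is what the (|s) alternative captures.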

Solution 5:

As I understand it, you want a regex like this:

(?:a|the|one)\s+(?:reason|reasons)\s+(?:for|of)

It's simple: just combine the alternatives by using groups.

Note: your requirement above does not sound very strict to me. In case you want to modify the pattern yourself, consider the explanation below.

Explanation

(?:abc|ijk|xyz)

Matches any one of the words abc, ijk, or xyz. Because they are grouped with a non-capturing group (?:...), the matched word is not captured into a regex variable such as $1, $2, $3, ....

\s+

This is the word delimiter, which I set here to any whitespace; + stands for one or more.
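
To see what \s+ changes compared with a literal space, here is a small sketch with Python's re module. The sample text, which separates the words with a tab and a newline, is an illustrative assumption.

```python
import re

# Pattern from this solution: words separated by one or more whitespace characters.
flexible = re.compile(r'(?:a|the|one)\s+(?:reason|reasons)\s+(?:for|of)')
# Same alternatives, but separated by exactly one literal space.
single_space = re.compile(r'(?:a|the|one) (?:reason|reasons) (?:for|of)')

text = 'one\treason\nof'  # tab and newline between the words

print(flexible.search(text) is not None)  # True
print(single_space.search(text))          # None
```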
