
35.3.5 Problems with Regular Expressions

The Emacs regexp implementation, like many of its kind, is generally robust but occasionally causes trouble in either of two ways: matching may run out of internal stack space and signal an error, or it may take a long time to complete. The advice below will make these symptoms less likely and help alleviate problems that do arise.

  • Anchor regexps at the beginning of a line, string or buffer using zero-width assertions ('^' and '\`'). This takes advantage of fast paths in the implementation and can avoid futile matching attempts. Other zero-width assertions may also bring benefits by causing a match to fail early.

  • Avoid or-patterns in favor of bracket expressions: write '[ab]' instead of 'a\|b'. Recall that '\s-' and '\sw' are equivalent to '[[:space:]]' and '[[:word:]]', respectively, most of the time.

  • Since the last branch of an or-pattern does not add a backtrack point on the stack, consider putting the most likely matched pattern last. For example, '^\(?:a\|.b\)*c' will run out of stack if trying to match a very long string of 'a's, but the equivalent '^\(?:.b\|a\)*c' will not.

    (It is a trade-off: successfully matched or-patterns run faster with the most frequently matched pattern first.)

  • Try to ensure that any part of the text can only match in a single way. For example, 'a*a*' will match the same set of strings as 'a*', but the former can do so in many ways and will therefore cause slow backtracking if the match fails later on. Make or-pattern branches mutually exclusive if possible, so that matching will not go far into more than one branch before failing.

    Be especially careful with nested repetitions: they can easily result in very slow matching in the presence of ambiguities. For example, '\(?:a*b*\)+c' will take a long time attempting to match even a moderately long string of 'a's before failing. The equivalent '\(?:a\|b\)*c' is much faster, and '[ab]*c' better still.

  • Don't use capturing groups unless they are really needed; that is, use '\(?:...\)' instead of '\(...\)' for bracketing purposes.

  • Consider using rx (see The rx Structured Regexp Notation); it can optimize some or-patterns automatically and will never introduce capturing groups unless explicitly requested.
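
When rewriting a regexp along the lines above, it can help to check that the new form still matches the same text as the old one. The following is a minimal sketch of such a check, not a definitive tool: the helper my-regexp-agree-p and the variable my-samples are hypothetical names introduced here for illustration, and a handful of sample strings cannot prove two regexps equivalent.

```elisp
;; Sketch only: `my-regexp-agree-p' and `my-samples' are hypothetical
;; names used for illustration; they are not part of Emacs.
(require 'rx)

(defun my-regexp-agree-p (re1 re2 strings)
  "Return non-nil if RE1 and RE2 match at the same position in each of STRINGS."
  (let ((case-fold-search nil)
        (ok t))
    (dolist (s strings ok)
      ;; `string-match-p' returns the match position, or nil on failure.
      (unless (equal (string-match-p re1 s) (string-match-p re2 s))
        (setq ok nil)))))

(defvar my-samples '("c" "abc" "aabbc" "bac" "ab" "xc" "")
  "Short sample strings, some matching and some not.")

;; Nested repetition versus a bracket expression, both anchored with \`.
(my-regexp-agree-p "\\`\\(?:a*b*\\)+c" "\\`[ab]*c" my-samples)

;; The same pattern written with rx, which adds no capturing groups
;; unless one is requested explicitly.
(my-regexp-agree-p (rx bos (* (or "a" "b")) "c") "\\`[ab]*c" my-samples)
```

Both calls should return non-nil for these samples. The sample strings are deliberately short, so that even the slow nested-repetition form finishes quickly when it fails to match.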

If you run into regexp stack overflow despite following the above advice, don't be afraid of performing the matching in multiple function calls, each using a simpler regexp where backtracking can more easily be contained.
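
As one illustration of splitting the work, suppose a string is expected to consist of items such as 'key=123;'. A single regexp like '\`\(?:[a-z]+=[0-9]+;\)*\'' has to keep backtracking information for the whole input, which may exhaust the stack on very long strings. The sketch below instead matches one item per call; the function name my-all-items-valid-p and the item syntax are assumptions made up for this example.

```elisp
;; Sketch only: `my-all-items-valid-p' is a hypothetical helper and the
;; "key=123;" item syntax is an assumption made for illustration.
(defun my-all-items-valid-p (string)
  "Return non-nil if STRING consists entirely of items like \"key=123;\"."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (let ((case-fold-search nil))
      ;; `looking-at' only matches at point, so each call is anchored
      ;; and backtracks within a single item at most.
      (while (looking-at "[a-z]+=[0-9]+;")
        (goto-char (match-end 0)))
      ;; Valid only if the loop consumed the entire text.
      (eobp))))

(my-all-items-valid-p "a=1;bc=22;")   ; non-nil
(my-all-items-valid-p "a=1;bc=;")     ; nil
```

Every successful match consumes at least one character, so the loop always terminates.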

Function: re--describe-compiled regexp &optional raw

To help diagnose problems in your regexps or in the regexp engine itself, this function returns a string describing the compiled form of regexp. To make sense of it, it can be necessary to read at least the description of the re_opcode_t type in the src/regex-emacs.c file in Emacs's source code.

This function can currently give a meaningful description only if Emacs was compiled with --enable-checking.
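
Because this function may be missing from older Emacs versions and may not produce useful output in every build, code that calls it is safer when guarded. The wrapper below is a minimal sketch under those assumptions; my-describe-regexp is a hypothetical name, and the exact format of the returned description is not specified here.

```elisp
;; Sketch only: `my-describe-regexp' is a hypothetical wrapper.
(defun my-describe-regexp (regexp)
  "Return a description of the compiled form of REGEXP, or nil if unavailable."
  (when (fboundp 're--describe-compiled)
    ;; Assumption: the call may signal an error in some builds, so
    ;; treat any error as "no description available".
    (ignore-errors (re--describe-compiled regexp))))

(my-describe-regexp "\\`[ab]*c")
```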