15.7 Backslash in Regular Expressions
For the most part, ‘\
’ followed by any character matches only that character. However, there are several exceptions: two-character sequences starting with ‘\
’ that have special meanings. The second character in the sequence is always an ordinary character when used on its own. Here is a table of ‘\
’ constructs.
\|
specifies an alternative. Two regular expressions a
and b
with ‘\|
’ in between form an expression that matches some text if either a
matches it or b
matches it. It works by trying to match a
, and if that fails, by trying to match b
.
Thus, ‘foo\|bar
’ matches either ‘foo
’ or ‘bar
’ but no other string.
‘\|
’ applies to the largest possible surrounding expressions. Only a surrounding ‘\( … \)
’ grouping can limit the grouping power of ‘\|
’.
Full backtracking capability exists to handle multiple uses of ‘\|
’.
\( … \)
is a grouping construct that serves three purposes:
- To enclose a set of ‘
\|
’ alternatives for other operations. Thus, ‘\(foo\|bar\)x
’ matches either ‘foox
’ or ‘barx
’. - To enclose a complicated expression for the postfix operators ‘
*
’, ‘+
’ and ‘?
’ to operate on. Thus, ‘ba\(na\)*
’ matches ‘bananana
’, etc., with any (zero or more) number of ‘na
’ strings. - To record a matched substring for future reference.
This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature that is assigned as a second meaning to the same ‘\( … \)
’ construct. In practice there is usually no conflict between the two meanings; when there is a conflict, you can use a shy group, described below.
\(?: … \)
specifies a shy group that does not record the matched substring; you can’t refer back to it with ‘\d
’ (see below). This is useful in mechanically combining regular expressions, so that you can add groups for syntactic purposes without interfering with the numbering of the groups that are meant to be referred to.
\d
matches the same text that matched the d
th occurrence of a ‘\( … \)
’ construct. This is called a back reference.
After the end of a ‘\( … \)
’ construct, the matcher remembers the beginning and end of the text matched by that construct. Then, later on in the regular expression, you can use ‘\
’ followed by the digit d
to mean “match the same text matched the d
th time by the ‘\( … \)
’ construct".
The strings matching the first nine ‘\( … \)
’ constructs appearing in a regular expression are assigned numbers 1 through 9 in the order that the open-parentheses appear in the regular expression. So you can use ‘\1
’ through ‘\9
’ to refer to the text matched by the corresponding ‘\( … \)
’ constructs.
For example, ‘\(.*\)\1
’ matches any newline-free string that is composed of two identical halves. The ‘\(.*\)
’ matches the first half, which may be anything, but the ‘\1
’ that follows must match the same exact text.
If a particular ‘\( … \)
’ construct matches more than once (which can easily happen if it is followed by ‘*
’), only the last match is recorded.
\`
matches the empty string, but only at the beginning of the string or buffer (or its accessible portion) being matched against.
\'
matches the empty string, but only at the end of the string or buffer (or its accessible portion) being matched against.
\=
matches the empty string, but only at point.
\b
matches the empty string, but only at the beginning or end of a word. Thus, ‘\bfoo\b
’ matches any occurrence of ‘foo
’ as a separate word. ‘\bballs?\b
’ matches ‘ball
’ or ‘balls
’ as a separate word.
‘\b
’ matches at the beginning or end of the buffer regardless of what text appears next to it.
\B
matches the empty string, but not at the beginning or end of a word.
\<
matches the empty string, but only at the beginning of a word. ‘\<
’ matches at the beginning of the buffer only if a word-constituent character follows.
\>
matches the empty string, but only at the end of a word. ‘\>
’ matches at the end of the buffer only if the contents end with a word-constituent character.
\w
matches any word-constituent character. The syntax table determines which characters these are. See Syntax Tables in The Emacs Lisp Reference Manual.
\W
matches any character that is not a word-constituent.
\_<
matches the empty string, but only at the beginning of a symbol. A symbol is a sequence of one or more symbol-constituent characters. A symbol-constituent character is a character whose syntax is either ‘w
’ or ‘_
’. ‘\_<
’ matches at the beginning of the buffer only if a symbol-constituent character follows. As with words, the syntax table determines which characters are symbol-constituent.
\_>
matches the empty string, but only at the end of a symbol. ‘\_>
’ matches at the end of the buffer only if the contents end with a symbol-constituent character.
\sc
matches any character whose syntax is c
. Here c
is a character that designates a particular syntax class: thus, ‘w
’ for word constituent, ‘-
’ or ‘
’ for whitespace, ‘.
’ for ordinary punctuation, etc. See Syntax Class Table in The Emacs Lisp Reference Manual.
\Sc
matches any character whose syntax is not c
.
\cc
matches any character that belongs to the category c
. For example, ‘\cc
’ matches Chinese characters, ‘\cg
’ matches Greek characters, etc. For the description of the known categories, type M-x describe-categories RET
.
\Cc
matches any character that does not belong to category c
.
The constructs that pertain to words and syntax are controlled by the setting of the syntax table. See Syntax Tables in The Emacs Lisp Reference Manual.