34.3.3 The rx Structured Regexp Notation
As an alternative to the string-based syntax, Emacs provides the structured rx notation based on Lisp S-expressions. This notation is usually easier to read, write and maintain than regexp strings, and can be indented and commented freely. It requires a conversion into string form since that is what regexp functions expect, but that conversion typically takes place during byte-compilation rather than when the Lisp code using the regexp is run.
Here is an rx regexp [1] that matches a block comment in the C programming language:
(rx "/*"                          ; Initial /*
    (zero-or-more
     (or (not (any "*"))          ;  Either non-*,
         (seq "*"                 ;  or * followed by
              (not (any "/")))))  ;  non-/
    (one-or-more "*")             ; At least one star,
    "/")                          ; and the final /
or, using shorter synonyms and written more compactly,
(rx "/*"
    (* (| (not "*")
          (: "*" (not "/"))))
    (+ "*") "/")
In conventional string syntax, it would be written
"/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
The rx notation is mainly useful in Lisp code; it cannot be used in most interactive situations where a regexp is requested, such as when running query-replace-regexp or in variable customization.
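As a small illustration of using the notation in Lisp code, the block-comment form shown earlier can be passed to any function that expects a string regexp. The names my-c-block-comment-regexp and my-first-c-comment are hypothetical and exist only for this sketch.

```elisp
(require 'rx)

;; Hypothetical constant: the C block-comment regexp from the text,
;; translated to a string by the `rx' macro.
(defconst my-c-block-comment-regexp
  (rx "/*"
      (zero-or-more
       (or (not (any "*"))
           (seq "*" (not (any "/")))))
      (one-or-more "*")
      "/")
  "String regexp matching a C block comment, built with `rx'.")

;; Hypothetical helper: search a string with the regexp above.
(defun my-first-c-comment (text)
  "Return the first C block comment found in TEXT, or nil."
  (when (string-match my-c-block-comment-regexp text)
    (match-string 0 text)))

(my-first-c-comment "int x; /* a * star */ int y;")
```

Because rx produces an ordinary string, the result works with string-match, re-search-forward and similar functions without further conversion.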
- Rx Constructs: Constructs valid in rx forms.
- Rx Functions: Functions and macros that use rx forms.
- Extending Rx: How to define your own rx forms.
34.3.3.1 Constructs in rx regexps
The various forms in rx regexps are described below. The shorthand rx represents any rx form, and rx… means zero or more rx forms. Where the corresponding string regexp syntax is given, A, B, … are string regexp subexpressions.
Literals
"some-string"
Match the string ‘some-string’ literally. There are no characters with special meaning, unlike in string regexps.
?C
Match the character ‘C’ literally.
Sequence and alternative
(seq rx…)
(sequence rx…)
(: rx…)
(and rx…)
Match the rxs in sequence. Without arguments, the expression matches the empty string.
Corresponding string regexp: ‘AB…’ (subexpressions in sequence).
(or rx…)
(| rx…)
Match exactly one of the rxs. If all arguments are strings, characters, or or-forms so constrained, the longest possible match will always be used. Otherwise, either the longest match or the first (in left-to-right order) will be used. Without arguments, the expression will not match anything at all.
Corresponding string regexp: ‘A\|B\|…’.
unmatchable
Refuse any match. Equivalent to (or). See regexp-unmatchable.
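The following sketch illustrates seq, or and unmatchable as described above. The helper my-match-text is hypothetical; it only returns the text of the first match so that the effect of each form is visible.

```elisp
(require 'rx)

;; Hypothetical helper used only for this illustration.
(defun my-match-text (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (when (string-match regexp string)
    (match-string 0 string)))

;; All arguments of `or' are strings, so the longest alternative is
;; used even though "foo" is listed first.
(my-match-text (rx (or "foo" "foobar")) "foobar")

;; `seq' matches its arguments one after another.
(my-match-text (rx (seq "foo" (or "bar" "baz"))) "xfoobazx")

;; `unmatchable' never matches, so the helper returns nil.
(my-match-text (rx unmatchable) "anything")
```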
Repetition
Normally, repetition forms are greedy, in that they attempt to match as many times as possible. Some forms are non-greedy; they try to match as few times as possible (see Non-greedy repetition).
(zero-or-more rx…)
(0+ rx…)
Match the rxs zero or more times. Greedy by default.
Corresponding string regexp: ‘A*’ (greedy), ‘A*?’ (non-greedy).
(one-or-more rx…)
(1+ rx…)
Match the rxs one or more times. Greedy by default.
Corresponding string regexp: ‘A+’ (greedy), ‘A+?’ (non-greedy).
(zero-or-one rx…)
(optional rx…)
(opt rx…)
Match the rxs once or an empty string. Greedy by default.
Corresponding string regexp: ‘A?’ (greedy), ‘A??’ (non-greedy).
(* rx…)
Match the rxs zero or more times. Greedy.
Corresponding string regexp: ‘A*’
(+ rx…)
Match the rxs one or more times. Greedy.
Corresponding string regexp: ‘A+’
(? rx…)
Match the rxs once or an empty string. Greedy.
Corresponding string regexp: ‘A?’
(*? rx…)
Match the rxs zero or more times. Non-greedy.
Corresponding string regexp: ‘A*?’
(+? rx…)
Match the rxs one or more times. Non-greedy.
Corresponding string regexp: ‘A+?’
(?? rx…)
Match the rxs once or an empty string. Non-greedy.
Corresponding string regexp: ‘A??’
(= n rx…)
(repeat n rx)
Match the rxs exactly n times.
Corresponding string regexp: ‘A\{n\}’
(>= n rx…)
Match the rxs n or more times. Greedy.
Corresponding string regexp: ‘A\{n,\}’
(** n m rx…)
(repeat n m rx…)
Match the rxs at least n but no more than m times. Greedy.
Corresponding string regexp: ‘A\{n,m\}’
The greediness of some repetition forms can be controlled using the following constructs. However, it is usually better to use the explicit non-greedy forms above when such matching is required.
(minimal-match rx)
Match rx, with zero-or-more, 0+, one-or-more, 1+, zero-or-one, opt and optional using non-greedy matching.
(maximal-match rx)
Match rx, with zero-or-more, 0+, one-or-more, 1+, zero-or-one, opt and optional using greedy matching. This is the default.
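To make the difference between greedy and non-greedy repetition concrete, here is a minimal sketch. The helper my-match-text is hypothetical and simply returns the matched text.

```elisp
(require 'rx)

;; Hypothetical helper used only for this illustration.
(defun my-match-text (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (when (string-match regexp string)
    (match-string 0 string)))

;; Greedy `*' takes as much as it can: the match runs to the last ">".
(my-match-text (rx "<" (* nonl) ">") "<a><b>")    ; "<a><b>"

;; Non-greedy `*?' stops at the first ">".
(my-match-text (rx "<" (*? nonl) ">") "<a><b>")   ; "<a>"

;; `minimal-match' makes the long-named form non-greedy as well.
(my-match-text (rx "<" (minimal-match (zero-or-more nonl)) ">") "<a><b>")

;; Counted repetition: exactly two digits, and two to three digits.
(my-match-text (rx (= 2 digit)) "12345")          ; "12"
(my-match-text (rx (** 2 3 digit)) "12345")       ; "123"
```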
Matching single characters
(any set…)
(char set…)
(in set…)
Match a single character from one of the sets. Each set is a character, a string representing the set of its characters, a range or a character class (see below). A range is either a hyphen-separated string like "A-Z", or a cons of characters like (?A . ?Z).
Note that hyphen (-) is special in strings in this construct, since it acts as a range separator. To include a hyphen, add it as a separate character or single-character string.
Corresponding string regexp: ‘[…]’
(not charspec)
Match a character not included in charspec. charspec can be a character, a single-character string, an any, not, or, intersection, syntax or category form, or a character class. If charspec is an or form, its arguments have the same restrictions as those of intersection; see below.
Corresponding string regexp: ‘[^…]’, ‘\Scode’, ‘\Ccode’
(intersection charset…)
Match a character included in all of the charsets. Each charset can be a character, a single-character string, an any form without character classes, or an intersection, or or not form whose arguments are also charsets.
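The character-set forms above can be combined as in the following sketch. All three constant names are hypothetical. The examples bind case-fold-search to nil, since otherwise letter ranges would also match the other case.

```elisp
(require 'rx)

;; "0-9" and "a-f" are ranges; the hyphen is passed as a separate
;; character so that it is taken literally.
(defconst my-hex-or-dash-regexp
  (rx bos (one-or-more (any "0-9" "a-f" ?-)) eos))

;; A range written as a cons of characters.
(defconst my-upper-ascii-regexp
  (rx bos (one-or-more (any (?A . ?Z))) eos))

;; Lower-case ASCII letters that are not vowels: the intersection of
;; a range with a negated set.
(defconst my-lower-consonants-regexp
  (rx bos
      (one-or-more (intersection (any "a-z") (not (any "aeiou"))))
      eos))

(let ((case-fold-search nil))
  (list (string-match-p my-hex-or-dash-regexp "dead-beef-42")
        (string-match-p my-upper-ascii-regexp "AbC")
        (string-match-p my-lower-consonants-regexp "xyz")))
```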
not-newline, nonl
Match any character except a newline.
Corresponding string regexp: ‘.’ (dot)
anychar, anything
Match any character.
Corresponding string regexp: ‘.\|\n’ (for example)
character class
Match a character from a named character class:
alpha, alphabetic, letter
Match alphabetic characters. More precisely, match characters whose Unicode ‘general-category’ property indicates that they are alphabetic.
alnum, alphanumeric
Match alphabetic characters and digits. More precisely, match characters whose Unicode ‘general-category’ property indicates that they are alphabetic or decimal digits.
digit, numeric, num
Match the digits ‘0’–‘9’.
xdigit, hex-digit, hex
Match the hexadecimal digits ‘0’–‘9’, ‘A’–‘F’ and ‘a’–‘f’.
cntrl, control
Match any character whose code is in the range 0–31.
blank
Match horizontal whitespace. More precisely, match characters whose Unicode ‘general-category’ property indicates that they are spacing separators.
space, whitespace, white
Match any character that has whitespace syntax (see Syntax Class Table).
lower, lower-case
Match anything lower-case, as determined by the current case table. If case-fold-search is non-nil, this also matches any upper-case letter.
upper, upper-case
Match anything upper-case, as determined by the current case table. If case-fold-search is non-nil, this also matches any lower-case letter.
graph, graphic
Match any character except whitespace, ASCII and non-ASCII control characters, surrogates, and codepoints unassigned by Unicode, as indicated by the Unicode ‘general-category’ property.
print, printing
Match whitespace or a character matched by graph.
punct, punctuation
Match any punctuation character. (At present, for multibyte characters, anything that has non-word syntax.)
word, wordchar
Match any character that has word syntax (see Syntax Class Table).
ascii
Match any ASCII character (codes 0–127).
nonascii
Match any non-ASCII character (but not raw bytes).
Corresponding string regexp: ‘[[:class:]]’
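Character classes can appear on their own or inside an any form together with other characters. The sketch below, with hypothetical names, uses them for a simple identifier pattern and a hexadecimal-number test.

```elisp
(require 'rx)

;; A letter or underscore followed by letters, digits or underscores.
(defconst my-identifier-regexp
  (rx (any alpha ?_) (zero-or-more (any alnum ?_))))

;; A string consisting only of hexadecimal digits.
(defconst my-hex-number-regexp
  (rx bos (one-or-more hex-digit) eos))

(defun my-first-identifier (text)
  "Return the first identifier-like token in TEXT, or nil."
  (when (string-match my-identifier-regexp text)
    (match-string 0 text)))

(my-first-identifier "42 foo_1 bar")
(string-match-p my-hex-number-regexp "DEADbeef")
```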
(syntax syntax)
Match a character with syntax syntax, being one of the following names:
Syntax name | Syntax character |
---|---|
whitespace | - |
punctuation | . |
word | w |
symbol | _ |
open-parenthesis | ( |
close-parenthesis | ) |
expression-prefix | ' |
string-quote | " |
paired-delimiter | $ |
escape | \ |
character-quote | / |
comment-start | < |
comment-end | > |
string-delimiter | \| |
comment-delimiter | ! |
For details, see Syntax Class Table. Please note that (syntax punctuation) is not equivalent to the character class punctuation.
Corresponding string regexp: ‘\scode’, where code is the syntax character from the table above.
(category category)
Match a character in category category, which is either one of the names below or its category character.
Category name | Category character |
---|---|
space-for-indent | space |
base | . |
consonant | 0 |
base-vowel | 1 |
upper-diacritical-mark | 2 |
lower-diacritical-mark | 3 |
tone-mark | 4 |
symbol | 5 |
digit | 6 |
vowel-modifying-diacritical-mark | 7 |
vowel-sign | 8 |
semivowel-lower | 9 |
not-at-end-of-line | < |
not-at-beginning-of-line | > |
alpha-numeric-two-byte | A |
chinese-two-byte | C |
greek-two-byte | G |
japanese-hiragana-two-byte | H |
indian-two-byte | I |
japanese-katakana-two-byte | K |
strong-left-to-right | L |
korean-hangul-two-byte | N |
strong-right-to-left | R |
cyrillic-two-byte | Y |
combining-diacritic | ^ |
ascii | a |
arabic | b |
chinese | c |
ethiopic | e |
greek | g |
korean | h |
indian | i |
japanese | j |
japanese-katakana | k |
latin | l |
lao | o |
tibetan | q |
japanese-roman | r |
thai | t |
vietnamese | v |
hebrew | w |
cyrillic | y |
can-break | \| |
For more information about currently defined categories, run the command M-x describe-categories RET. For how to define new categories, see Categories.
Corresponding string regexp: ‘\ccode’, where code is the category character from the table above.
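As a brief illustration of the syntax and category forms, the sketch below shows the string regexps they translate to and a match for each. The function name my-syntax-category-demo is hypothetical. The standard syntax table is used so that the result does not depend on the current buffer's major mode.

```elisp
(require 'rx)

(rx (syntax whitespace))   ; "\\s-"
(rx (category greek))      ; "\\cg"

(defun my-syntax-category-demo ()
  "Return match positions for a syntax form and a category form."
  (with-syntax-table (standard-syntax-table)
    (list
     ;; First character with whitespace syntax in "ab cd".
     (string-match-p (rx (syntax whitespace)) "ab cd")
     ;; First Greek character; \u03b1 is GREEK SMALL LETTER ALPHA.
     (string-match-p (rx (category greek)) "abc \u03b1"))))

(my-syntax-category-demo)
```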
Zero-width assertions
These all match the empty string, but only in specific places.
line-start, bol
Match at the beginning of a line.
Corresponding string regexp: ‘^’
line-end, eol
Match at the end of a line.
Corresponding string regexp: ‘$’
string-start, bos, buffer-start, bot
Match at the start of the string or buffer being matched against.
Corresponding string regexp: ‘\`’
string-end, eos, buffer-end, eot
Match at the end of the string or buffer being matched against.
Corresponding string regexp: ‘\'’
point
Match at point.
Corresponding string regexp: ‘\=’
word-start, bow
Match at the beginning of a word.
Corresponding string regexp: ‘\<’
word-end, eow
Match at the end of a word.
Corresponding string regexp: ‘\>’
word-boundary
Match at the beginning or end of a word.
Corresponding string regexp: ‘\b’
not-word-boundary
Match anywhere but at the beginning or end of a word.
Corresponding string regexp: ‘\B’
symbol-start
Match at the beginning of a symbol.
Corresponding string regexp: ‘\_<’
symbol-end
Match at the end of a symbol.
Corresponding string regexp: ‘\_>’
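The following sketch shows how a few of these assertions restrict where a match may occur. The function name my-assertion-demo is hypothetical. It assumes the standard syntax table, in which ‘-’ is a symbol constituent, so that the results do not depend on the current major mode.

```elisp
(require 'rx)

(defun my-assertion-demo ()
  "Return match positions illustrating some zero-width assertions."
  (with-syntax-table (standard-syntax-table)
    (list
     ;; "cat" as a whole word: found at position 2.
     (string-match-p (rx word-start "cat" word-end) "a cat sat")
     ;; "cat" inside a longer word: no match.
     (string-match-p (rx word-start "cat" word-end) "concatenate")
     ;; `line-start' also matches right after a newline in a string.
     (string-match-p (rx line-start "b") "a\nb")
     ;; `string-start' matches only at the very beginning.
     (string-match-p (rx string-start "b") "a\nb")
     ;; "foo" is a word but not a whole symbol in "foo-bar".
     (string-match-p (rx symbol-start "foo" symbol-end) "foo-bar"))))

(my-assertion-demo)
```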
Capture groups
(group rx…)
(submatch rx…)
Match the rxs, making the matched text and position accessible in the match data. The first group in a regexp is numbered 1; subsequent groups will be numbered one higher than the previous group.
Corresponding string regexp: ‘\(…\)’
(group-n n rx…)
(submatch-n n rx…)
Like group, but explicitly assign the group number n. n must be positive.
Corresponding string regexp: ‘\(?n:…\)’
(backref n)
Match the text previously matched by group number n. n must be in the range 1–9.
Corresponding string regexp: ‘\n’
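Here is a minimal sketch of the capture-group forms in use. The function names are hypothetical. The first function uses group with backref to find a doubled word; the second uses group-n so that two alternatives store their match under the same group number.

```elisp
(require 'rx)

(defun my-doubled-word (text)
  "Return the first word in TEXT that is immediately repeated, or nil."
  (when (string-match
         (rx word-start (group (one-or-more alpha))
             (one-or-more " ")
             (backref 1) word-end)
         text)
    (match-string 1 text)))

(defun my-bracketed-name (text)
  "Return the name inside <...> or [...] in TEXT, or nil."
  (when (string-match
         (rx (or (seq "<" (group-n 1 (one-or-more alpha)) ">")
                 (seq "[" (group-n 1 (one-or-more alpha)) "]")))
         text)
    ;; Whichever alternative matched, the name is in group 1.
    (match-string 1 text)))

(my-doubled-word "this is is a test")
(my-bracketed-name "see [abc]")
```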
Dynamic inclusion
(literal expr)
Match the literal string that is the result from evaluating the Lisp expression expr. The evaluation takes place at call time, in the current lexical environment.
(regexp expr)
(regex expr)
Match the string regexp that is the result from evaluating the Lisp expression expr. The evaluation takes place at call time, in the current lexical environment.
(eval expr)
Match the rx form that is the result from evaluating the Lisp expression expr. The evaluation takes place at macro-expansion time for rx, at call time for rx-to-string, in the current global environment.
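The three dynamic-inclusion forms might be used as in this sketch. Every name beginning with my- is hypothetical, and the suffix regexp is just an example value.

```elisp
(require 'rx)

(defun my-numbered-name-regexp (prefix)
  "Return a regexp matching PREFIX literally, followed by digits."
  ;; `literal' quotes PREFIX at call time, so "." in it is not special.
  (rx bos (literal prefix) (one-or-more digit) eos))

(defvar my-suffix-regexp "\\.\\(?:el\\|elc\\)"
  "An existing string regexp to be spliced in with `regexp'.")

(defun my-file-regexp (base)
  "Return a regexp matching BASE followed by `my-suffix-regexp'."
  (rx bos (literal base) (regexp my-suffix-regexp) eos))

;; `eval' computes an rx form when the `rx' macro is expanded; here
;; the expression yields the form (or "yes" "no").
(defconst my-yes-no-regexp
  (rx bos (eval (cons 'or '("yes" "no"))) eos))

(string-match-p (my-numbered-name-regexp "a.b") "a.b42")
(string-match-p (my-file-regexp "init") "init.elc")
(string-match-p my-yes-no-regexp "no")
```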
34.3.3.2 Functions and macros using rx regexps
Macro: rx rx-expr…
Translate the rx-exprs to a string regexp, as if they were the body of a (seq …) form. The rx macro expands to a string constant, or, if literal or regexp forms are used, a Lisp expression that evaluates to a string.
Function: rx-to-string rx-expr &optional no-group
Translate rx-expr to a string regexp which is returned. If no-group is absent or nil, bracket the result in a non-capturing group, ‘\(?:…\)’, if necessary to ensure that a postfix operator appended to it will apply to the whole expression.
Arguments to literal and regexp forms in rx-expr must be string literals.
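Since rx-to-string is an ordinary function, it can translate a form that is only known at run time. The sketch below, using hypothetical function names, builds such a form with backquote and also shows the effect of the no-group argument when a postfix operator is appended to the result.

```elisp
(require 'rx)

(defun my-any-of-words-regexp (words)
  "Return a regexp matching any string in the list WORDS as a whole word."
  ;; WORDS is spliced into the form, so the `or' arguments are strings.
  (rx-to-string `(seq word-start (or ,@words) word-end) t))

(defun my-repeat-regexp (rx-form &optional no-group)
  "Translate RX-FORM and append a `*' operator to the resulting string."
  (concat (rx-to-string rx-form no-group) "*"))

(string-match-p (my-any-of-words-regexp '("if" "else")) "x else y")

;; With the default bracketing, `*' applies to all of "ab".
(let ((s "ababab"))
  (string-match (my-repeat-regexp '(seq "a" "b")) s)
  (match-end 0))

;; With NO-GROUP non-nil, `*' applies only to the final "b".
(let ((s "ababab"))
  (string-match (my-repeat-regexp '(seq "a" "b") t) s)
  (match-end 0))
```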
The pcase macro can use rx expressions as patterns directly; see rx in pcase.
For mechanisms to add user-defined extensions to the rx notation, see Extending Rx.
34.3.3.3 Defining new rx forms
The rx notation can be extended by defining new symbols and parameterized forms in terms of other rx expressions. This is handy for sharing parts between several regexps, and for making complex ones easier to build and understand by putting them together from smaller pieces.
For example, you could define name to mean (one-or-more letter), and (quoted x) to mean (seq ?' x ?') for any x. These forms could then be used in rx expressions like any other: (rx (quoted name)) would match a nonempty sequence of letters inside single quotes.
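One possible way to write the name and quoted definitions just mentioned is with rx-let, which is described later in this section. The function name my-quoted-name-p is hypothetical.

```elisp
(require 'rx)

(defun my-quoted-name-p (string)
  "Return non-nil if STRING is a nonempty run of letters in single quotes."
  (rx-let ((name (one-or-more letter))
           (quoted (x) (seq ?' x ?')))
    (string-match-p (rx bos (quoted name) eos) string)))

(my-quoted-name-p "'hello'")
```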
The Lisp macros below provide different ways of binding names to definitions. Common to all of them are the following rules:
- Built-in rx forms, like digit and group, cannot be redefined.
- The definitions live in a name space of their own, separate from that of Lisp variables. There is thus no need to attach a suffix like -regexp to names; they cannot collide with anything else.
- Definitions cannot refer to themselves recursively, directly or indirectly. If you find yourself needing this, you want a parser, not a regular expression.
- Definitions are only ever expanded in calls to rx or rx-to-string, not merely by their presence in definition macros. This means that the order of definitions doesn't matter, even when they refer to each other, and that syntax errors only show up when they are used, not when they are defined.
- User-defined forms are allowed wherever arbitrary rx expressions are expected; for example, in the body of a zero-or-one form, but not inside any or category forms. They are also allowed inside not and intersection forms.
Macro: rx-define name [arglist] rx-form
Define name globally in all subsequent calls to rx and rx-to-string. If arglist is absent, then name is defined as a plain symbol to be replaced with rx-form. Example:
(rx-define haskell-comment (seq "--" (zero-or-more nonl)))
(rx haskell-comment)
     ⇒ "--.*"
If arglist is present, it must be a list of zero or more argument names, and name is then defined as a parameterized form. When used in an rx expression as (name arg…), each arg will replace the corresponding argument name inside rx-form.
arglist may end in &rest and one final argument name, denoting a rest parameter. The rest parameter will expand to all extra actual argument values not matched by any other parameter in arglist, spliced into rx-form where it occurs. Example:
(rx-define moan (x y &rest r) (seq x (one-or-more y) r "!"))
(rx (moan "MOO" "A" "MEE" "OW"))
     ⇒ "MOOA+MEEOW!"
Since the definition is global, it is recommended to give name a package prefix to avoid name clashes with definitions elsewhere, as is usual when naming non-local variables and functions.
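Following the naming recommendation above, a package might define its forms as in this sketch. The prefix my-pkg- and all names using it are hypothetical.

```elisp
(require 'rx)

;; A plain symbol definition with a package prefix.
(rx-define my-pkg-identifier
  (seq (any alpha ?_) (zero-or-more (any alnum ?_))))

;; A parameterized definition; BODY is a rest parameter whose
;; arguments are spliced in between OPEN and CLOSE.
(rx-define my-pkg-bracketed (open close &rest body)
  (seq open body close))

;; Both definitions used together: an identifier followed by a
;; bracketed number, such as "arr[12]".
(defconst my-pkg-index-regexp
  (rx bos
      my-pkg-identifier
      (my-pkg-bracketed "[" "]" (one-or-more digit))
      eos))

(string-match-p my-pkg-index-regexp "arr[12]")
```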
Macro: rx-let (bindings…) body…
Make the rx definitions in bindings available locally for rx macro invocations in body, which is then evaluated.
Each element of bindings is of the form (name [arglist] rx-form), where the parts have the same meaning as in rx-define above. Example:
(rx-let ((comma-separated (item) (seq item (0+ "," item)))
         (number (1+ digit))
         (numbers (comma-separated number)))
  (re-search-forward (rx "(" numbers ")")))
The definitions are only available during the macro-expansion of body, and are thus not present during execution of compiled code.
rx-let can be used not only inside a function, but also at top level to include global variable and function definitions that need to share a common set of rx forms. Since the names are local inside body, there is no need for any package prefixes. Example:
(rx-let ((phone-number (seq (opt ?+) (1+ (any digit ?-)))))
  (defun find-next-phone-number ()
    (re-search-forward (rx phone-number)))
  (defun phone-number-p (string)
    (string-match-p (rx bos phone-number eos) string)))
The scope of the rx-let bindings is lexical, which means that they are not visible outside body itself, even in functions called from body.
Macro: rx-let-eval bindings body…
Evaluate bindings to a list of bindings as in rx-let, and evaluate body with those bindings in effect for calls to rx-to-string.
This macro is similar to rx-let, except that the bindings argument is evaluated (and thus needs to be quoted if it is a list literal), and the definitions are substituted at run time, which is required for rx-to-string to work. Example:
(rx-let-eval
    '((ponder (x) (seq "Where have all the " x " gone?")))
  (looking-at (rx-to-string
               '(ponder (or "flowers" "young girls"
                            "left socks")))))
Another difference from rx-let is that the bindings are dynamically scoped, and thus also available in functions called from body. However, they are not visible inside functions defined in body.
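The dynamic scoping of rx-let-eval can be sketched as follows. The form shout and both function names are hypothetical. The helper function is defined outside the rx-let-eval body and merely called from it, yet its call to rx-to-string still sees the binding.

```elisp
(require 'rx)

(defun my-shout-regexp (word)
  "Build a regexp from the `shout' form, which the caller must bind."
  (rx-to-string `(seq bos (shout ,word) eos)))

(defun my-shouted-p (string)
  "Return t if STRING is \"hey\" followed by one or more \"!\"."
  (rx-let-eval '((shout (x) (seq x (one-or-more "!"))))
    ;; `my-shout-regexp' is called from the body, so the dynamically
    ;; scoped `shout' binding is in effect inside it.
    (and (string-match-p (my-shout-regexp "hey") string) t)))

(my-shouted-p "hey!!!")
```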
[1] It could be written much more simply with non-greedy operators (how?), but that would make the example less interesting.