Skip to main content

23.7.1 Simple Minded Indentation Engine

SMIE is a package that provides a generic navigation and indentation engine. Based on a very simple parser using an operator precedence grammar, it lets major modes extend the sexp-based navigation of Lisp to non-Lisp languages as well as provide a simple to use but reliable auto-indentation.

Operator precedence grammar is a very primitive technology for parsing compared to some of the more common techniques used in compilers. It has the following characteristics: its parsing power is very limited, and it is largely unable to detect syntax errors, but it has the advantage of being algorithmically efficient and able to parse forward just as well as backward. In practice that means that SMIE can use it for indentation based on backward parsing, that it can provide both forward-sexp and backward-sexp functionality, and that it will naturally work on syntactically incorrect code without any extra effort. The downside is that it also means that most programming languages cannot be parsed correctly using SMIE, at least not without resorting to some special tricks (see SMIE Tricks).

SMIE setup  SMIE setup and features.
Operator Precedence Grammars  A very simple parsing technique.
SMIE Grammar  Defining the grammar of a language.
SMIE Lexer  Defining tokens.
SMIE Tricks  Working around the parser’s limitations.
SMIE Indentation  Specifying indentation rules.
SMIE Indentation Helpers  Helper functions for indentation rules.
SMIE Indentation Example  Sample indentation rules.
SMIE Customization  Customizing indentation.

23.7.1.1 SMIE Setup and Features

SMIE is meant to be a one-stop shop for structural navigation and various other features which rely on the syntactic structure of code, in particular automatic indentation. The main entry point is smie-setup which is a function typically called while setting up a major mode.

function smie-setup grammar rules-function \&rest keywords

Setup SMIE navigation and indentation. grammar is a grammar table generated by smie-prec2->grammar. rules-function is a set of indentation rules for use on smie-rules-function. keywords are additional arguments, which can include the following keywords:

  • :forward-token fun: Specify the forward lexer to use.
  • :backward-token fun: Specify the backward lexer to use.

Calling this function is sufficient to make commands such as forward-sexp, backward-sexp, and transpose-sexps be able to properly handle structural elements other than just the paired parentheses already handled by syntax tables. For example, if the provided grammar is precise enough, transpose-sexps can correctly transpose the two arguments of a + operator, taking into account the precedence rules of the language.

Calling smie-setup is also sufficient to make TAB indentation work in the expected way, extends blink-matching-paren to apply to elements like begin...end, and provides some commands that you can bind in the major mode keymap.

command smie-close-block

This command closes the most recently opened (and not yet closed) block.

command smie-down-list \&optional arg

This command is like down-list but it also pays attention to nesting of tokens other than parentheses, such as begin...end.

23.7.1.2 Operator Precedence Grammars

SMIE’s precedence grammars simply give to each token a pair of precedences: the left-precedence and the right-precedence. We say T1 < T2 if the right-precedence of token T1 is less than the left-precedence of token T2. A good way to read this < is as a kind of parenthesis: if we find ... T1 something T2 ... then that should be parsed as ... T1 (something T2 ... rather than as ... T1 something) T2 .... The latter interpretation would be the case if we had T1 > T2. If we have T1 = T2, it means that token T2 follows token T1 in the same syntactic construction, so typically we have "begin" = "end". Such pairs of precedences are sufficient to express left-associativity or right-associativity of infix operators, nesting of tokens like parentheses and many other cases.

function smie-prec2-﹥grammar table

This function takes a prec2 grammar table and returns an alist suitable for use in smie-setup. The prec2 table is itself meant to be built by one of the functions below.

function smie-merge-prec2s \&rest tables

This function takes several prec2 tables and merges them into a new prec2 table.

function smie-precs-﹥prec2 precs

This function builds a prec2 table from a table of precedences precs. precs should be a list, sorted by precedence (for example "+" will come before "*"), of elements of the form (assoc op ...), where each op is a token that acts as an operator; assoc is their associativity, which can be either left, right, assoc, or nonassoc. All operators in a given element share the same precedence level and associativity.

function smie-bnf-﹥prec2 bnf \&rest resolvers

This function lets you specify the grammar using a BNF notation. It accepts a bnf description of the grammar along with a set of conflict resolution rules resolvers, and returns a prec2 table.

bnf is a list of nonterminal definitions of the form (nonterm rhs1 rhs2 ...) where each rhs is a (non-empty) list of terminals (aka tokens) or non-terminals.

Not all grammars are accepted:

  • An rhs cannot be an empty list (an empty list is never needed, since SMIE allows all non-terminals to match the empty string anyway).
  • An rhs cannot have 2 consecutive non-terminals: each pair of non-terminals needs to be separated by a terminal (aka token). This is a fundamental limitation of operator precedence grammars.

Additionally, conflicts can occur:

  • The returned prec2 table holds constraints between pairs of tokens, and for any given pair only one constraint can be present: T1 ﹤ T2, T1 = T2, or T1 ﹥ T2.
  • A token can be an opener (something similar to an open-paren), a closer (like a close-paren), or neither of the two (e.g., an infix operator, or an inner token like "else").

Precedence conflicts can be resolved via resolvers, which is a list of precs tables (see smie-precs->prec2): for each precedence conflict, if those precs tables specify a particular constraint, then the conflict is resolved by using this constraint instead, else a conflict is reported and one of the conflicting constraints is picked arbitrarily and the others are simply ignored.

23.7.1.3 Defining the Grammar of a Language

The usual way to define the SMIE grammar of a language is by defining a new global variable that holds the precedence table by giving a set of BNF rules. For example, the grammar definition for a small Pascal-like language could look like:

(require 'smie)
(defvar sample-smie-grammar
(smie-prec2->grammar
(smie-bnf->prec2
    '((id)
(inst ("begin" insts "end")
("if" exp "then" inst "else" inst)
(id ":=" exp)
(exp))
(insts (insts ";" insts) (inst))
(exp (exp "+" exp)
(exp "*" exp)
("(" exps ")"))
(exps (exps "," exps) (exp)))
    '((assoc ";"))
'((assoc ","))
'((assoc "+") (assoc "*")))))

A few things to note:

  • The above grammar does not explicitly mention the syntax of function calls: SMIE will automatically allow any sequence of sexps, such as identifiers, balanced parentheses, or begin ... end blocks to appear anywhere anyway.
  • The grammar category id has no right hand side: this does not mean that it can match only the empty string, since as mentioned any sequence of sexps can appear anywhere anyway.
  • Because non terminals cannot appear consecutively in the BNF grammar, it is difficult to correctly handle tokens that act as terminators, so the above grammar treats ";" as a statement separator instead, which SMIE can handle very well.
  • Separators used in sequences (such as "," and ";" above) are best defined with BNF rules such as (foo (foo "separator" foo) ...) which generate precedence conflicts which are then resolved by giving them an explicit (assoc "separator").
  • The ("(" exps ")") rule was not needed to pair up parens, since SMIE will pair up any characters that are marked as having paren syntax in the syntax table. What this rule does instead (together with the definition of exps) is to make it clear that "," should not appear outside of parentheses.
  • Rather than have a single precs table to resolve conflicts, it is preferable to have several tables, so as to let the BNF part of the grammar specify relative precedences where possible.
  • Unless there is a very good reason to prefer left or right, it is usually preferable to mark operators as associative, using assoc. For that reason "+" and "*" are defined above as assoc, although the language defines them formally as left associative.

23.7.1.4 Defining Tokens

SMIE comes with a predefined lexical analyzer which uses syntax tables in the following way: any sequence of characters that have word or symbol syntax is considered a token, and so is any sequence of characters that have punctuation syntax. This default lexer is often a good starting point but is rarely actually correct for any given language. For example, it will consider "2,+3" to be composed of 3 tokens: "2", ",+", and "3".

To describe the lexing rules of your language to SMIE, you need 2 functions, one to fetch the next token, and another to fetch the previous token. Those functions will usually first skip whitespace and comments and then look at the next chunk of text to see if it is a special token. If so it should skip the token and return a description of this token. Usually this is simply the string extracted from the buffer, but it can be anything you want. For example:

(defvar sample-keywords-regexp
(regexp-opt '("+" "*" "," ";" ">" ">=" "<" "<=" ":=" "=")))
(defun sample-smie-forward-token ()
(forward-comment (point-max))
(cond
((looking-at sample-keywords-regexp)
(goto-char (match-end 0))
(match-string-no-properties 0))
(t (buffer-substring-no-properties
(point)
(progn (skip-syntax-forward "w_")
(point))))))
(defun sample-smie-backward-token ()
(forward-comment (- (point)))
(cond
((looking-back sample-keywords-regexp (- (point) 2) t)
(goto-char (match-beginning 0))
(match-string-no-properties 0))
(t (buffer-substring-no-properties
(point)
(progn (skip-syntax-backward "w_")
(point))))))

Notice how those lexers return the empty string when in front of parentheses. This is because SMIE automatically takes care of the parentheses defined in the syntax table. More specifically if the lexer returns nil or an empty string, SMIE tries to handle the corresponding text as a sexp according to syntax tables.

23.7.1.5 Living With a Weak Parser

The parsing technique used by SMIE does not allow tokens to behave differently in different contexts. For most programming languages, this manifests itself by precedence conflicts when converting the BNF grammar.

Sometimes, those conflicts can be worked around by expressing the grammar slightly differently. For example, for Modula-2 it might seem natural to have a BNF grammar that looks like this:

  ...
(inst ("IF" exp "THEN" insts "ELSE" insts "END")
("CASE" exp "OF" cases "END")
...)
(cases (cases "|" cases)
(caselabel ":" insts)
("ELSE" insts))
...

But this will create conflicts for "ELSE": on the one hand, the IF rule implies (among many other things) that "ELSE" = "END"; but on the other hand, since "ELSE" appears within cases, which appears left of "END", we also have "ELSE" > "END". We can solve the conflict either by using:

  ...
(inst ("IF" exp "THEN" insts "ELSE" insts "END")
("CASE" exp "OF" cases "END")
("CASE" exp "OF" cases "ELSE" insts "END")
...)
(cases (cases "|" cases) (caselabel ":" insts))
...

or

  ...
(inst ("IF" exp "THEN" else "END")
("CASE" exp "OF" cases "END")
...)
(else (insts "ELSE" insts))
(cases (cases "|" cases) (caselabel ":" insts) (else))
...

Reworking the grammar to try and solve conflicts has its downsides, tho, because SMIE assumes that the grammar reflects the logical structure of the code, so it is preferable to keep the BNF closer to the intended abstract syntax tree.

Other times, after careful consideration you may conclude that those conflicts are not serious and simply resolve them via the resolvers argument of smie-bnf->prec2. Usually this is because the grammar is simply ambiguous: the conflict does not affect the set of programs described by the grammar, but only the way those programs are parsed. This is typically the case for separators and associative infix operators, where you want to add a resolver like '((assoc "|")). Another case where this can happen is for the classic dangling else problem, where you will use '((assoc "else" "then")). It can also happen for cases where the conflict is real and cannot really be resolved, but it is unlikely to pose a problem in practice.

Finally, in many cases some conflicts will remain despite all efforts to restructure the grammar. Do not despair: while the parser cannot be made more clever, you can make the lexer as smart as you want. So, the solution is then to look at the tokens involved in the conflict and to split one of those tokens into 2 (or more) different tokens. E.g., if the grammar needs to distinguish between two incompatible uses of the token "begin", make the lexer return different tokens (say "begin-fun" and "begin-plain") depending on which kind of "begin" it finds. This pushes the work of distinguishing the different cases to the lexer, which will thus have to look at the surrounding text to find ad-hoc clues.

23.7.1.6 Specifying Indentation Rules

Based on the provided grammar, SMIE will be able to provide automatic indentation without any extra effort. But in practice, this default indentation style will probably not be good enough. You will want to tweak it in many different cases.

SMIE indentation is based on the idea that indentation rules should be as local as possible. To this end, it relies on the idea of virtual indentation, which is the indentation that a particular program point would have if it were at the beginning of a line. Of course, if that program point is indeed at the beginning of a line, its virtual indentation is its current indentation. But if not, then SMIE uses the indentation algorithm to compute the virtual indentation of that point. Now in practice, the virtual indentation of a program point does not have to be identical to the indentation it would have if we inserted a newline before it. To see how this works, the SMIE rule for indentation after a { in C does not care whether the { is standing on a line of its own or is at the end of the preceding line. Instead, these different cases are handled in the indentation rule that decides how to indent before a {.

Another important concept is the notion of parent: The parent of a token, is the head token of the nearest enclosing syntactic construct. For example, the parent of an else is the if to which it belongs, and the parent of an if, in turn, is the lead token of the surrounding construct. The command backward-sexp jumps from a token to its parent, but there are some caveats: for openers (tokens which start a construct, like if), you need to start with point before the token, while for others you need to start with point after the token. backward-sexp stops with point before the parent token if that is the opener of the token of interest, and otherwise it stops with point after the parent token.

SMIE indentation rules are specified using a function that takes two arguments method and arg where the meaning of arg and the expected return value depend on method.

method can be:

  • :after, in which case arg is a token and the function should return the offset to use for indentation after arg.
  • :before, in which case arg is a token and the function should return the offset to use to indent arg itself.
  • :elem, in which case the function should return either the offset to use to indent function arguments (if arg is the symbol arg) or the basic indentation step (if arg is the symbol basic).
  • :list-intro, in which case arg is a token and the function should return non-nil if the token is followed by a list of expressions (not separated by any token) rather than an expression.

When arg is a token, the function is called with point just before that token. A return value of nil always means to fallback on the default behavior, so the function should return nil for arguments it does not expect.

offset can be:

  • nil: use the default indentation rule.
  • (column . column): indent to column column.
  • number: offset by number, relative to a base token which is the current token for :after and its parent for :before.

23.7.1.7 Helper Functions for Indentation Rules

SMIE provides various functions designed specifically for use in the indentation rules function (several of those functions break if used in another context). These functions all start with the prefix smie-rule-.

function smie-rule-bolp

Return non-nil if the current token is the first on the line.

function smie-rule-hanging-p

Return non-nil if the current token is hanging. A token is hanging if it is the last token on the line and if it is preceded by other tokens: a lone token on a line is not hanging.

function smie-rule-next-p \&rest tokens

Return non-nil if the next token is among tokens.

function smie-rule-prev-p \&rest tokens

Return non-nil if the previous token is among tokens.

function smie-rule-parent-p \&rest parents

Return non-nil if the current token’s parent is among parents.

function smie-rule-sibling-p

Return non-nil if the current token’s parent is actually a sibling. This is the case for example when the parent of a "," is just the previous ",".

function smie-rule-parent \&optional offset

Return the proper offset to align the current token with the parent. If non-nil, offset should be an integer giving an additional offset to apply.

function smie-rule-separator method

Indent current token as a separator.

By separator, we mean here a token whose sole purpose is to separate various elements within some enclosing syntactic construct, and which does not have any semantic significance in itself (i.e., it would typically not exist as a node in an abstract syntax tree).

Such a token is expected to have an associative syntax and be closely tied to its syntactic parent. Typical examples are "," in lists of arguments (enclosed inside parentheses), or ";" in sequences of instructions (enclosed in a {...} or begin...end block).

method should be the method name that was passed to smie-rules-function.

23.7.1.8 Sample Indentation Rules

Here is an example of an indentation function:

(defun sample-smie-rules (kind token)
(pcase (cons kind token)
(`(:elem . basic) sample-indent-basic)
(`(,_ . ",") (smie-rule-separator kind))
(`(:after . ":=") sample-indent-basic)
(`(:before . ,(or `"begin" `"(" `"{"))
(if (smie-rule-hanging-p) (smie-rule-parent)))
(`(:before . "if")
(and (not (smie-rule-bolp)) (smie-rule-prev-p "else")
(smie-rule-parent)))))

A few things to note:

  • The first case indicates the basic indentation increment to use. If sample-indent-basic is nil, then SMIE uses the global setting smie-indent-basic. The major mode could have set smie-indent-basic buffer-locally instead, but that is discouraged.

  • The rule for the token "," make SMIE try to be more clever when the comma separator is placed at the beginning of lines. It tries to outdent the separator so as to align the code after the comma; for example:

    x = longfunctionname (
    arg1
    , arg2
    );
  • The rule for indentation after ":=" exists because otherwise SMIE would treat ":=" as an infix operator and would align the right argument with the left one.

  • The rule for indentation before "begin" is an example of the use of virtual indentation: This rule is used only when "begin" is hanging, which can happen only when "begin" is not at the beginning of a line. So this is not used when indenting "begin" itself but only when indenting something relative to this "begin". Concretely, this rule changes the indentation from:

        if x > 0 then begin
    dosomething(x);
    end

    to

        if x > 0 then begin
    dosomething(x);
    end
  • The rule for indentation before "if" is similar to the one for "begin", but where the purpose is to treat "else if" as a single unit, so as to align a sequence of tests rather than indent each test further to the right. This function does this only in the case where the "if" is not placed on a separate line, hence the smie-rule-bolp test.

    If we know that the "else" is always aligned with its "if" and is always at the beginning of a line, we can use a more efficient rule:

    ((equal token "if")
    (and (not (smie-rule-bolp))
    (smie-rule-prev-p "else")
    (save-excursion
    (sample-smie-backward-token)
    (cons 'column (current-column)))))

    The advantage of this formulation is that it reuses the indentation of the previous "else", rather than going all the way back to the first "if" of the sequence.

23.7.1.9 Customizing Indentation

If you are using a mode whose indentation is provided by SMIE, you can customize the indentation to suit your preferences. You can do this on a per-mode basis (using the option smie-config), or a per-file basis (using the function smie-config-local in a file-local variable specification).

user option smie-config

This option lets you customize indentation on a per-mode basis. It is an alist with elements of the form (mode . rules). For the precise form of rules, see the variable’s documentation; but you may find it easier to use the command smie-config-guess.

command smie-config-guess

This command tries to work out appropriate settings to produce your preferred style of indentation. Simply call the command while visiting a file that is indented with your style.

command smie-config-save

Call this command after using smie-config-guess, to save your settings for future sessions.

command smie-config-show-indent \&optional move

This command displays the rules that are used to indent the current line.

command smie-config-set-indent

This command adds a local rule to adjust the indentation of the current line.

function smie-config-local rules

This function adds rules as indentation rules for the current buffer. These add to any mode-specific rules defined by the smie-config option. To specify custom indentation rules for a specific file, add an entry to the file’s local variables of the form: eval: (smie-config-local '(rules)).