Building the PSI Tree
One of the main responsibilities of a custom language implementation is to build a PSI tree which represents the syntactic structure of the file, from class declarations, to keywords, to whitespace and comments. This tree is required for just about all other functionality in ReSharper - refactoring, inspections, formatting, etc. all require manipulating or walking a PSI tree.
ReSharper follows the traditional method of building a parse tree, namely, lexical analysis ("lexing") and parsing. Lexing is the process of converting an input text buffer into a stream of tokens, such as keyword, identifier, whitespace, operator and so on. Parsing then analyses this stream of tokens, looking for known sequences, such as class declaration, assignment operation, method invocation, etc.
The result of parsing is to build a tree structure, where the leaf nodes represent the tokens from lexing (the terminal symbols), and the interior or branch nodes are semantic constructs, that represent the class declaration, etc. (non-terminals). It is now possible to walk the tree to find out what makes up a file, what statements and expressions are in a method declaration, and what identifiers and operators are in an expression.
PSI tree structure
All languages supported by ReSharper implement a tree that uses the same base interfaces. While the different nodes in each tree are implemented by different classes, they all implement the
ITreeNode interface, with the root node also implementing
IFile. This allows for common methods for manipulating and walking the tree. As such, building the PSI tree is a similar exercise for all languages, but it also means that all language support must be implemented from scratch to ReSharper's requirements. There is no direct support for third party tooling, however, as long as the tooling can efficiently create the correct data structures, there is no requirement to use the ReSharper provided tooling (care should be taken to avoid allocations as much as possible - files are reparsed very frequently).
To identify the type of a node, the
ITreeNode interface includes the
NodeType property, which is also of type
NodeType. Each language creates classes that derive from
NodeType that represent the types of each node in the tree. For example, the leaf nodes of the tree, which represent the tokens identified during lexing, all derive from
TokenNodeType, and the interior nodes derive from
When lexing, ReSharper languages produce a stream of tokens, by returning references to singleton instances of
TokenNodeType derived classes, such as C#'s
ClassKeywordNodeType. The lexer needs to implement the
ILexer interface, which exposes the current token type and the start and end offset of the token itself.
The lexer is usually generated from a
.lex file that describes the rules to create the tokens. The ReSharper SDK ships a tool to convert the
.lex file into a class that will implement the lexer. This is a partial class, and requires an accompanying boilerplate partial class that will implement
ILexer based on the generated lexer.
ReSharper also provides support for incremental lexing, which will reuse existing parts of a stream of tokens when only a portion of a file has changed. The results of lexing can also be persisted to ReSharper's disk based caches, to speed up re-opening a file.
The parser uses an
ILexer to advance through the file, analysing the token types it discovers along the way. It first creates an instance of
IFile, and then adds children which represent the top level constructs in the file (e.g. a C# file would have
using statements, namespace declarations or class declarations). Each child would then have further children (e.g. a namespace declaration would have the
namespace keyword, whitespace, the name identifier of the namespace as well as the namespace body), and so on (class declarations, type members, expressions, statements, etc.).
The parser uses methods on
TreeElementFactory to create nodes in the tree (ReSharper also, and perhaps confusingly, uses the term "element" when describing node instances in the tree. That is, implementations of
ITreeNode are referred to as elements, while the interfaces use "node").
When creating an interior node, the parser passes the node type (a singleton instance derived from
CompositeNodeType) in a call to
CreateCompositeElement. This in turn calls
CompositeNodeType.Create, and the derived instance of
CompositeNodeType creates the
ITreeNode instance and returns it, ready to be added to the tree.
To create a leaf node in the tree, the parser uses the
TreeElementFactory.CreateLeafElement, passing in the token node type from the lexer. This also produces an
ITreeNode, which is then added to the tree.
If there is a syntax error in the file, the parser adds a special
ITreeNode instance that implements
IErrorElement, and tries to recover with the next recognised lexer tokens. This means the whole file is always parsed, and a valid tree is always created, even if it contains errors.
The parser implements the
IParser interface, and is usually generated from a
.psi file, which is a proprietary file format the describes the grammar of the language being parsed. The ReSharper SDK ships with a tool to convert the
.psi file to a generated class that implements these rules. Alternatively, a parser can be hand written, although it is still useful to use a
.psi file to generate the tree node classes and interfaces.
A hand written parser is useful if the language is extremely simple, or if the language is too complex to be represented in a
ReSharper supports incremental reparsing, whereby only the affected sub-tree is replaced, rather than having to reparse the whole file. This requires special block nodes that know how to reparse their own content.