ReSharper Platform SDK Help

Using CsLex

The SDK ships with a tool called CsLex, which is a C# utility for building lexers. It takes a specially formatted input file and generates a C# class that is then compiled into the project. The generated lexer is very efficient, implemented as a set of lookup decision tables that encode the regular expression rules describing how to match each token.

The version of CsLex in the SDK is based on the version by Brad Merrill, with some modifications, most notably to add support for Unicode text.

This document is not intended to be a full guide to using CsLex. The original documentation is a very good introduction to the file format and how to implement a lexer, and should be considered essential reading before continuing with the version in the ReSharper SDK.

Using CsLex in the SDK

CsLex expects an input file that describes how to match tokens. This file includes some user code, CsLex directives and regular expression rules. See the official documentation for more details.

The ReSharper SDK automatically includes a .targets file that will set up a project to use CsLex.

The input file should be named after the language being analysed, with a .lex suffix, e.g. css.lex. This file should be added to a C# project, and have its Build Action set to CsLex.

When the project is built, the CsLex targets will invoke the CsLex utility with the input file and generate a C# file based on the file name, replacing .lex with _lex.cs, e.g. css.lex will generate css_lex.cs.

This C# file is automatically added to the list of files being compiled, which allows the project to compile correctly without the file being added to the .csproj file. It is recommended to add the _lex.cs file to the project, so that ReSharper can resolve symbols used in the file in surrounding code.

The CsLex utility will also create a file with a _lex.depends suffix. This file does not need to be added to the project, as it simply lists any files that are included into the .lex file, and is recreated on each run. It is used for incremental building, and up-to-date checks - the lexer only needs to be rebuilt if any of its dependencies (or the input file itself) have been modified.

Neither the _lex.cs nor the _lex.depends files need to be added to source control - they are both recreated whenever the lexer class is regenerated. However, adding them to source control will not cause issues.

The _lex.cs file gets generated as a partial class, which is used to provide a non-generated (but mostly boilerplate) implementation of ILexer or IIncrementalLexer.

Creating a lexer for ReSharper

A CsLex input file is made up of three sections - user code, CsLex directives and regular expression rules.

User code

When creating a lexer for ReSharper, the user code is typically just using statements, which can easily be worked out by looking at a generated _lex.cs file and seeing what is missing. Something like the following is required, although the exact contents will depend on the custom language being implemented:

using System;
using System.Collections;
using JetBrains.Util;
using JetBrains.Text;
using JetBrains.ReSharper.Psi;
using JetBrains.ReSharper.Psi.Parsing;
using JetBrains.ReSharper.Psi.ExtensionsAPI.Tree;


Directives

The CsLex directives should be similar to the following:

%unicode

%init{
  myCurrentTokenType = null;
%init}

%namespace MyCustomLanguage.Psi.Parsing
%class MyLexerGenerated
%public
%implements IIncrementalLexer
%function _locateToken
%virtual
%type TokenNodeType

%eofval{
  myCurrentTokenType = null;
  return myCurrentTokenType;
%eofval}

%include Unicode.lex

The directives shown above are typical of the values required for a ReSharper lexer:

  • %unicode - a switch to indicate that CsLex should generate lookup tables for all Unicode characters, and not just 8-bit character sets.

  • %init - a block that gets copied verbatim into the generated lexer's constructor. The code here initialises a field called myCurrentTokenType to null. This field is not generated by CsLex, but specified in the non-generated partial class.

  • %namespace - defines the namespace used for the generated lexer class. A typical namespace for a lexer would end .Psi.Parsing.

  • %class - specifies the name of the generated lexer class. Typically this is the language name followed by LexerGenerated, e.g. CssLexerGenerated.

  • %public - causes the class to be generated as public. This is necessary so that the lexer can be instantiated from outside of the defining assembly.

  • %implements - defines an interface or base class that the generated class will inherit/implement. Typically this will be IIncrementalLexer or ILexer. Strictly speaking, because the class is a partial class, the interface could be defined in the non-generated partial file.

  • %function - defines the name of the tokenizing function. By default this is yylex, here it is replaced with _locateToken. This name is used as it is the implementation of a method called LocateToken in the non-generated partial class. See below for more details.

  • %virtual - causes the tokenizing function to be declared as virtual, so it can be overridden in a deriving class. Not strictly necessary for all lexers.

  • %type - declares the return type of the tokenizing function. This should be TokenNodeType.

  • %eofval - a block that gets copied verbatim into the lexer, and is executed when the lexer reaches the end of the input file. Here it resets the myCurrentTokenType field to null, and returns null.

  • %include - includes a file. The contents of that file are treated as though they were always part of the input file, at that location. The parameter of the %include statement is a filename that can be a fully qualified or relative path. If it is a relative path, it is first checked against the location of the input file, and if not found, then checked against the location of the CsLex utility itself. In this way, %include Unicode.lex will find the Unicode.lex file that ships with the SDK and provides regular expressions for various Unicode character classes. See below for details on this file.

Custom languages can also add other directives, as per the CsLex documentation, and of course will define macros to use in the regular expression rules, and the states in which those rules are valid.
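For example, a whitespace macro might be defined in terms of the Unicode character classes from Unicode.lex (the macro names here are illustrative, not part of the SDK):

```
WHITE_SPACE_CHAR=[{UNICODE_ZS}\u0009\u000A\u000B\u000C\u000D]
WHITE_SPACE={WHITE_SPACE_CHAR}+
```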


Note the use of the UNICODE_ZS macro from the Unicode.lex file (see below), and that the escape characters for tab, etc. are declared as 16-bit Unicode values.


Rules

The final section is a set of rules, each defined by three things: a state, a regular expression and an action. An example is:

<YYINITIAL> {WHITE_SPACE} { myCurrentTokenType = CssTokenType.WHITE_SPACE; return myCurrentTokenType; }

  • <YYINITIAL> is the state in which the rule can be matched. If the lexer does not define any custom states, this will always be <YYINITIAL>. Note that in the original CsLex, the state is optional, and the rule is matched in all states. This is not true for the version that ships with the SDK - the state is required.

  • {WHITE_SPACE} is the regular expression that should be matched. A name inside braces, as shown here, is a macro expansion.

  • The rest of the line is the action that will be invoked when the rule matches. It should be wrapped in braces, and is copied verbatim into the generated lexer C# class.

The typical implementation for an action is to set the myCurrentTokenType field to the appropriate token node type, and then return the same value. This will both set the current token type for anything that needs to check it, but also return it to the calling method.

Some actions might want to change the lexer state, which they can do with the yybegin method, such as yybegin(YY_MYNEWSTATE).
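For example, a lexer might declare a custom state in the directives section with %state, and switch into and out of it from rule actions. This is only a sketch - the state name, token type and comment delimiters are illustrative:

```
%state YY_IN_COMMENT

<YYINITIAL> "/*" { yybegin(YY_IN_COMMENT); myCurrentTokenType = CssTokenType.COMMENT; return myCurrentTokenType; }
<YY_IN_COMMENT> "*/" { yybegin(YYINITIAL); myCurrentTokenType = CssTokenType.COMMENT; return myCurrentTokenType; }
```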

The built-in ReSharper lexers tend to follow a slightly different pattern for the actions:

<YYINITIAL> {WHITE_SPACE} { myCurrentTokenType = makeToken(CssTokenType.WHITE_SPACE); return myCurrentTokenType; }

This passes the chosen token node type to the makeToken method (note the lowercase 'm'; there is no particular reason for it), which is defined in the partial file implementation, giving that implementation a chance to modify the token before it is actually used.

Some lexers will implement makeToken like this:

private TokenNodeType makeToken(TokenNodeType type) { return myCurrentTokenType = type; }

This assigns the given token node type to the myCurrentTokenType field, and simply returns it. This is frequently unnecessary, as the rule action also assigns the value to the myCurrentTokenType, as does the method that calls the tokenizing function. A simple return statement is often enough for lexers.
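In that case, a rule action can simply return the chosen token node type, because the method that calls the tokenizing function assigns the returned value to the current token type field. A minimal action might look like this (the token type is illustrative):

```
<YYINITIAL> {WHITE_SPACE} { return CssTokenType.WHITE_SPACE; }
```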

Partial file implementation

CsLex will generate a lexer that declares that it implements a given interface, such as IIncrementalLexer or ILexer. However, it doesn't actually provide an implementation of those interface members. The version of CsLex that ships in the SDK will generate a partial class, which allows extra type members to be defined in a separate .cs file. Typically, this is fairly boilerplate code that references the yy_ variables internal to the lexer:

public partial class MyLexerGenerated
{
  private TokenNodeType myCurrentTokenType = null;

  public void Start()
  {
    Start(0, yy_buffer.Length, YYINITIAL);
  }

  public void Start(int startOffset, int endOffset, uint state)
  {
    yy_buffer_index = startOffset;
    yy_buffer_start = startOffset;
    yy_buffer_end = startOffset;
    yy_eof_pos = endOffset;
    yy_lexical_state = (int) state;
    myCurrentTokenType = null;
  }

  public void Advance()
  {
    myCurrentTokenType = null;
    LocateToken();
  }

  public object CurrentPosition
  {
    get
    {
      TokenPosition tokenPosition;
      tokenPosition.CurrentTokenType = myCurrentTokenType;
      tokenPosition.YyBufferIndex = yy_buffer_index;
      tokenPosition.YyBufferStart = yy_buffer_start;
      tokenPosition.YyBufferEnd = yy_buffer_end;
      tokenPosition.YyLexicalState = yy_lexical_state;
      return tokenPosition;
    }
    set
    {
      var tokenPosition = (TokenPosition) value;
      myCurrentTokenType = tokenPosition.CurrentTokenType;
      yy_buffer_index = tokenPosition.YyBufferIndex;
      yy_buffer_start = tokenPosition.YyBufferStart;
      yy_buffer_end = tokenPosition.YyBufferEnd;
      yy_lexical_state = tokenPosition.YyLexicalState;
    }
  }

  public TokenNodeType TokenType
  {
    get
    {
      LocateToken();
      return myCurrentTokenType;
    }
  }

  public int TokenStart
  {
    get
    {
      LocateToken();
      return yy_buffer_start;
    }
  }

  public int TokenEnd
  {
    get
    {
      LocateToken();
      return yy_buffer_end;
    }
  }

  public IBuffer Buffer { get { return yy_buffer; } }
  public uint LexerStateEx { get { return (uint) yy_lexical_state; } }
  public int LexemIndent { get { return 7; } }
  public int EOFPos { get { return yy_eof_pos; } }

  private void LocateToken()
  {
    if (myCurrentTokenType == null)
      myCurrentTokenType = _locateToken();
  }

  private TokenNodeType makeToken(TokenNodeType type)
  {
    return myCurrentTokenType = type;
  }
}

Most of these methods and properties are self explanatory, but it's worth looking at a few in more detail.

All of the properties call LocateToken to ensure that there is a current token. If there isn't, LocateToken will call the lexer's tokenizing function (often created with the custom name of _locateToken, as specified in the directives). The tokenizing function will find the next token, and return it. Some of the built-in lexers will also set myCurrentTokenType in the rule actions, or indirectly by calling makeToken from the rule action. This isn't necessary, as the LocateToken method will update the field.

The ILexer.Advance method first sets the current token type to null and then calls LocateToken. Clearing the current token ensures that _locateToken is called, so that the next token is found and the current position in the file is updated.

The CurrentPosition property creates an instance of TokenPosition that includes all of the internal lexer variables (start position, end position, current index and state, as well as the current token type). It will save and restore state such that lookahead can work. The IIncrementalLexer.Start method does a similar thing, resetting the internal variables based on the parameters to the method.
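For example, a consumer of the lexer can use CurrentPosition to peek at an upcoming token and then rewind. This is a sketch, assuming lexer is an instance of the generated lexer class:

```csharp
// Save the full lexer state (offsets, lexical state, current token).
var saved = lexer.CurrentPosition;

lexer.Advance();                 // move to the next token
var lookahead = lexer.TokenType; // inspect it

lexer.CurrentPosition = saved;   // restore - lexing resumes from the saved token
```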

The LexemIndent property returns the magic number 7. This value is used during incremental lexing to give 7 characters of context when resuming lexing.


Unicode.lex

The CsLex utility ships with a file called Unicode.lex, which can be included into a lexer input file using the %include Unicode.lex directive. It provides a set of macros that expand to character classes matching known Unicode character categories. These can be very useful for creating rules that understand Unicode, and don't have issues with Unicode-based whitespace.

UNICODE_CN=[\u0370-\u0373...]
UNICODE_LU=[\u0041-\u005A...]
UNICODE_LL=[\u0061-\u007A...]
UNICODE_LT=[\u01C5\u01C8...]
UNICODE_LM=[\u02B0-\u02C1...]
UNICODE_LO=[\u01BB\u01C0...]
UNICODE_MN=[\u0300-\u036F...]
UNICODE_ME=[\u0488-\u0489\u06DE\u20DD-\u20E0\u20E2-\u20E4]
UNICODE_MC=[\u0903\u093E...]
UNICODE_ND=[\u0030-\u0039...]
UNICODE_NL=[\u16EE-\u16F0\u2160-\u2182\u3007\u3021-\u3029\u3038-\u303A]
UNICODE_NO=[\u00B2-\u00B3...]
UNICODE_ZS=[\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
UNICODE_ZL=[\u2028]
UNICODE_ZP=[\u2029]
UNICODE_CC=[\u0000-\u001F\u007F-\u009F]
UNICODE_CF=[\u00AD\u0600...]
UNICODE_CO=[\uE000-\uF8FF]
UNICODE_CS=[\uD800-\uDFFF]
UNICODE_PD=[\u002D\u058A...]
UNICODE_PS=[\u0028\u005B...]
UNICODE_PE=[\u0029\u005D...]
UNICODE_PC=[\u005F\u203F...]
UNICODE_PO=[\u0021-\u0023...]
UNICODE_SM=[\u002B\u003C...]
UNICODE_SC=[\u0024\u00A2...]
UNICODE_SK=[\u005E\u0060...]
UNICODE_SO=[\u00A6-\u00A7...]
UNICODE_PI=[\u00AB\u2018...]
UNICODE_PF=[\u00BB\u2019...]

Alternatives to CsLex

CsLex is intended for C# projects - it generates a C# file. ReSharper does not provide any utilities for generating lexers in any other language. However, as long as the lexer implements ILexer or IIncrementalLexer, and returns tokens that are singleton instances of TokenNodeType, then the implementation can be in any language, or using any tool. The lexer interfaces can be implemented as partial classes, or in standalone classes that delegate to the generated lexer.
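As a sketch of the delegating approach, a standalone class might implement ILexer by forwarding each member to a hand-written tokenizer. Here, InnerTokenizer and all of its members are hypothetical, used only to illustrate the shape of the wrapper:

```csharp
// Sketch: ILexer implemented as a standalone wrapper rather than a partial class.
// InnerTokenizer is a hypothetical hand-written tokenizer, not an SDK type.
public class MyDelegatingLexer : ILexer
{
  private readonly InnerTokenizer myTokenizer;

  public MyDelegatingLexer(IBuffer buffer)
  {
    myTokenizer = new InnerTokenizer(buffer);
  }

  public void Start() { myTokenizer.Reset(); }
  public void Advance() { myTokenizer.MoveNext(); }

  // Save/restore an opaque position object, so callers can look ahead and rewind.
  public object CurrentPosition
  {
    get { return myTokenizer.SavePosition(); }
    set { myTokenizer.RestorePosition(value); }
  }

  public TokenNodeType TokenType { get { return myTokenizer.CurrentToken; } }
  public int TokenStart { get { return myTokenizer.CurrentStartOffset; } }
  public int TokenEnd { get { return myTokenizer.CurrentEndOffset; } }
  public IBuffer Buffer { get { return myTokenizer.Buffer; } }
}
```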

Modifications to the original CsLex

The version of CsLex that ships with the SDK has a number of differences from the original version. These include:

  • Support for Unicode, by specifying the %unicode directive. This generates lookup tables for all Unicode characters, rather than just 8-bit character sets, and also handles Unicode escape characters.

  • Added a preprocessor, using the %include directive. The content of the included file is inserted into the input file text before it is processed, and treated as though it were always part of the file, at the location of the %include directive. The specified file name can be a rooted path name, or relative. If relative, it is first resolved against the input file, and if not found, then against the location of the CsLex utility. This allows for loading the Unicode.lex file that ships with the SDK.

  • The generated class is now partial.

  • Added the %private directive. The class will be declared as internal, and the generated constructors will be private. Creating an instance of the class is handled by methods in the non-generated partial class. The %private directive takes precedence over %public.

  • Added the %virtual directive, which makes the main lexing function virtual.

  • ReSharper's text buffers (IBuffer) are used instead of System.IO.TextReader and a char[].

  • The actions for each matching rule are now implemented in a switch statement rather than an array of delegates, for memory and performance reasons.

  • The %char and %line directives that enable character and line counting have been removed. This means the yychar and yyline variables are not available.

  • Each state is written as a protected variable, rather than private.

  • Some implementation methods are no longer available, such as yy_advance, yy_mark_start and yy_mark_end.

  • Minor changes to fix small issues with the code.

  • A standard This code was generated by a tool header is added to indicate that the file is generated, causing ReSharper to not analyse the contents.

CsLex is Copyright 2000 by Brad Merrill

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice and warranty disclaimer appear in supporting documentation, and that the name of the authors or their employers not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

The authors and their employers disclaim all warranties with regard to this software, including all implied warranties of merchantability and fitness. In no event shall the authors or their employers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.

Last modified: 04 July 2023