Lexing HTML-Like Languages
Lexing an HTML-like language can be complicated - it must handle existing HTML syntax, but also provide support for new syntax, which might be new tags, new tag syntax (
%>), start block delimiters (Razor's
@, which has no closing delimiter) and even nested HTML inside a language construct (Razor again). And it shouldn't have to reimplement HTML.
ReSharper manages this using
HtmlCompoundIncrementalLexer<TLexer>. These are implementations of
IIncrementalLexer that switches between a lexer for the custom syntax and a lexer for standard HTML. The
TLexer generic parameter is the lexer for the custom syntax, and must implement the
Creating the compound lexer
The constructor looks like this:
bufferrepresents the text of the file to be lexed.
baseLexeris an instance of the lexer used to parse the custom HTML syntax, and is of the generic type
htmlLexerFactoryis a factory class that is used to create the standard HTML lexer.
Typically, this is called from the
CreateLexer method of a HTML-like language's
A new instance of the lexer for the custom syntax is passed in, as is a new instance of the standard
HtmlLexerFactory. Both classes take a parameter of
myTokenNodeTypes, which a per-language component that derives from
RAW_HTML property is used by the compound lexer to know when to switch between the custom lexer and the standard HTML lexer. See the HTML Token Node Types section for more details on this interface and implementing token node types.
Running the compound lexer
The compound lexer is used just like a normal implementation of
Advance() is called multiple times until the end of file.
The base (custom) lexer drives the lexing of the file - the compound lexer will initially use the tokens of the base lexer. However, if the base lexer encounters a
RAW_HTML token, the compound lexer will switch to the HTML lexer, and use the HTML tokens.
Initially, advance the base lexer.
Update the compound lexer's own state (
ILexerEx.LexerStateEx) by combining the states of the base leer, the HTML lexer and a flag to indicate if the HTML lexer is currently running. This state can be saved and restore for incremental lexing.
Compare the base lexer's current token to the custom language's
If different, use the base lexer's current token.
If the same, reset the (incremental) standard HTML lexer over the span of the
RAW_HTMLtoken and set the flag to indicate the HTML lexer is running.
On subsequent calls to
Advance, if the flag is set, advance the HTML lexer and use that token, otherwise, advance the base lexer.
Once the span of the
RAW_HTMLtoken has been lexed (by the HTML lexer), reset the HTML running flag and start advancing the base lexer again.
HtmlCompoundIncrementalLexer class is essentially the same as
HtmlCompoundLexer except that it also implements
IIncrementalLexer, allowing for incremental lexing. More details on incremental lexing can be found in the topics on Implementing a Lexer, Caching Lexers and incremental parsing.
Custom syntax lexer
The base lexer is intended to create tokens for the custom language syntax. It must implement
TokenTypes property returns the per-language component that implements
IHtmlCompoundTokenNodeTypes, so that the compound lexer can access the
RAW_HTML property, which is the token node type the custom lexer uses to indicate a block of raw HTML. This is a different token node type for each HTML-like custom language.
NStates property is the number of states supported by the custom lexer. In other words, this is the maximum value that will be set in
ILexerEx.LexerStateEx for the custom lexer. It is used when combining the base lexer and HTML lexer states for incremental lexing in the compound lexer.
If built with CsLex, this will be the number of states such as
YYINITIAL used in the lexer definition, and can be implemented like:
Tips for implementing the base lexer
The base lexer should create tokens for the custom language as normal, and should return
RAW_HTMLfor any chunks of HTML.
If the custom language's syntax is valid HTML, let the standard HTML lexer/parser handle it. For example, WebForms'
<asp:Content runat="server" ...is valid HTML, and is lexed and parsed as HTML.
An embedded HTML block should start with valid HTML syntax. The HTML lexer starts with empty state, so the first
RAW_HTMLblock must start with valid HTML.
The HTML lexer state is preserved across
RAW_HTMLblocks, so it is possible to create "nested" blocks of custom syntax inside HTML. For example:<div class="@MyClass" />
This could be implemented as a
RAW_HTMLtoken that spans from the start of the element up to the
<div class=". The HTML lexer will then return tokens for the opening bracket, the element, the attribute name, the equals, the quote symbol and then relinquish control to the base lexer, which will return tokens for the expression -
@MyClass. It can then return
RAW_HTMLfor the closing quotes and the closing element bracket -
" />. Because the HTML lexer state is preserved, HTML lexing will continue as expected, and the HTML tokens are correct.