Regular expressions

Regular expressions language overview

Introduction

The regular expressions language, which greatly simplifies text manipulation, is one of the most widely used domain-specific languages today. Almost every developer has used it at least once. Some languages, like Perl and Python, have built-in support for it; others, like Java, use it through libraries. Java, the language we use to implement MPS, does not have language-level support for regular expressions, so it was natural for us to implement a DSL for them and use it instead of a regular expression library. This language is a good example of an MPS language. Having read this introduction, you will be able to understand how to create and use languages in MPS.

Examples

We assume you have MPS already installed.
This document uses many examples. You can find them in a regular expression language project (%MPS_HOME%/platform/regexp) under the jetbrains.mps.regexp.examples solution:

worddav54fd8999f157b75ecce5d51ca9bff681 png

Language overview

Let's take a look at a simple regular expression application. Suppose we want to get a user name and domain name from an email address. Here is code that prints out a user name and a domain name by analyzing an email address with a regular expression (you can find this example in the EmailExample class):

worddav6357b7cf6d2db641dc49ad95d2b20ce6 png

The regular expression used in this match regexp statement does the following. First, it reads one or more word characters (\w+) and saves them in a "user" variable. After that, it reads the "@" character. Then it reads a list of words separated by periods (the "." character) and saves it in a "domain" variable (\w+(\.\w+)*). If a match is found, the program prints out user and domain to System.out.
Here is a syntax tree for this example's regular expression:

worddave83ccb1d63a90d30c3ac013f61241197 png
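For comparison, here is a rough sketch of how the same email example might look in plain Java with the java.util.regex library. The class name EmailExampleSketch and the parse helper are assumptions made for illustration; this is not the code that MPS generates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExampleSketch {
    // Named groups play the role of the "user" and "domain" match variables.
    private static final Pattern EMAIL =
            Pattern.compile("(?<user>\\w+)@(?<domain>\\w+(\\.\\w+)*)");

    // Returns {user, domain}, or null when the whole string is not an email address.
    static String[] parse(String email) {
        Matcher m = EMAIL.matcher(email);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] parts = parse("jdoe@mail.example.com");
        if (parts != null) {
            System.out.println("user: " + parts[0]);
            System.out.println("domain: " + parts[1]);
        }
    }
}
```

Note how the library version needs escaped backslashes and string-based group lookups, which is the kind of noise a DSL is meant to remove.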

Language structure

When we create a language in MPS, we usually start by defining its abstract syntax. Abstract syntax in MPS is called language structure. To define it, we use the structure language. The structure language is to MPS languages what XML Schema is to XML, or what DDL is to SQL. Let's take a look at the regular expressions language structure.

Overview

The MPS regular expressions language contains several parts:

Regular expressions: concepts used to specify regular expressions. They include concepts for string literals, symbol classes, and "or" and "sequence" regular expressions.
BaseLanguage integration: BaseLanguage is a Java-like language used internally by MPS as a target language for generators. This part includes concepts used to embed regular-expressions-related code into BaseLanguage. For example, it includes MatchStatement, ReplaceStatement, and SplitStatement.
Regular expressions library support. When we work with regular expressions, we want to reuse them, and so we created special concepts for this task.

Regular expressions

All regular expression concepts in our language are placed in the "Regexp" folder of its structure model:

worddavc3eea13ac43a8310f50d2743e3c07435 png

Let's consider them in detail. We have a single base concept for all of them: Regexp:

worddavb1c8d5e9a555626d4a7c6f326b4391ba png

It is derived from the BaseConcept concept, from which all MPS concepts are derived. This concept also has the abstract concept property, which means that it exists to form a concept hierarchy, not to be used directly in the language to define regular expressions. It is similar to the 'abstract' modifier on Java classes.
Let's consider the concepts that are derived from it. You can see them in the hierarchy view, which opens when you press Ctrl + H on the concept declaration. For the Regexp concept, we will see the following:

worddav641bbd1e055f8a1257d7e09e8545a4b5 png

StringRegexp represents an arbitrary string which can be matched against text (you can find all the examples of regular expressions that we consider in this section in the Regexps root node):

worddav60b32b2dae9e635eaec0472fce900e33 png

Let's take a look at its concept declaration (you can quickly navigate to a concept declaration by pressing Ctrl + Shift + S when an instance of a concept is selected in an editor):

worddavc394a4e587c8986c177d7a7e08981894 png

In its declaration, we see a property text with a string type, which is used to store text that will be shown in the editor. This concept also declares a concept property "alias." Concept properties differ from simple properties. Simple properties correspond to Java instance fields, and concept properties correspond to Java static fields. The value of the concept property alias will be shown in the completion menu when we press Ctrl + Space:

worddavf9496c8f18b0eb7b37640d05e82a8bf6 png
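The analogy with Java fields can be sketched as follows. The class StringRegexpSketch and the alias value are hypothetical and only illustrate the difference between per-instance and per-concept data; they are not taken from the MPS sources.

```java
class StringRegexpSketch {
    // Analogue of a concept property: one value shared by all instances.
    // The alias text here is an assumption made for illustration.
    static final String ALIAS = "string regexp";

    // Analogue of a simple property: every instance stores its own value.
    final String text;

    StringRegexpSketch(String text) {
        this.text = text;
    }

    public static void main(String[] args) {
        StringRegexpSketch first = new StringRegexpSketch("abc");
        StringRegexpSketch second = new StringRegexpSketch("xyz");
        // The alias is the same for both, the text differs.
        System.out.println(ALIAS + ": " + first.text);
        System.out.println(ALIAS + ": " + second.text);
    }
}
```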

Binary regular expressions combine two regular expressions into one. The BinaryRegexp concept is declared as abstract and has two concrete subconcepts: OrRegexp and SeqRegexp. Here are examples of their instances:

worddav293aaafcd17556c2b1f82ba50ef99148 png

worddava8acebee034eb28ab80b57e485b5aebd png

Here is the BinaryRegexp concept declaration:

worddav37de03cf815e63b6dab3fbf5daf43f86 png

It defines two links: one to store the left part and another to store the right part. The word 'aggregation' means that the regular expression under this link will be a part of the declared concept instance. That is, if we look at the syntax tree, we will see a child regular expression under the parent BinaryRegexp:

worddav4ad0a0fbd213cbdb7725a1851aaf03bf png
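As a loose Java analogy, the concept hierarchy and the aggregation links described above might be modeled like this. The nested classes reuse the concept names for readability, but the toPattern method and the whole class layout are assumptions for illustration, not how MPS stores nodes.

```java
import java.util.regex.Pattern;

class RegexpTreeSketch {
    // Abstract base: exists only to form the hierarchy, like the abstract Regexp concept.
    static abstract class Regexp {
        abstract String toPattern();
    }

    static class StringRegexp extends Regexp {
        final String text; // simple property
        StringRegexp(String text) { this.text = text; }
        @Override String toPattern() { return Pattern.quote(text); }
    }

    // Aggregation: the left and right children are owned by the parent node.
    static abstract class BinaryRegexp extends Regexp {
        final Regexp left;
        final Regexp right;
        BinaryRegexp(Regexp left, Regexp right) { this.left = left; this.right = right; }
    }

    static class OrRegexp extends BinaryRegexp {
        OrRegexp(Regexp left, Regexp right) { super(left, right); }
        @Override String toPattern() {
            return "(?:" + left.toPattern() + "|" + right.toPattern() + ")";
        }
    }

    static class SeqRegexp extends BinaryRegexp {
        SeqRegexp(Regexp left, Regexp right) { super(left, right); }
        @Override String toPattern() { return left.toPattern() + right.toPattern(); }
    }

    public static void main(String[] args) {
        // ("cat" | "dog") followed by "s"
        Regexp tree = new SeqRegexp(
                new OrRegexp(new StringRegexp("cat"), new StringRegexp("dog")),
                new StringRegexp("s"));
        System.out.println(tree.toPattern());
        System.out.println(Pattern.matches(tree.toPattern(), "dogs"));
    }
}
```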

Dot regexp represents a regexp which matches any character. LineEndRegexp matches only at the end of a line. LineStartRegexp matches only at the start of a line. ParensRegexp is used to group other regular expressions in order to make the enclosing regular expression more readable.

worddav24e21f44c149abcb3659dfd9d06e62aa png

worddav642ced80a263561d8c469346025b4836 png

worddav94705c1fc94ecb5145b425d14f42afcf png

Many sets of symbols are used often but are quite verbose to enter. So we have character classes that make it possible to enter [A-Z] instead of (A|B|...|Z). There are two kinds of them: positive and negative. Both extend the abstract SymbolClassRegexp concept:

worddav802b499d6a7435f7a89accadba5698b3 png

worddava11126417379b10b2d63b7b458305019 png
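The equivalence between a character class and a long alternation, and the difference between positive and negative classes, can be checked with java.util.regex. This is a small sketch using ordinary Java regex syntax; the class name SymbolClassSketch is an assumption.

```java
import java.util.regex.Pattern;

class SymbolClassSketch {
    // True when the whole input matches the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // A positive class is shorthand for an alternation of its members.
        System.out.println(matches("[A-C]", "B"));   // true
        System.out.println(matches("(A|B|C)", "B")); // true
        // A negative class matches any character that is NOT listed.
        System.out.println(matches("[^A-C]", "B"));  // false
        System.out.println(matches("[^A-C]", "x"));  // true
    }
}
```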

Many of these character classes are used in several places, so they can be referenced in a simpler way with PredefinedSymbolClassRegexp. Instead of [a-zA-Z_0-9] we can write "\w":

worddav133f90af8072c5fb9dfe57c7158c29a9 png

This concept is declared in the following way:

worddav622ddc5598027e3c18f2fbd3f4bf8d24 png

Here we have the symbolClass link declaration, which has a reference stereotype (aggregation, which we mentioned above, is also a link stereotype). The reference stereotype means that an instance of this concept won't contain the referenced node as a child. Instead, the referenced node can be stored anywhere in the model.
We also have a number of unary regular expressions, which are derived from the abstract concept UnaryRegexp. They include +, * and other regexp operations:

worddav302fe26fe89dcd25ad86915228449271 png

worddav196774bebcf4832840530e8a598b777a png

When we work with text, it is often useful to remember a match and reference it later. To facilitate this task, we have MatchParensRegexp, which remembers the string it matches, and MatchVariableReferenceRegexp, which references a string matched before. The following code matches a pair of identical XML tags with text between them:

worddav480b88108c470d3412e1c3ca338bf95d png
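In plain Java, the same idea can be sketched with a named group and a back-reference. The pattern below is an approximation written for this illustration (it only handles simple tags without attributes or nesting), and the class name TagPairSketch is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPairSketch {
    // (?<tag>...) remembers the tag name; \k<tag> requires the same text again.
    private static final Pattern TAG_PAIR =
            Pattern.compile("<(?<tag>\\w+)>(?<text>[^<]*)</\\k<tag>>");

    // Returns the text between a matching pair of tags, or null if there is no match.
    static String innerText(String input) {
        Matcher m = TAG_PAIR.matcher(input);
        return m.matches() ? m.group("text") : null;
    }

    public static void main(String[] args) {
        System.out.println(innerText("<b>hello</b>")); // hello
        System.out.println(innerText("<b>hello</i>")); // null
    }
}
```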

BaseLanguage integration

Regular expressions are of little use if they can't be integrated into BaseLanguage code. So in the regular expressions language we have special concepts which make it possible to write regular-expression-related constructs in a program written in BaseLanguage.
If you want to add new constructs to BaseLanguage, you usually extend either the Expression or the Statement concept from BaseLanguage. The Expression concept represents expressions like "1+2" and "a == b". The Statement concept represents control structures like "if() { }" and "while() { }". In the regular expressions language we create both new expressions and new statements.
Let's first take a look at the statements and then at the expressions:
MatchRegexpStatement is used when you want to check whether a specified string matches a regular expression (you can find the examples for this section in the BaseLanguageIntegration class in the jetbrains.mps.regexp.examples model):

worddav7ca0d21a461360dc9e35c6d2732b1467 png

We have an interesting feature here: you can reference named matches in the MatchRegexpStatement block. These match variables also work in the other statements defined in the regular expressions language.
FindMatchStatement checks whether a specified string contains a match for a specified regular expression. It is similar to MatchRegexpStatement.

worddav49147cbc3846c222907f8ab55c1ed548 png

ForEachMatchStatement allows you to iterate over all matches of a specified regular expression in a specified string:

worddavb00b56a3a457bcfa3bd5c194d9f07a2a png
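The difference between the three statements roughly mirrors three ways of using a java.util.regex.Matcher. The sketch below assumes that the match statement corresponds to matching the whole string, the find statement to searching for a first occurrence, and the for-each statement to looping over all occurrences; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKindsSketch {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Whole-string match, in the spirit of MatchRegexpStatement.
    static boolean matchesWhole(String s) {
        return NUMBER.matcher(s).matches();
    }

    // Substring search, in the spirit of FindMatchStatement.
    static boolean containsMatch(String s) {
        return NUMBER.matcher(s).find();
    }

    // Iteration over every match, in the spirit of ForEachMatchStatement.
    static List<String> allMatches(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("a123"));
        System.out.println(containsMatch("a123"));
        System.out.println(allMatches("a1b22c333"));
    }
}
```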

When we work with a string, we often want to replace all matches of a regular expression with a specified text. In the regular expressions language you can do this with the help of ReplaceWithRegexpExpression:

worddav0324d85e995d77bce18cb1796cbb79a9 png

It is also often practical to split a string with some regular expression. For example, to extract parts of a string which are separated by one or more whitespace symbols we can write this SplitExpression:

worddav8693e20963fbd910cdc6f17e65500d23 png
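For reference, the library-based Java counterparts of replacing and splitting look roughly like this. The helper names collapseSpaces and words are assumptions chosen for the illustration.

```java
import java.util.Arrays;

class ReplaceSplitSketch {
    // Replaces every run of whitespace with a single space.
    static String collapseSpaces(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Splits on one or more whitespace symbols; trim() avoids an empty first element.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("a   b\tc"));
        System.out.println(Arrays.toString(words("  one two   three ")));
    }
}
```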

When we reference a match in a block, the MatchVariableReference concept is used. It is also derived from the Expression concept.

Library support

When we work with regular expressions, we want to use some of them in many places. To define these reusable regular expressions, we have a special concept – Regexps. It contains zero or more named regular expressions:

worddavb1831b4d7d4be636b5f539589a7abd8c png
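In library-based Java code, a comparable effect is usually achieved with a class of named, precompiled patterns. The following sketch is only an analogy; the pattern names and their definitions are assumptions and do not come from the MPS regexp library.

```java
import java.util.regex.Pattern;

class RegexpsLibrarySketch {
    // Each constant is a named, reusable regular expression.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_]\\w*");
    static final Pattern INTEGER = Pattern.compile("-?\\d+");

    public static void main(String[] args) {
        System.out.println(IDENTIFIER.matcher("user_name1").matches());
        System.out.println(INTEGER.matcher("-42").matches());
    }
}
```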

Accessory models

In many languages we have the following problem: we have a lot of very similar entities, which can be used in any model that is written with this language (like predefined symbol class regular expression). We could create a concept for every such entity. But MPS has a better solution: you can create a special model, called an accessory model, and declare all these entities in it with your language.
We have the PredefinedSymbolClass concept which is used to declare a symbol class. Also, we have the PredefinedSymbolClasses container concept, which contains these symbol classes. If you look into the accessory model of the regular expressions language, you will see this:

worddav26add762091c9a627c43395111298b8a png

Editor

After defining the concept structure, we usually create an editor for it. To accomplish this task, we use the editor language. It is quite straightforward to use, so let's consider its most common constructs.
All editor-related code is placed in an editor model. You can find it under a language node in a project tree:

worddavc1136b6cec7f424fb0462579c99c0799 png

Here is an editor of StringLiteralRegexp:

worddav9af071cf04fc04a21f5fda7ba65302a4 png

It contains a horizontal collection, the container which you might use to group other constructs inside it, and {text}, which is used to include an editor for an instance property.
Here is an editor for MatchVariableReferenceRegexp:

worddavde451156ae66510919af6563b7264e39 png

It also consists of a horizontal collection, but this time we have a richer set of constructs inside it. "(ref" and ")" are constants, which always contain the same text. "%match%->{name}" is used to reference the property "name" of the match link's target.
Here is an editor for Regexps:

worddav79e6df4b7ef387a833fa176607b56cd2 png

It contains a vertical collection with nested horizontal collections. Also, it contains a "(> %regexp% <)" construct. It is used to include editors for all the nodes in the role "regexp".

Scopes

After declaring references in the structure, we get default substitute menus for them. These default menus include all the nodes of the reference type in the current model and all of its imported models. Sometimes this works, but sometimes we have to narrow down the scope of these menus. For example, if you have a lot of match variables named "name" in different parts of a model, it's a good idea to follow the Java scoping rules for these variables. To handle this task, we use scopes from the constraints language.
Scopes are placed in a constraints model under a language node:

worddavaeb22e88a30788c4c39df5434c19093d png

Let's consider a scope for MatchVariableReference:

worddavc378d6ee860a2581a4e9dbdc23fae9d8 png

A scope consists of a referent set handler, a scope condition (labeled "can create"), and a scope constructor. Usually, only the scope constructor is specified. The scope constructor has to return an object that implements the ISearchScope interface. Usually, an instance of the class SimpleSearchScope is returned; it has a constructor which takes a list of nodes, i.e. we return the list of nodes which are visible in a specified place.
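The idea of returning only the nodes that are visible in a given place can be sketched in plain Java. The Node class and the visibleFrom method below are hypothetical stand-ins invented for this illustration; they are not the MPS node API and do not use ISearchScope or SimpleSearchScope.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ScopeSketch {
    // Hypothetical tree node that may declare match variables.
    static class Node {
        final Node parent;
        final List<String> matchVariables = new ArrayList<>();

        Node(Node parent, String... variables) {
            this.parent = parent;
            this.matchVariables.addAll(Arrays.asList(variables));
        }
    }

    // Collects variables declared by the node and by all of its ancestors,
    // which approximates Java-like block scoping.
    static List<String> visibleFrom(Node place) {
        List<String> visible = new ArrayList<>();
        for (Node n = place; n != null; n = n.parent) {
            visible.addAll(n.matchVariables);
        }
        return visible;
    }

    public static void main(String[] args) {
        Node outer = new Node(null, "user", "domain");
        Node inner = new Node(outer, "tag");
        System.out.println(visibleFrom(inner));
        System.out.println(visibleFrom(outer));
    }
}
```

Variables declared in unrelated parts of the model never appear in the result, which is the narrowing effect a scope is meant to provide.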

Actions

Default editors in MPS aren't very easy to use. To improve this default behavior, different constructs from the actions language and the editor language can be used.
When we enter code in a text-based language, we usually do it from left to right. We might start from "2", then enter "2+", and finally we might have "2+2". It is also possible to enter code in MPS in this way with the aid of a mechanism called 'right transform.'
To define a right transform action, you have to create a right transform actions root in the actions model and add some right transform actions to it. Like the constraints, editor and structure models, the actions model can be found under the language node in your project tree. Let's consider a right transform action from the regular expressions language which transforms a regular expression into a unary regular expression, that is, it transforms "a" into "a+", "a*", and so on:

worddavced409801e815e076620cb81669d28df png

Each right transform has an applicable concept, that is, the concept this action can be applied to. It also has a condition and, most importantly, a right transform menu. There are different types of right transform menus. The menu in the picture above adds one menu item for each non-abstract UnaryRegexp subconcept. The handler of this menu part transforms an expression into a unary expression.

Type System

Many languages have a type system. It allows you to check a model against it, and can be used to improve editing experience and simplify the generator. For example, if we know the type of a particular expression, we are able to calculate which methods can be applied to it. MPS has a special language for type systems, called HELGINS. In languages with a very simple structure, it's possible to live without it, but when we have a complex language or want to integrate with BaseLanguage, we have to create a type system, at least for BaseLanguage integration concepts.
In HELGINS, types are represented as MPS nodes. So, if you have a sublanguage for types, like BaseLanguage does, you can use it for type checking.
Let's consider a couple of rules from the regular expressions language.

worddavfbd103e572784a29f920f075b0b64b55 png

In this code we define a type called String (String here is an instance of ClassifierType from BaseLanguage, which is used in method parameter types, local variables and other places). To do so, we use the GIVETYPE statement.
Let's take a look at a more complex rule:

worddav1a274d8e26272ef9d52b6c7f7458bfff png

In this rule, we require that the expression we match against a regular expression in FindMatchStatement be a subtype of the String type. We do this by specifying a type equation. The sign ":<=:" denotes the subtype relation; the TYPEOF expression denotes the type of the expression in parentheses.
To calculate types, HELGINS uses a sophisticated algorithm which saves you a lot of time. You don't have to worry about the order in which types are calculated; all you have to do is to specify type equations in typing rules, and HELGINS will solve them for you.
Of course, the rules in our language are very simple, and if you want to know more about HELGINS, you have to take a look at rules in languages like BaseLanguage or the model language.

Generator

Almost any language created with MPS has a generator. Generators in MPS convert the high-level language code into code in a lower level language. The key component of a generator is its mapping configuration. It tells us what to do with a language.
Let's consider a mapping configuration of the regular expressions language:

worddavf77db0d024ca925203b56a14931eb487 png

It contains one mapping rule and several reduction rules. Each rule has an applicable concept; for each instance of this concept, the rule will be applied. Mapping rules create a new root node on each application. A reduction rule replaces a node to which it is applied with a new node. Each rule has an associated template used to create an output node.
Let's take a look at an instance of such a template:

worddavc1fd76f76c7cffb01f1f0e716bd309a7 png

Templates contain MPS code with macros and template fragments.
The code outside of a template fragment is not used during generation, and is used only to create a context for code inside a template fragment. For example, if we know that our code will be placed inside a method with a parameter named node, we might create a method with such a parameter around the template fragment. During generation, MPS will recognize your intention, and this variable will be automatically resolved.
Macros are used to specify variable parts of code. For example, the variable matcher in the picture above has a property macro on it. This property macro generates a unique name for this variable, so we will be able to use nested match blocks. MPS has different kinds of macros: different kinds of node macros, property macros, and reference macros. All of these concepts are declared in the jetbrains.mps.TLBase language.
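To see why unique names matter, consider what Java code for two nested match blocks might look like. The sketch below is hand-written under the assumption that each block gets its own matcher variable with a numeric suffix; the real generated code and naming scheme may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeneratedCodeSketch {
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\w+)");
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    static String describe(String input) {
        // Outer match block: its matcher gets the first unique name.
        Matcher matcher_0 = PAIR.matcher(input);
        if (matcher_0.matches()) {
            String key = matcher_0.group(1);
            // Nested match block: a second name avoids clashing with matcher_0,
            // which is still in scope here.
            Matcher matcher_1 = DIGITS.matcher(matcher_0.group(2));
            if (matcher_1.matches()) {
                return key + " is numeric";
            }
            return key + " is text";
        }
        return "no match";
    }

    public static void main(String[] args) {
        System.out.println(describe("port=8080"));
        System.out.println(describe("host=local"));
    }
}
```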

MPS 2020.2 Help
