ANTLR4 project with Maven – Tutorial (episode 3)

[Episode 1] [Episode 2] Now for something completely different. During the preparation of episode 3 I changed my mind and thought that the best way to approach the remaining issues (self embedding and AST) was to embrace a classic. So I decided to fall back to the good old arithmetic expressions, since they have proven to be the best didactic test bed for me. I developed a new project from scratch that you can find here on Github. It’s nothing more that an interpreter of arithmetical expressions, built using ANTLR4 and Maven (of course). The projects also contains an object model for an Abstract Syntax Tree (AST) that fits my (our) needs. Please keep in mind that the focus of this episode is on how to define an object model and build an AST for the language. I will take for granted a lot of things that do not fall into this topic. If anything is not clear, well, there is always a comment section… oh and by the way… Disclaimer: This is not a proposal for a best practice. This is just a sharing of a toy project that I made up because I could not find anything similar.

The (labeled) grammar

The grammar that will be used is just a minor variation of the example found here.

grammar Arithmetic;

program : expression ;

expression
	: expression ('*' | '/') expression #Multiplication
	| expression ('+' | '-') expression #AlgebraicSum
	| term #AtomicTerm;

term: realNumber #Number
	| '(' expression ')' #InnerExpression;

 realNumber : NUMBER ('.'NUMBER)?;

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

NUMBER : [0-9]+ ;

What I added is:

  1. Labels (those identifiers preceded by ‘#’);
  2. The missing operands (in the example there are only sum and multiplication).

Labels allow you to name a production out of a set of alternatives, so that you can discriminate among them in the visitors. Let’s take the rule for expression at line 5 as an example: it will produce a visitor interface that contains the following signatures, rather than just a single ‘visitExpression’ method:

        /**
	 * Visit a parse tree produced by the {@code AlgebraicSum}
	 * labeled alternative in {@link ArithmeticParser#expression}.
	 * @param ctx the parse tree
	 * @return the visitor result
	 */
	T visitAlgebraicSum(ArithmeticParser.AlgebraicSumContext ctx);
	/**
	 * Visit a parse tree produced by the {@code Multiplication}
	 * labeled alternative in {@link ArithmeticParser#expression}.
	 * @param ctx the parse tree
	 * @return the visitor result
	 */
	T visitMultiplication(ArithmeticParser.MultiplicationContext ctx);
	/**
	 * Visit a parse tree produced by the {@code AtomicTerm}
	 * labeled alternative in {@link ArithmeticParser#expression}.
	 * @param ctx the parse tree
	 * @return the visitor result
	 */
	T visitAtomicTerm(ArithmeticParser.AtomicTermContext ctx);

Remember that either you label all the alternatives in a production or none: ANTLR does not allow you to name only a few. The ‘expression‘ production introduces two more concepts: self embedding and left recursion. Self embedding happens when a symbol is capable of producing itself. In this case expression does this both directly (as in the AlgebraicSum and Multiplication alternatives) and indirectly (through the term production, with the alternative named InnerExpression). While self embedding is perfectly natural in a programming language (in fact you cannot express nested arithmetical expression without it) and it is, in fact, the characteristic that distinguishes context free languages from regular expressions, left recursion may be a problem for LL parser like the one we are going to build. With JavaCC, for example, you would not be allowed to write a production like expression : expression ‘+’ expression. ANTLR, on the other hand, is able to recognize and resolve direct left recursion. As a desirable consequence of the strategy adopted by ANTLR, the precedence of the productions (which means the resulting precedence of the arithmetical operators) is given by the order in which the alternatives are listed. For example, in our production the Multiplication alternative will have a higher precedence than AlgebraicSum and a string like:


1 + 2 * 3

will produce a parse tree that looks like this (edit — snapshot of the ANTLR plugin for Eclipse):

parsetree1

You have to be aware of this behavior, otherwise you could end up doing the error I did in the first version of my grammar. Initially I wrote the productions in the following manner:

/* Warning: BROKEN GRAMMAR! Do not do this */
expression
	: expression '+' expression #Sum
	| expression '-' expression #Difference
	| multiplicativeExp #Term;

multiplicativeExp
	: multiplicativeExp '*' multiplicativeExp #Multiplication
	| multiplicativeExp '/' multiplicativeExp #Division
	| NUMBER ('.'NUMBER)? #Number
	| '(' expression ')' #InnerExpression;

In this version Sum has a higher precedence than Difference, and Multiplication has precedence over Division: this is not what we want to do.

In this instance, if you parse:


2 + 3 - 5 + 6

you get:

parsetree2

Not quite the right tree.

A “naive” interpreter

The first attempt to interpret the expressions will be a visitor that operates on the concrete parse tree. I call it “naive” because you do not need to define an AST object model: you just traverse the parse tree and “manually” skip all the productions and terminals that you do not care about. The implementation of such a visitor is in the NaiveInterpreterVisitor class. To get an idea, you visit the nodes in the following way:


	public Double visitAlgebraicSum(AlgebraicSumContext context) {
		String operand = context.getChild(1).getText();
		Double left = context.expression(0).accept(this);
		Double right = context.expression(1).accept(this);
		if(operand.equals("+")){
			return left + right;
		}
		else if(operand.equals("-")){
			return left - right;
		}
		else
		{
			throw new ArithmeticException("Something has really gone wrong");
		}
	}

Here we first find what the operator is (remember that the AlgebraicSum node could store a sum or a difference): to do that (line 2) we get the second child of the current node (we know it must be a terminal) and then we get its text representation. In order to find out what the values of the left and right operands are, we pass the visitor (this) into their ‘accept’ method. Note that every “Context” object has convenient methods that correspond to the parsing functions of the related production, so that we can skip the noise of the terminals we do not like to interpret and go straight to the child nodes we care about. As a further example, consider the following method:

	@Override
	public Double visitInnerExpression(InnerExpressionContext context) {
		return context.expression().accept(this);
	}

Knowing how the corresponding production is defined

term: realNumber #Number
	| '(' expression ')' #InnerExpression;

we can skip the ugly parenthesis and immediately pass the visitor into the meaningful subtree.

The AST object model

Depending on the application needs, you may want an AST to represent the parsed strings. Unfortunately, starting from version 4, ANTLR does not provide you anymore with anything to automate this part of the process. Unlike other language recognition tools, it does not generate classes for the AST nodes based on some kind of annotations in the grammar file, nor it makes a parser that is able to build an AST instead of a concrete parse tree (These are things that you can do with JJTree + JavaCC). As I mentioned, It seems to be an aware design decision and this is where we left at the end of the previous episode. In order to work with ASTs, I went through the following steps:

  1. defining an object model for the AST;
  2. writing a visitor of the parse tree that produces a corresponding AST object.
  3. defining a visitor interface for the AST and writing an interpreter that implements that interface.

The object model is very straightforward for our little language. I put the classes in the org.merka.arithmetic.language.ast package. I would also put here a better UML class diagram of the model if I knew a tool that would not take me three hours. Here’s the best that I can do in terms of UML drawing:

ArithUML

Building the AST with a visitor of the concrete parse tree

The approach I took is to use a parse tree visitor to build the AST. You can also use ANTLR listeners instead of visitors, I think it depends much on your needs and personal taste. The builder visitor is implemented in the ASTBuilderVisitor class. It traverses the tree much like the naive interpreter does, skipping all the terminals and the productions that are not meaningful for the abstract syntax. This is an example of an AST node construction:

	@Override
	public ArithmeticASTNode visitAlgebraicSum(AlgebraicSumContext context) {
		String operand = context.getChild(1).getText();
		ArithmeticASTNode leftOperand = context.expression(0).accept(this);
		ArithmeticASTNode rightOperand = context.expression(1).accept(this);
		if(operand.equals(PLUS)){
			return new SumASTNode(leftOperand, rightOperand);
		}
		else if (operand.equals(MINUS)){
			return new DifferenceASTNode(leftOperand, rightOperand);
		}
		else{
			throw new ArithmeticException("Something has really gone wrong: operand '" + operand +"' comes as a complete surprise");
		}
	}

As you can see, it’s almost identical to its interpreter counterpart.

The AST based interpreter

Finally we can define a Visitor interface based on the AST:

public interface ArithmeticASTVisitor {

	public Number visitDifference(DifferenceASTNode differenceNode);
	public Number visitDivision(DivisionASTNode divisionNode);
	public Number visitMultiplication(MultiplicationASTNode multiplicationNode);
	public Number visitSum(SumASTNode sumNode);
	public Number visitNumber(NumberASTNode numberNode);
}

Nothing weird here: there is just one method for each concrete node type. We can now write the interpretation methods in a concrete visitor, here’s an extract:

	@Override
	public Number visitSum(SumASTNode sumNode) {
		Number leftOperand = (Number)sumNode.getLeftOperand().accept(this);
		Number rightOperand = (Number)sumNode.getRightOperand().accept(this);
		return leftOperand.doubleValue() + rightOperand.doubleValue();
	}

See the class EvaluationVisitor for the complete code. A side note: an obvious improvement would be to rewrite the grammar so that Sum and Difference, as well as Multiplication and Division, are in different productions, thus producing nodes of different type in the parse tree. That way we could avoid the ugly if – else in the visitSum and visitMultiplication method.

ANTLR4 project with Maven – Tutorial (episode 2)

[The reference tag for this episode is step-0.2.]

At the end of the previous episode we have been able to feed sentences to the parser and find out if they are valid (i.e. belong to the language) or not. In this post I will show you how you can implement a visitor to interpret the language.

At the end of every successful parsing operation, the parser produces a concrete syntax tree. The function parse<StartSymbol> of the parser returns an object of type <StartSymbol>Context, which represents the root node of the concrete syntax tree (it is a structure that follows the classic Composite pattern). In our case, take a look at the testJsonVisitor test method (forget about the “Json” part of the name, the method is named like this by mistake):


 @Test
 public void testJsonVisitor() throws IOException{
    String program = "sphere 0 0 0 cube 5 5 5 sphere 10 1 3";
    TestErrorListener errorListener = new TestErrorListener(); 
    ProgramContext context = parseProgram(program, errorListener);
 
    assertFalse(errorListener.isFail());
 
    BasicDumpVisitor visitor = new BasicDumpVisitor();
 
    String jsonRepresentation = context.accept(visitor);
    logger.info("String returned by the visitor = " + jsonRepresentation);

...
 
 }

After parsing the string, the test method instantiates a visitor object (BasicDumpVisitor) and provides it with the ProgramContext object as the input.

Let’s take a closer look at the visitor. If you use the -visitor option when calling the preprocessor, ANTLR, alongside with the parser, generates for you a basic interface for a visitor that can walk the concrete syntax tree. All you have to do is implementing that interface and make sure that the tree nodes are visited in the right order.

I created the BasicDumpVisitor class, the simplest visitor that I could think of: it walks the tree and creates a string (also known as “a program”) that, once parsed, gives back the visited tree. In other words it just dumps the original program that created the current contrete syntax tree.

The base visitor interface is declared as follows:


public interface ShapePlacerVisitor<T> extends ParseTreeVisitor<T>

The name’s prefix (ShapePlacer) is taken from the name of the grammar, as defined in the grammar source file. The interface contains a “visit” method for each node type of the parse tree, as expected. Moreover, it has a bunch of extra methods inherited by the base interface ParseTreeVisitor: see the source code to get an idea, they are quite self-explanatory. I report here one of the “visit” methods as an example. The other ones in the class follow a similar logic:


	public String visitShapeDefinition(ShapeDefinitionContext ctx) {
		StringBuilder builder = new StringBuilder();
		for (ParseTree tree : ctx.children){
			builder.append(tree.accept(this) + " ");
		}
		builder.append("\n");
		return builder.toString();
	} 

The interface is parametric: when you implement it, you have to specify the actual type: that will be the return type of each “visit” method.

So far, I have always written about “concrete syntax tree”. When we deal with interpreted languages, we usually want to manipulate an Abstract Syntax Tree (AST), that is, a tree structure that omits every syntactic detail that is not useful to the interpreter and can be easily inferred by the structure of the tree itself. In the case of our language, if we have, say, the string “sphere 1 1 1”, the parser creates for us a tree that looks like this:

  • program
    • shapeDefinition
      • sphereDefinition
        • SPHERE_KEYWORD
        • coordinates
          • NUMBER [“1”]
          • NUMBER [“1”]
          • NUMBER [“1”]

That is not ideal since, when it comes down to interpretation, we may want to work with something that looks like this:

  • sphereDefinition
    • coordinates
      • 1
      • 1
      • 1

or, depending on your needs, something even simpler, for example:

  • sphere definition
    • (1, 1, 1)

Unfortunately ANTLR 4, unlike its previous versions, does not allow for tree rewriting in the grammar file. As far as I know, this is a precise design decision. So, if you want to work with an AST of your invention, you have to build it yourself, i. e. you have to write the tree node classes and the code that traverses the concrete syntax tree and builds the AST. Hopefully, I will cover this case in the next episode.