Julien Ponge is the original developer and maintainer of the IzPack
project. He is a PhD student at the University Blaise Pascal in
Clermont-Ferrand, France and at the Computer School of Engineering at
the University of New South Wales in Sydney, Australia. He started
programming at the age of 11 and still enjoys learning about new
techniques.
I have recently had to build a tool were the users can enter some form of constraints. The constraints are of the form C-Invoke((T1 < 3 && T2 >= 8) || T3 < 3) or M-Invoke(T1 = 3). The idea here is that we have C-Invoke and M-Invoke constraints which contains variables and constants that can be compared. The user can enter these constraints in simple text fields in a GUI. Hence, storing constraints as simple strings could be enough at first sight. However, I also need to be able to do some fancier stuff like computing the negation of a constraint, or the conjunction of two of them.
In this article, I will show you how to build an object model for this constraints language, and how to parse it from plain text.
You have probably had a languages compilation theory lecture during your studies, and you have probably had to grok with the (in)famous Lex and Yacc tools to build simple lexers, parsers and maybe even a compiler for some form of simple language. This is probably why many programmers prefer to half-turn in front of such tools and opt for another way to define a language, like building an XML schema to avoid having to deal with parsing procedures and the likes. If you are an astute programmer, you might also know about the XML programming dilemna, were you end up writing cumbersome XML files instead of using a more simple domain-specific language. The idea of this entry is to show that in many cases, using a lexer / parser generator is not as hard as it seems, and that it can greatly boost the power of your tool and the user experience.
In order to use my constraints language from freeform text entered by the user, I have used ANTLR which is a lexer / parser generator that greatly surpasses Lex / Yacc and their clones. While being developped in Java, ANTLR can generate code in C++, Java, C# and Python from the same grammar definition (delta some target-specific tweaks). Instead of showing you every detail, I will show you the steps that I have followed to successfuly implement my constraints language. One note about the tooling support: I recommend the ANTLR plug-in for Eclipse which will give you an instant feedback on the potential problems in your grammar.
The first thing to do is to define an object model for the constraints language. A tree representation is definatly natural, so I have defined two interfaces. The first one is called IConstraintNode which is the most general one and defines one method: negate() which is expected to return a IConstraintNode for the negation of the constraint. Next, I have defined IRootConstraintNode which inherits from IConstraintNode but hads getLeftChild() and getRightChild() to retrieve the left and right children of the node. Then, there is a concrete class for each kind of element that can constitute a node in the tree that represents a constraint. Here is an illustration:

The nodes in yellow inherit from IRootConstraintNode while the others inherit from IConstraintNode. Once you have done this, you can validate your object model by creating a strong JUnit-based tests suite. One tip: override the toString() methods in your nodes classes, this will make the testing easier, like in this extract from my tests suite:
VariableNode var = new VariableNode("T1");
ConstantNode cst = new ConstantNode(5);
ComparisonNode node = new ComparisonNode(ComparisonNode.LESS, var, cst);
CInvokeNode ciNode = new CInvokeNode(node);
TestCase.assertEquals("C-Invoke(T1 < 5)", ciNode.toString());
The next step is to be able to parse a textual representation of a constraint, and instantiate the corresponding nodes tree in our object model. We need to start with the lexer. Briefly, the lexer is there to recognize elements such as operators, variables or constants in the text. Here is my lexer:
class TemporalConstraintLexer extends Lexer;
options
{
k = 2;
defaultErrorHandler = false;
}
CINVOKE : "C-Invoke";
MINVOKE : "M-Invoke";
LPAREN : "(";
RPAREN : ")";
CONST : ('0'..'9')+;
VAR : ('a'..'z' | 'A'..'Z' | '_')
('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
COMPOP : ("=" | "!=" | "<=" | "<" | ">=" | ">");
BOOLOP : ("&&" | "||");
WS : (' ' | '\n' | '\r' | '\t')
{
$setType(Token.SKIP);
};
ANTLR generates LL(k) lexers and parsers. Without entering the academic details, k defines the look-ahead depth. With k = 2, the lexer will be able to distinguish between < and <=. Indeed, unlike Lex, ANTLR doesn't put a priority in the declarations. With k = 1 (which is the default), ANTLR will issue a warning. In practice, any warning from ANTLR must be fixed, else you will always have problems at some point
defaultErrorHandler is useful in the final version of your grammar, so that the application in which you embed your parser can catch errors when the input is invalid. Finally, $setType(Token.SKIP); allows to define elements to skip like whitespaces, returns or tabs.
Once this is done, you need the parser, which will build a representation in Abstract Syntactic Tree (AST) of your input. Here's mine:
class TemporalConstraintParser extends Parser;
options
{
buildAST = true;
k = 2;
defaultErrorHandler = false;
}
constraint : ciConstraint | miConstraint;
miConstraint!
:
func:MINVOKE param:miConstraintParam
{
#miConstraint = #(func, #param);
}
;
miConstraintParam : LPAREN! miExpr RPAREN!;
ciConstraint!
:
func:CINVOKE param:ciConstraintParam
{
#ciConstraint = #(func, #param);
}
;
ciConstraintParam : LPAREN! ciExpr (BOOLOP^ ciExpr)* RPAREN!;
miExpr : VAR COMPOP^ CONST
| CONST COMPOP^ VAR
| LPAREN! miExpr RPAREN!;
ciExpr : VAR COMPOP^ CONST
| CONST COMPOP^ VAR
| LPAREN! ciExpr (BOOLOP^ ciExpr)* RPAREN!
;
Everything here is straightforward. ^ defines a root tree. As an example, a ciExpr can have a COMPOP as a root, and VAR + CONST as children. You do need to understand the way AST are built in ANTLR by reading the manual, else you will have nasty bugs as I had. Indeed, CINVOKE as to be a root of the rest in my grammar, but I always ended with an operator beein the topmost one. I have had to override the default AST construction behavior for ciConstraint. I give identifiers for the parts of the expression, and I build the tree representation that I want by using a tree syntax of the form #(root child1 child2 ...). What you should keep in mind here is that the default AST construction defines the topmost root as the last element with ^.
The analogy with Lex / Yacc stops here. From there you could use the Visitor design-pattern and walk the AST obtained from the parser. However this is can be difficult to write, and ANTLR provides a way to generate tree parsers. A tree parser is in fact a visitor, but which operates on tree patterns. Again, the tree notation will be used here. Code is worth thousands words, so here it is:
class TemporalConstraintTreeWalker extends TreeParser; terminal returns IConstraintNode node { node = null; } : v:VAR { node = new VariableNode(v.getText()); } | c:CONST { node = new ConstantNode(Integer.parseInt(c.getText())); } ; rootNode returns IRootConstraintNode node { IConstraintNode left; IConstraintNode right; node = null; } : #(o1:COMPOP left=terminal right=terminal) { if (left instanceof VariableNode) { node = new ComparisonNode(o1.getText(), (VariableNode) left, (ConstantNode) right); } else { node = new ComparisonNode(o1.getText(), (ConstantNode) left, (VariableNode) right); } } | #(o2:BOOLOP left=rootNode right=rootNode) { node = new BooleanNode(o2.getText(), (IRootConstraintNode)left, (IRootConstraintNode)right); } ; constraint returns IConstraintNode constraint { constraint = null; IRootConstraintNode root; } : #(CINVOKE root=rootNode) { constraint = new CInvokeNode(root); } | #(MINVOKE root=rootNode) { constraint = new MInvokeNode(root); } ;
The object model is instanciated by recognizing patterns and elements in the AST. It's as simple as that, and from this point, you can parse constraints, manipulate them, and get back to a canonical textual representation 
I will finish by insisting on the need to define a proper JUnit test suite to validate the generated parser. You will also probably find it useful to have a small class to manually enter text and see how it is parsed. Here is mine:
public static void main(String args) { // Ex: C-Invoke|| (T3 = 7))) // Ex: C-Invoke
// Ex: M-Invoke(T1 = 3) // Ex: C-Invoke(T1 < 3) try { TemporalConstraintLexer lexer = new TemporalConstraintLexer(System.in); TemporalConstraintParser parser = new TemporalConstraintParser(lexer); parser.constraint(); CommonAST tree = (CommonAST) parser.getAST(); System.out.println(tree.toStringList()); TemporalConstraintTreeWalker walker = new TemporalConstraintTreeWalker(); IConstraintNode constraint = walker.constraint(tree); System.out.println(constraint); } catch (Exception e) { e.printStackTrace(); } }
I hope that this will have convinced you that lexer and parser generator are not always evil to build, and that you have probably many use cases in your own applications. By the way, the grammar code here is probably not perfect, but at least it works fine for me
If you are interested in more elaborated languages, you can think about producing Java bytecode by using an ASM library such as ObjectWeb ASM.