June 3, 2010

Stupid Parser Tricks, Part 1: Two Lexers, One Token Stream

So lately I’ve been working with ANTLR, a lexer/parser generator by Terrence Parr at UCSF. It’s been pretty enjoyable, for the most part. There have been a couple of times where I wanted to pull my hair out, but overall, it has really saved me a lot of time and effort in my attempt to write a DSL for some folks at work.

Anyways, in the course of working with it, I stumbled across a couple of neat hacks that I thought I’d share. The first is how to write a hybrid lexer for an embedded DSL. I’m not the first to try something like this: Parr’s island-grammar example (examples-v3.tar.gz) shows how to traverse a hybrid grammar but stops short of merging the ASTs, and there’s a pretty hairy discussion of a more complicated scenario on the ANTLR wiki, but neither of these enable us to make two lexers transparently behave like one (which makes the programmer’s life easier and the downstream code cleaner, if you can get away with it).

Basically, the recognizer classes that ANTLR builds for a grammar are self-contained enough to be called recursively without exploding. The island-grammar example I linked above does this, and it’s a very clever feature. What it doesn’t show you is that it’s possible to splice together the two token streams. The only tricky part is that by default, ANTLR is not written to handle recording multiple tokens for a single lexer rule (ostensibly for efficiency reasons). This means that when the lexer for the embedded language is invoked, we need to modify the base recognizer to handle the deluge of tokens that is produced. It’s not a difficult fix -- we just add a token buffer in-line with the function responsible for passing tokens up the chain and modify the emit function to feed the buffer instead. Once that plumbing is in place, we drop the embedded lexer in, collect the tokens, and feed them one by one to the outer lexer. The driver simply calls the outer lexer and the parser never knows the difference.

There’s not enough space to walk through the code, but I’ve provided a python implementation below. Enjoy!

stupid_parser_tricks.tgz (tarball)
stupid_parser_tricks (individual files)

(caveat emptor: if you don’t have lexer-level syntax for delineating the embedded language, you’re out of luck with this method. That’s where the more complicated scenario I mentioned arises.)