July 2, 2010

Zen and the Art of Text Editor Programming

I like technology. I surround myself with computers and electronics, I put a lot of faith in the scientific method as a general solution to life's daily problems, and I believe the pursuit of knowledge is one of the purest aspirations one can have.

I also realize that it can go horribly wrong, which is why when I was idly wandering through the vast information jungle that is the Internet, I was so pleased to stumble across this gem of a program:


It's a text editor. What distinguishes it from a lot of other software is that it takes such a focused approach to solving one particular problem: providing a peaceful, distraction-free environment so that a writer can focus on writing.

Now you're probably thinking, "I thought you were a vi bigot. What gives? Have you renounced your faith and joined the mindless horde of infidels?" The answer is, of course, no – I will probably use vi for code until the day my fingers fall off. Still, I don't always write code. Sometimes I write words (real ones, with punctuation and everything!), like the ones you're reading now. When I write prose, all the little tricks that make vi a great programming editor are somewhat distracting. I want to write. Just me and the words.

This is where OmmWriter excels. It's a full-screen editor with rudimentary features. You can load files and save files. You can choose one of three fonts and one of three font sizes. It only edits text files, so you cannot format anything. You can adjust the writing area, and you can scroll up and down if necessary. Finally, you can change the background image and sound effects. Now there are probably two questions in your head: "why is he listing features?" and "sound effects!?" To answer the first, it's because I just enumerated ALL of the features. All of them. It is a wonderfully minimalist editor. To answer your second question, don't knock it until you've tried it. I really thought that the sound/background thing was just a huge art gimmick to get publicity for their program, but I've been writing on it all day, and I actually find it really effective. It doesn't distract, and it puts me in a great, relaxed frame of mind.

So in conclusion, here's to you, artists. The unlikely combination of a bit of programming skill, a light brush, and a very focused vision for the right way to write has produced a wonderful little tool and won me over completely.

...oh, and I should probably also mention that because I am an unabashed terminal fanatic, I couldn't stand not being able to run OmmWriter as a command-line tool. So I fixed that little oversight: omm. You're welcome!

June 3, 2010

Stupid Parser Tricks, Part 1: Two Lexers, One Token Stream

So lately I’ve been working with ANTLR, a lexer/parser generator by Terence Parr at the University of San Francisco. It’s been pretty enjoyable, for the most part. There have been a couple of times where I wanted to pull my hair out, but overall, it has really saved me a lot of time and effort in my attempt to write a DSL for some folks at work.

Anyways, in the course of working with it, I stumbled across a couple of neat hacks that I thought I’d share. The first is how to write a hybrid lexer for an embedded DSL. I’m not the first to try something like this: Parr’s island-grammar example (examples-v3.tar.gz) shows how to traverse a hybrid grammar but stops short of merging the ASTs, and there’s a pretty hairy discussion of a more complicated scenario on the ANTLR wiki. Neither of these, though, lets us make two lexers transparently behave like one (which makes the programmer’s life easier and the downstream code cleaner, if you can get away with it).

Basically, the recognizer classes that ANTLR builds for a grammar are self-contained enough to be called recursively without exploding. The island-grammar example I linked above does this, and it’s a very clever feature. What it doesn’t show you is that it’s possible to splice together the two token streams. The only tricky part is that by default, an ANTLR lexer emits at most one token per lexer rule (ostensibly for efficiency reasons). This means that when the lexer for the embedded language is invoked from inside a rule of the outer lexer, we need to modify the base recognizer to handle the deluge of tokens that is produced. It’s not a difficult fix -- we just add a token buffer in-line with the function responsible for passing tokens up the chain and modify the emit function to feed the buffer instead. Once that plumbing is in place, we drop the embedded lexer in, collect the tokens, and feed them one by one to the outer lexer. The driver simply calls the outer lexer and the parser never knows the difference.
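To make the plumbing concrete, here is a toy sketch of the idea in plain Python. It does not use the ANTLR runtime at all: BufferedLexer, InnerLexer, OuterLexer, the << >> island delimiters, and every method name are made up for illustration, and the regex rules stand in for what a generated lexer would do. The shape should carry over, though: emit feeds a buffer, the token-fetching function drains it, and the outer lexer pours the inner lexer's tokens into its own buffer.

```python
import re
from collections import deque, namedtuple

Token = namedtuple("Token", ["type", "text"])
EOF = Token("EOF", "")


class BufferedLexer:
    """Toy stand-in for a generated lexer; not the ANTLR runtime API."""

    rules = []  # (token type, compiled regex) pairs, tried in order

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.buffer = deque()  # tokens waiting to be handed upstream

    def emit(self, token):
        # Feed the buffer instead of overwriting a single "current token" slot.
        self.buffer.append(token)

    def next_token(self):
        # Run rules until something is buffered, then hand out one token.
        while not self.buffer:
            if self.pos >= len(self.text):
                return EOF
            self.match_one()
        return self.buffer.popleft()

    def match_one(self):
        for type_, pattern in self.rules:
            m = pattern.match(self.text, self.pos)
            if m:
                self.pos = m.end()
                self.on_match(type_, m)
                return
        raise ValueError(f"unexpected character at offset {self.pos}")

    def on_match(self, type_, m):
        if type_ != "WS":  # whitespace is skipped
            self.emit(Token(type_, m.group()))


class InnerLexer(BufferedLexer):
    """The embedded language: integers and arithmetic operators."""

    rules = [
        ("WS", re.compile(r"\s+")),
        ("NUM", re.compile(r"\d+")),
        ("OP", re.compile(r"[-+*/]")),
    ]


class OuterLexer(BufferedLexer):
    """The host language: words, plus << ... >> islands of the inner one."""

    rules = [
        ("WS", re.compile(r"\s+")),
        ("ISLAND", re.compile(r"<<(.*?)>>", re.S)),
        ("WORD", re.compile(r"\w+")),
    ]

    def on_match(self, type_, m):
        if type_ == "ISLAND":
            # One outer rule fires, many tokens come out: run the inner
            # lexer over the island body and splice its tokens into ours.
            inner = InnerLexer(m.group(1))
            tok = inner.next_token()
            while tok.type != "EOF":
                self.emit(tok)
                tok = inner.next_token()
        else:
            super().on_match(type_, m)


def tokens(lexer):
    """Driver: it only ever talks to the outer lexer."""
    out = []
    tok = lexer.next_token()
    while tok.type != "EOF":
        out.append(tok)
        tok = lexer.next_token()
    return out


stream = tokens(OuterLexer("total is <<1 + 22>> units"))
print([(t.type, t.text) for t in stream])
```

The driver sees one flat stream in which WORD tokens and NUM/OP tokens sit side by side, which is the whole point: nothing downstream has to know a second lexer was involved.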

There’s not enough space to walk through the code, but I’ve provided a Python implementation below. Enjoy!

stupid_parser_tricks.tgz (tarball)
stupid_parser_tricks (individual files)

(caveat emptor: if you don’t have lexer-level syntax for delineating the embedded language, you’re out of luck with this method. That’s where the more complicated scenario I mentioned arises.)