Alex and Happy: Parsing comments and strings

Aug. 17, 2015

This is an excerpt from my ongoing book project on Alex and Happy — which are Haskell tools for building parsers and lexers. The code examples are available here on GitHub.

In this chapter we discuss how to parse two features of programming language grammars: strings and comments.

Given the text

/*
This whole section commented out
/* A sample program */
*/
"Bo\"bcat" cat cat cat
"/*Comment*/" cat cat /*cat*/

we expect our program to tell us that the person called Bo"bcat has three cats and the person called /*Comment*/ has two cats. The names of the cat owners are given as strings delimited by double-quote characters, with the escape sequence \" standing for a double quote within a name. /* and */ denote the beginning and end of comments. Comments can span multiple lines, and nested comments are allowed.

Comments and strings follow very different rules of lexical analysis than the rest of the program. Most text within strings and comments is not interpreted; escape sequences within strings, on the other hand, are. The nesting of comments poses a further problem: a regular expression cannot count arbitrary nesting depth, so the depth of nested comments must be tracked elsewhere.

Token definitions

Startcodes are the solution alex provides to this problem. A startcode is just a number, and each rule in an alex specification carries a set of startcodes. When we call alex to find a token, we specify the startcode to use, and alex then considers only the rules belonging to that startcode.
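To make this concrete, here is a sketch of what startcode-annotated comment rules can look like in an alex specification. The rule layout follows alex's syntax; the action names match the actions discussed below, but the exact rules are illustrative, not necessarily the book's code:

```haskell
<0>          "/*"   { beginComment }  -- enter comment mode from normal lexing
<commentSC>  "/*"   { beginComment }  -- nested comment: bump the depth
<commentSC>  "*/"   { endComment }    -- leave one nesting level
<commentSC>  [.\n]  ;                 -- discard everything else inside a comment
```

The rules prefixed with `<commentSC>` are only considered while the current startcode is commentSC, which is how commented-out text escapes normal tokenization.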

A lexer and parser for the language of this chapter are in the startcode/ folder of the GitHub repository for this book. They use the monadic interfaces of alex and happy, which are discussed in Chapter 3.

Let’s look at the token definitions in Lexer.x first. The initial definitions are ordinary ones that skip whitespace and define the cat token:
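Such initial definitions typically look like the following sketch (the action helper plainTok and the token constructor TCat are assumptions, not necessarily the names used in the book's code):

```haskell
tokens :-
  <0> $white+  ;                  -- skip whitespace in normal mode
  <0> "cat"    { plainTok TCat }  -- emit the cat token
```

Note that both rules carry startcode 0, so they apply only during ordinary lexing, not inside a comment or string.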

Comment lexing actions

The implementation of the comment lexing actions is similar. The action for the beginning of a comment changes the startcode to commentSC and increments commentDepth. The action for the end of a comment decrements commentDepth, setting the startcode back to 0 when commentDepth reaches 0.

beginComment :: LexAction
beginComment _ _ = do
  s <- get
  put s { lexSC = commentSC
        , commentDepth = commentDepth s + 1 }
  return Nothing

endComment :: LexAction
endComment _ _ = do
  s <- get
  let cd  = commentDepth s
      sc' = if cd == 1 then 0 else commentSC
  put s { lexSC = sc', commentDepth = cd - 1 }
  return Nothing
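Both actions assume a lexer state record threaded through the monad. A minimal sketch of such a state, using the field names that appear in the code above (the record and constructor names are assumptions), could be:

```haskell
data LexState = LexState
  { lexSC        :: Int        -- current startcode (0 or commentSC)
  , commentDepth :: Int        -- current comment-nesting depth
  , input        :: AlexInput  -- remaining input for alexScan
  }
```

Keeping the depth as a plain counter in the state is what lets the lexer handle arbitrarily nested comments, which a regular expression alone cannot do.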

The driver

The monadic action readToken in Lexer.x provides the interface with alex. The following lines call alexScan with the startcode saved in the state:

s <- get
case alexScan (input s) (lexSC s) of

When alex returns AlexToken, indicating a match, we call the corresponding action, looping back by calling ourselves if the action returns Nothing, meaning that no token is generated.

-- [...]
AlexToken inp' n act -> do
  let AlexInput{airest = buf} = input s
  put s{ input = inp' }
  res <- act n (take n buf)
  case res of
    Nothing -> readToken
    Just t  -> return t
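The other constructors of alex's AlexReturn type are handled along the usual lines; a sketch of those branches (the TEOF token and the error handling are assumptions, not the book's exact code) might read:

```haskell
AlexEOF ->
  return TEOF              -- assumed end-of-input token for the parser
AlexError _ ->
  error "lexical error"    -- simplistic; a real lexer would report the position
AlexSkip inp' _ -> do
  put s{ input = inp' }    -- drop the skipped text
  readToken                -- and continue scanning
```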

The rest of the code is routine. lexer in Lexer.x wraps readToken in the interface expected by happy. The happy parser itself is in Parser.y, and main in Main.hs parses text read from the standard input.
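With happy's threaded-lexer interface, such a wrapper hands each token to a continuation supplied by the generated parser. A sketch, assuming a parsing monad P and a Token type along the lines of the monadic interface from Chapter 3, might be:

```haskell
lexer :: (Token -> P a) -> P a
lexer cont = do
  tok <- readToken   -- run the alex-driven lexer for one token
  cont tok           -- pass it to the happy-generated parser
```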