Alex and Happy: Parsing comments and strings

Aug. 17, 2015

This is an excerpt from my ongoing book project on Alex and Happy — which are Haskell tools for building parsers and lexers. The code examples are available here on GitHub.

In this chapter we discuss how to parse two features of programming language grammars: strings and comments.

Given the text

/*
This whole section commented out
/* A sample program */
*/
"Bo\"bcat" cat cat cat
"/*Comment*/" cat cat /*cat*/

we expect our program to tell us that the person called Bo"bcat has three cats and the person called /*Comment*/ has two cats. The names of the cat owners are given as strings delimited by double quote characters, with the escape sequence \" used for a double quote within a name. /* and */ denote the beginning and end of comments. Comments can span multiple lines, and nested comments are allowed.

Comments and strings have very different lexical rules than the rest of the program. Most text within strings and comments is not interpreted; escape sequences within strings, on the other hand, are. The nesting of comments poses another problem: a regular expression cannot remember an arbitrary depth of nesting, so the depth of nested comments must be tracked elsewhere.

Token definitions

Startcodes are the solution alex provides to this problem. A startcode is just a number, and each rule in an alex specification carries a set of startcodes. When we call alex to find a token we specify the startcode to use, and alex then considers only the rules belonging to that startcode.

A lexer and parser for the language of this chapter is in the startcode/ folder of the GitHub repository of this book. It uses the monadic interface of alex and happy, which is discussed in Chapter 3.

Let’s look at the token definitions in Lexer.x first. The initial definitions are ordinary ones that skip whitespace and define the cat token:

<0>         $white+           ;
<0>         cat               {plainTok TCat}

Note the <0>. This specifies the startcode for which these rules apply; 0 is the default startcode, which we will be using for the main program text. Beware of a common error: if you leave out the startcode at the beginning of a rule, that rule applies to all startcodes. This can bite you when you have a lexer with multiple startcodes.
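For example, if the whitespace rule were written without its startcode prefix (a hypothetical mistake, not the code in the repository):

$white+           ;

it would apply in every startcode, so a space inside a string would be silently skipped instead of being appended to the string buffer.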

Now for the patterns for strings:

<0>         \"              {beginString}
<stringSC>  \"              {endString}
<stringSC>  .               {appendString}
<stringSC>  \\[nt\"]        {escapeString}

The first rule, which has startcode 0, matches the double quote in ordinary program text that starts a string. stringSC defines a new startcode. In the Haskell code generated by alex, stringSC will be defined with a numeric value that can be used in other parts of the program. The action beginString, whose implementation we will discuss below, ensures that later invocations of alexScan use the startcode stringSC.
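In the generated Lexer.hs these names show up as ordinary top-level bindings, along the lines of the following (the exact values depend on the specification; only 0, the default, is fixed):

stringSC = 1
commentSC = 2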

The second rule also matches a double quote, but in the context of the startcode stringSC, and so recognizes the end of the string. The action endString yields the token constructed out of the string and ensures that future invocations of alexScan happen with startcode 0.

The last two rules accumulate characters in the string. The first matches ordinary characters and the second the escape sequences \n, \t and \". Although the catch-all . also matches a backslash, alex always prefers the longest match, so the two-character escape sequences are handled by the second rule.

The patterns for the comments are similar:

<0,commentSC>  "/*"     {beginComment}
<commentSC>    "*/"     {endComment}
<commentSC>    [.\n]    ;

The first rule recognizes the beginning of a comment. The prefix <0,commentSC> specifies that it applies both in ordinary program text and in the middle of a comment, when the startcode is commentSC. This is what handles nested comments. The second rule handles the end of a comment. Between them, beginComment and endComment must keep track of the nesting level, with endComment switching back to startcode 0 only when all levels of comments have been closed. In the sample text above, for instance, the inner /* raises the nesting level to two, so the first */ merely brings it back to one and lexing stays in commentSC until the outer */.

The last rule just ignores all characters in a comment. Remember that the catch-all character class . does not match newlines. Therefore we include an explicit \n to allow comments to span lines.

The state

We use the State monad as our parser/lexer monad. The state has type ParseState, which keeps track of the nesting level of comments and accumulates characters in strings. The accompanying type AlexInput keeps track of line numbers. The implementations of these types are in the file LexParseCommon.hs.

data AlexInput
  = AlexInput {
    aiprev::Char,
    aibytes::[Word8],
    airest::String,
    ailineno::Int}
  deriving Show

-- [...]
data ParseState =
     ParseState {input::AlexInput,
                 lexSC::Int,         -- Lexer startcode
                 commentDepth::Int,  -- Comment depth
                 stringBuf::String   -- Temporary storage for strings
                }
     deriving Show
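The parse monad P used by the lexer actions below can then be a plain synonym over this state. Here is a minimal sketch of how it might be set up, assuming a helper named evalP and alex's usual convention of '\n' as the character before the start of the input (the actual definitions are in LexParseCommon.hs):

import Control.Monad.State

type P a = State ParseState a

-- Run a parse action on an input string, starting in startcode 0
-- with no open comments and an empty string buffer.
evalP :: P a -> String -> a
evalP m str = evalState m initialState
  where initialState = ParseState {
          input = AlexInput {aiprev = '\n', aibytes = [],
                             airest = str, ailineno = 1},
          lexSC = 0,
          commentDepth = 0,
          stringBuf = ""}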

The lexer actions

Let’s now go back to the lexer actions in Lexer.x. The lexer actions have the type

type LexAction = Int->String->P (Maybe Token)

The arguments are the length of the match and the matched string. The result is Maybe Token, since some actions, such as those marking the beginning of a string or accumulating characters within one, do not yield a token.
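The simplest such action is plainTok, which we used for the cat rule earlier. A plausible definition (a sketch; the repository's version may differ) ignores both arguments and wraps the given token:

plainTok :: Token -> LexAction
plainTok t _ _ = return (Just t)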

String lexing actions

beginString sets the startcode to stringSC. appendString and escapeString add characters to the beginning of stringBuf. endString sets the startcode back to 0 and returns the string saved in stringBuf, reversed, since the characters were added at the front. Prepending is used rather than appending because adding to the front of a Haskell list is O(1) while adding to the end is O(n). For example, after lexing "Bo\"bcat" the buffer holds tacb"oB, which endString reverses. The actions get and put, for reading and writing the state, are provided by the State monad.

beginString::LexAction
beginString _ _ = do
  s <- get
  put s{lexSC = stringSC}
  return Nothing

appendString::LexAction
appendString _ (c:_) = do
  s <- get
  put s{stringBuf = c:(stringBuf s)}
  return Nothing

escapeString::LexAction
escapeString _ (_:c:_) = do
  let unesc =
        case c of
          'n' -> '\n'
          't' -> '\t'
          '"' -> '"'
  s <- get
  put s{stringBuf = unesc:(stringBuf s)}
  return Nothing
        
endString::LexAction
endString _ _ = do
  s <- get
  let buf = stringBuf s
  put s{lexSC = 0, stringBuf = ""}
  return $ Just $ TString (reverse buf)

Comment lexing actions

The implementation of the comment lexing actions is similar. The beginning of a comment changes the startcode to commentSC and increments commentDepth. The ending of a comment decrements commentDepth, setting the startcode back to 0 when commentDepth reaches 0.

beginComment::LexAction
beginComment _ _ = do
  s <- get
  put s {lexSC = commentSC,
         commentDepth = (commentDepth s)+1}
  return Nothing

endComment::LexAction
endComment _ _ = do
  s <- get
  let cd = commentDepth s
  let sc' = if cd==1 then 0 else commentSC
  put s {lexSC=sc',commentDepth=cd-1}
  return Nothing

The driver

The monadic action readToken in Lexer.x provides the interface to alex. The following lines call alexScan with the startcode saved in the state:

   s <- get
   case alexScan (input s) (lexSC s) of  

When alex returns AlexToken, indicating a match, we call the corresponding action, looping back by calling ourselves if the action returns Nothing, indicating that no token was generated.

-- [...]
    AlexToken inp' n act -> do 
      let (AlexInput{airest=buf}) = input s
      put s{input = inp'}
      res <- act n (take n buf)
      case res of
        Nothing -> readToken
        Just t -> return t
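For completeness, here is a sketch of the whole loop, with the remaining cases of alexScan's result filled in (the TEOF token and the error message are my assumptions; the actual code is in Lexer.x):

readToken :: P Token
readToken = do
  s <- get
  case alexScan (input s) (lexSC s) of
    AlexEOF -> return TEOF
    AlexError inp' ->
      error ("Lexical error on line " ++ show (ailineno inp'))
    AlexSkip inp' _ -> do           -- e.g. skipped whitespace
      put s{input = inp'}
      readToken
    AlexToken inp' n act -> do
      let AlexInput{airest = buf} = input s
      put s{input = inp'}
      res <- act n (take n buf)
      case res of
        Nothing -> readToken        -- no token produced; keep scanning
        Just t -> return t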

The rest of the code is routine. lexer in Lexer.x wraps readToken in the interface expected by happy. The happy parser itself is in Parser.y. main in Main.hs parses text read from standard input.
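In happy's monadic, threaded-lexer mode that wrapper is a one-liner in continuation-passing style (a sketch, assuming the standard shape):

lexer :: (Token -> P a) -> P a
lexer cont = readToken >>= cont

happy calls lexer each time it needs the next token, passing the rest of the parse as the continuation.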