Posts /

Building a Thrift parser with Parboiled and Java

19 Dec 2016

The project discussed in this article lives on Github.

Overview

I recently built a parser that can translate the contents of a Thrift file into an Abstract Syntax Tree. The library that I am open-sourcing today allows users to gather information about a given struct, service, or any other Thrift type defined in a Thrift file, without the use of reflection. Each node in the tree represents a Thrift type that is encountered during the parsing process, and is organized hierarchically under a root Document node. Being able to programatically navigate the structure of a Thrift file opens the door for all kinds of opportunties, so if you use this software in any interesting capacity, be sure to let me know and I’ll share it on the README.

Parboiled

During the course of this project, I leveraged the parboiled Java library, which supports recursive pattern matching, as well as something called a Value Stack. The Value Stack is a temporary storage mechanism, that can be manipulated in ways that make building an AST relatively straight forward (discussed in more detail below). Regardless of which language or library you use, the following steps will help guide your efforts when building a custom parser.


Understanding the Input

The first step in building a parser is figuring out the syntax rules of the input you will be parsing. Luckily for us, the Thrift interface definition language (IDL) is detailed on apache.org (see here), and the documentation helpfully includes the rule definitions in the following format:

[37] Literal         ::=  ('"' [^"]* '"') | ("'" [^']* "'")

If you are building a parser for your own custom DSL, or for an input whose syntax hasn’t been documented as clearly as above, you could spend a fair amount of time on this first step. I would encourage you to document the syntax rules for the target input completely before moving on to the rule creation phase (see below).

Create PEG Rules

Once the syntax has been defined, the next step is to translate these rules into parboiled expressions. These rules form what is called a Parsing expression grammar, or PEG. Using the parboiled Java library, we can express the rule above with the following Java code:

@SuppressWarnings({"InfiniteRecursion"})
@BuildParseTree
public class ThriftIdl extends BaseParser<Object> {

    /**
     * [37] Literal ::=  ('"' [^"]* '"') | ("'" [^']* "'")
     */
    Rule Literal() {
        return FirstOf(
                SingleQuotedLiteral(),
                DoubleQuotedLiteral());
    }

    Rule DoubleQuotedLiteral() {
        return Sequence(
                "\"",
                ZeroOrMore(NoneOf("\"")),
                "\"");
    }

    Rule SingleQuotedLiteral() {
        return Sequence(
                "'",
                ZeroOrMore(NoneOf("'")),
                "'");
    }
    
    // ...
}

The following table, which I found in the parboiled wiki, helps us make sense of some of the expressions above.

Name Common Notation parboiled Primitive
Sequence a b Sequence (a, b)
Ordered Choice a / b FirstOf (a, b)
Zero-or-more a * ZeroOrMore (a)
One-or-more a + OneOrMore (a)
Optional a ? Optional (a)
And-predicate & a Test (a)
Not-predicate ! a TestNot (a)

As you can see, each expression in the example above is responsible for matching some input text against a predefined rule. Once you have created all of your PEG rules, you will have a recognizer. Recognizers can be used to validate whether or not some type of input is valid (according to the rules you specify), but are still relatively limited in functionality.

Build an AST using the Value Stack

For more complex uses cases, we need to utilize the Value Stack during the parsing process, and build a structure that we can work with programatically. In order to create an AST, we need to push the nodes we are interested in onto the stack when certain rules are matched. I generally prefer to split up the work into 3 separate files:

For example:

# First file
@SuppressWarnings({"InfiniteRecursion"})
@BuildParseTree
public class ThriftAst extends BaseParser<Object> {

    ParserActions actions = new ParserActions();

    /**
     * [37] Literal ::=  ('"' [^"]* '"') | ("'" [^']* "'")
     */
    Rule Literal() {
        return FirstOf(
                SingleQuotedLiteral(),
                DoubleQuotedLiteral(),
                actions.pushLiteralNode());
    }
    
    // ...
}
# Second file
/**
 * Actions for manipulating the parboiled value stack.
 */
class ParserActions {
	
    Action pushLiteralNode() {
        return new Action() {
            @Override
            public boolean run(Context context) {
                LiteralNode node = new LiteralNode(context.getMatch().trim());
                context.getValueStack().push(node);
                return true;
            }
        };
    }
	
    // ...
}
# Third file
public class Nodes {

    //===================================================================
    // Node classes
    //===================================================================

    public static class AstNode extends ImmutableGraphNode {

        public String toString() {
            return String.format("%s:", this.getClass().getSimpleName());
        }

        @Override
        public java.util.List<AstNode> getChildren() {
            return new ArrayList<>();
        }
    }
    
    public static class LiteralNode extends AstNode {
        public String value;

        LiteralNode(String value) {
            this.value = value;
        }
    }

Another thing I would highly recommend is making sure your AST nodes inherit from a common class, and that your base class is one of the types defined in the org.parboiled.trees package. This isn’t required, but utilizing one of these base classes will allow you to take advantage of some additional functionality that was added by the parboiled authors.

For example, in our Thrift parser, all of the AST nodes inherit from ImmutableGraphNode:

public static class AstNode extends ImmutableGraphNode {
    // ...
}

This allows us to use some really cool functions out of the box, without further work. For example, to visualize our AST, we can simply use the GraphUtils.printTree method.

What Now?

Checkout the parboiled library and the java-thrift-parser on Github, and try writing your own parser. It was a huge learning experience for me, and I highly recommend taking on a project with parboiled for anyone interested in building their own parser.