Pig New Parser

Background

Presently, Pig uses Javacc as a combined lexical and syntactic analysis tool for Pig scripts, which may be given either as file content or as command-line input. While it gets the job done, it also poses a few problems that Pig users and developers have been complaining about:

To address these issues, a new parser is needed to replace Javacc. We expect the following features from the new parser, which are categorized as must-have and nice-to-have.

Must-have:

Nice-to-have:

Proposed Parsing Tool

We have compiled and compared a list of available parsing tools. Please check wikipedia.org for an exhaustive list. Because of the license requirement, our selection is actually quite limited. However, one particular tool has attracted our attention: ANTLR.

ANTLR stands for ANother Tool for Language Recognition and uses LL(*) parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), first developed in 1989, and is under active development. Its maintainer is professor Terence Parr of the University of San Francisco. The current version is 3.x.

The good thing about ANTLR is that it meets almost all of the must-have features and some of the nice-to-have features. The main downside is the parsing algorithm: LALR is what the majority of us learned in school, while ANTLR, unfortunately, generates LL(*) parsers.

An LL(*) parser takes a top-down approach, which has difficulty parsing grammar rules that contain left recursion. To overcome this, left-recursive rules must be rewritten, and backtracking may be needed, which has a performance impact and potentially a larger memory footprint. Unfortunately, the Pig grammar does contain such recursion. It's hoped that the performance cost is acceptable.
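To make the left-recursion issue concrete, here is a sketch (the rule names are invented for illustration, not taken from the actual Pig grammar) of how a left-recursive rule must be rewritten for an LL parser:

```antlr
// Left-recursive form: an LL(*) parser cannot handle this directly,
// because expr would call itself before consuming any input.
//   expr : expr PLUS term | term ;

// Equivalent form using EBNF repetition, which LL(*) parsers accept:
expr : term ( PLUS term )* ;
```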

In any case, ANTLR is our focus, and this document introduces ANTLR as our new parsing technology to address the issues mentioned above.

This project is closely related to Pig Turing completeness. Thus, in the following design discussions, we have included corresponding considerations.

Design

Present Design

In order to understand how ANTLR will fit into the Pig frontend, here is a basic workflow describing how the current parser works.

Proposed Design

As mentioned above, PigServer uses QueryParser, which is generated by Javacc. Since we are replacing Javacc, we will instead have a new parser generated by ANTLR. However, the new parser will not generate the logical plan directly. As a matter of fact, there will be more than one parser: one grammar parser and multiple tree parsers that together accomplish parsing, semantic analysis, etc.

In addition, we will completely decommission the old logical plan. Currently, there are two classes for the logical plan. The output of QueryParser is the old logical plan, which is also used for semantic analysis. It is then converted to the new logical plan, which is optimized and converted to a physical plan. With the new parser, an AST will be generated and used for semantic analysis. In this process, the AST will be transformed in stages and eventually into the new logical plan. The old logical plan class and related classes will then no longer be needed.

Based on what ANTLR offers, we will have one lexer, which converts the input character stream to tokens, one grammar parser, which generates AST from the tokens, and a few tree parsers, which further process the AST.

The grammar parser will support an extended Pig grammar, including new constructs for scoped aliases, parameter markups, and module imports. These new constructs are only valid in raw, embedded Pig scripts; they are disallowed in a Pig script after parameter substitution. Note: there was a discussion about limiting the places where parameter substitution may be allowed.

The grammar parser is able to parse both raw Pig scripts and substituted Pig scripts. Thus, when a Pig script comes directly from a user, we need to make sure that the new constructs are not present. This is a form of semantic analysis, which will naturally be performed by a tree parser.

The next level of parsing involves two tree parsers, including the one mentioned above. The other is used for alias dependency analysis (to be designed by Richard).

Then there will be a third level of parsing, which includes the tree parsers for default type insertion, type checking, schema alias validation, sort info insertion, etc. These actions are currently performed on the logical plan; they will now be done on the AST instead.

Finally, the AST will be visited one last time to generate a new logical plan, which will be used for optimization and conversion to a physical plan for execution.

Detailed Design

Lexer

ANTLR is a combined tool for both the lexer (tokenizer) and the parser. The two can be specified in a single file; however, for a project like Pig, it's beneficial to separate them into two files.

The lexer defines tokens using regular expressions. PigLexer is an example lexer file in the context of Pig. While the eventual content may vary, it covers the current Pig token specifications.

Use the following ANTLR command to generate Pig Lexer:

java -jar antlr-3.2.jar -o ../src/pig PigLexer.g

With the lexer file, ANTLR generates a lexer which can produce a stream of tokens from a Pig script. These tokens can be further processed by the grammar parser. The following shows how to use the generated lexer:

        // Read the Pig script and print each token the generated lexer produces.
        CharStream input = new ANTLRFileStream("script.pig");
        PigLexer lexer = new PigLexer(input);
        Token token;
        while ((token = lexer.nextToken()) != Token.EOF_TOKEN) {
            System.out.println("Token: " + token.getText());
        }

Grammar Parser

The grammar file defines a set of grammar rules that dictates the syntactic requirements. Again, PigParser is an example of the grammar rules in the Pig context. While it doesn't include all current grammar rules, it has all the skeletons.

Using a similar ANTLR command, the PigParser class is generated from the PigParser grammar and can be used to process the tokens generated by PigLexer. The following shows typical usage:

       // Tokenize the script, parse it, and retrieve the resulting AST.
       CharStream input = new ANTLRFileStream("script.pig");
       PigLexer lex = new PigLexer(input);
       CommonTokenStream tokens = new CommonTokenStream(lex);

       PigParser parser = new PigParser(tokens);
       PigParser.query_return result = parser.query();

       Tree ast = (Tree) result.getTree();
       System.out.println(ast.toStringTree());

By default, an AST is generated after PigParser parses successfully. The AST is captured in a generic tree structure, with a token as the payload at each tree node. The token contains the source code line number and the character offset within the line. These properties are mutable, meaning that the line number and offset can be manipulated if necessary. ANTLR notations such as ^ and ! are used in the grammar file to dictate what the tree looks like. The default tree generation seems sufficient for Pig's usage.
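As an illustration of the ^ and ! operators (the rule and token names here are hypothetical, not taken from the actual PigParser grammar), a rule might shape its subtree like this:

```antlr
// '^' makes LOAD the root of the subtree for this clause;
// '!' drops the trailing semicolon from the tree entirely.
load_clause : LOAD^ filename (USING func_spec)? SEMI_COLON! ;
```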

As mentioned in the overview section, the new Pig grammar will include new constructs for parameter substitution. To do that, we need to define the places where a parameter is allowed. (Today, anything can be substituted.) For instance, we may say that the input file in a load clause is allowed, so in the grammar we will allow a token ($parameter_name) in addition to a string literal for the file name in the load clause. As a result, the generated AST will contain these new constructs.
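A minimal sketch of what such a grammar alternative might look like (rule and token names are invented for illustration):

```antlr
// A load-clause file name is either a string literal
// or a parameter markup to be resolved during substitution.
filename  : QUOTED_STRING        // e.g. 'input.txt'
          | PARAMETER            // e.g. $myfile
          ;

PARAMETER : '$' ('a'..'z' | 'A'..'Z' | '_')+ ;
```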

Tree Parsers

The AST obtained above will be further processed by a few tree parsers. A tree parser is a generated visitor for the AST tree structure. Traditionally, these visitors are hand coded. With ANTLR, however, they can be generated by defining tree parser rules similar to the grammar parser rules.

One of the tree parsers that immediately processes the above AST will check whether the tree contains the above-mentioned new constructs and give errors if so. This is because the Pig script being parsed may come directly from a user; in such cases, a script containing these constructs is invalid. On the other hand, another tree parser, which is used to process embedded Pig scripts, expects these new constructs.

After such validation, a chain of tree parsers will process the AST.

DefaultDataTypeInserter Parser: the PigDefaultDataTypeInserter tree parser is given as an example of a tree parser. It does nothing but insert the default data type (bytearray) for schema fields whose data types are not specified, so that subsequent processing doesn't have to worry about unspecified data types. This is new in Pig 0.9.
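A sketch of what such a tree-grammar rewrite rule might look like (rule and token names are hypothetical, not from the actual PigDefaultDataTypeInserter grammar):

```antlr
// Tree-grammar sketch: a FIELD node that carries an explicit type
// passes through unchanged; a FIELD node without a type is rewritten
// to carry an explicit BYTEARRAY type node.
field : ^(FIELD IDENTIFIER type)
      | ^(FIELD IDENTIFIER) -> ^(FIELD IDENTIFIER BYTEARRAY)
      ;
```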

TypeCheckingTreeParser Parser: as the name suggests, the type-checking tree parser validates data types in Pig scripts to ensure type safety. Currently, this is done by the TypeCheckingVisitor class traversing a logical plan. As a tree parser, it will be generated to visit the AST instead. Also, today's type checking is overloaded with getLoadFuncSpec(), which is quite burdensome; it's to be decided whether this demands an independent tree parser. Please note that we might do type checking on the generated logical plan if that's found to be easier.

SchemaAliasValidator Parser: ensures that a schema definition contains unique field names at any given depth. Today we only check at the top level, which is a big problem. JIRA 1654 is tracking this issue.

SortInfoSetter Parser: computes whether the output is sorted or not. Presently this is done by the SortInfoSetter class; it can easily be done in a tree parser.

Finally, the LogicalPlanGenerator Parser: the most significant tree parser, which generates the new logical plan. Once the AST has been transformed to contain all the needed information, it will be visited one last time to create a LogicalPlan instance.

Error Handling

One of the main reasons for the Pig parser replacement is error handling, reporting, and recovery. Javacc doesn't do well in those aspects, so it's worthwhile to detail how these will be done with ANTLR.

Looking at the JIRAs about error handling, a majority of the complaints come from the fact that the Pig parser doesn't give any meaningful message for a syntax error. What is given instead are messages such as an internal error, a thrown exception, or an error message for a different cause. There are also issues such as syntactic inconsistency (e.g., the by-clause in different operators), inconvenience (e.g., required spaces around '='), and incompleteness (e.g., limiting an order-by expression to a field alias only).

By replacing Javacc with ANTLR, we expect not only to fix the above-mentioned syntactic inconsistency, incompleteness, and inconvenience, but also to provide better, more meaningful error messages. As a top-down parser, ANTLR is able to clearly state what is expected in case of a syntactic error. It's also able to recover from the error by skipping an unexpected token or inserting a missing token. Once it resynchronizes at a good point, it can continue parsing, so it will not stop at the first error. This default behavior can be changed so that it stops as soon as it sees a syntactic error; it's a matter of overriding two methods in the generated parser class. It may make sense for Pig to enable ANTLR error recovery to report all errors for file input and to disable such recovery for command-line script input.
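For reference, the stop-on-first-error behavior is the "bail out" pattern documented for ANTLR 3: the overrides below go in the grammar file's @members section, so the generated parser throws instead of recovering. This is a sketch, not the final Pig grammar content.

```antlr
@members {
    // Throw immediately on a mismatched token instead of attempting
    // single-token insertion/deletion recovery.
    protected void mismatch(IntStream input, int ttype, BitSet follow)
            throws RecognitionException {
        throw new MismatchedTokenException(ttype, input);
    }

    // Do not recover from a mismatched token set either; propagate the error.
    public Object recoverFromMismatchedSet(IntStream input,
            RecognitionException e, BitSet follow) throws RecognitionException {
        throw e;
    }
}

// Rethrow rather than report-and-recover in every rule's catch clause.
@rulecatch {
    catch (RecognitionException e) { throw e; }
}
```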

Besides recovering, ANTLR also allows us to customize error messages. Instead of saying "missing IDENTIFIER in FILTER_CLAUSE", we can give a more user-friendly message such as "IDENTIFIER is missing in filter clause". Customized error messages are possible by overriding the parser's getErrorMessage() method.
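A sketch of such an override, again placed in the grammar's @members section (the exact message text here is illustrative only):

```antlr
@members {
    // Rewrite ANTLR's default diagnostic into a friendlier message.
    public String getErrorMessage(RecognitionException e, String[] tokenNames) {
        String msg = super.getErrorMessage(e, tokenNames);
        if (e instanceof MissingTokenException) {
            msg = "A token is missing here: " + msg;
        }
        return msg;
    }
}
```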

It's also possible to provide our own error handlers, globally or at each grammar rule so that we have more control on what to do when an error is encountered (report, stop, recover, etc.).

With the capabilities that ANTLR provides, it's expected that Pig parsers will be better at error reporting and handling.

Fuzzy Parsing

So far we have mainly been discussing a replacement for Pig's QueryParser, which handles only Pig queries. There are two other parsers, also written using Javacc: the Pig file parser, which handles parameter substitution, and PigScriptParser (=GruntParser=), which handles FS and shell commands and passes Pig queries to QueryParser. These two parsers essentially do some fuzzy parsing, that is, parsing without knowing the complete syntax. Since we have no intention of maintaining two different parsing tools in Pig, we also need to replace these two parsers using ANTLR.

Luckily, ANTLR has a filter option in the lexer, which, if turned on, identifies the tokens of interest and ignores unmatched character sequences. Thus, we will replace these two parsers with ANTLR filter parsers.
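A sketch of a filtering lexer (the grammar and token names are hypothetical): with filter=true, ANTLR emits only the tokens that match a rule and silently skips everything else in the input, which is exactly the fuzzy-parsing behavior needed here.

```antlr
lexer grammar ParamSubstLexer;

options { filter=true; }

// Only $-prefixed parameters are tokenized; all other input
// characters are ignored by the filtering lexer.
PARAMETER : '$' ('a'..'z' | 'A'..'Z' | '_')+ ;
```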

To limit the scope of this project, we can choose to do this in subsequent releases.

Syntactic and Semantic Cleanup

As mentioned above, there are many issues with Pig script syntax and semantics. While Alan is looking at semantic cleanup, I have compiled a list of syntactic cleanups that I'd like to address in the 0.9 release. These are just proposals and by no means exhaustive; I will modify this list as the project moves forward. As backward compatibility might be an issue, it's possible that we may not be able to clean up as much as we wish.

Inconsistent Using Clause

Presently, there are two kinds of tokens that may follow the "USING" keyword: a function name such as org.apache.pig.TextLoader(), which is what the user specifies, and a string literal such as 'replicated', 'merge', and so on, which are essentially keywords. Consider differentiating the two cases with USING or WITH.

Keywords vs String Literals

Consider reserving 'replicated', 'merge', 'skewed', and so on as keywords instead of string literals. Currently Pig accepts any string literal in the syntax but throws an error if the value isn't one of these "keywords". It's better to reserve these words and catch such mistakes as syntactic errors instead.

Unnecessary Parsing of Method Names

The current Pig syntax specifies a method name, such as that of a Pig loader, as one or more identifiers concatenated with periods. In the rule action, these IDs are then joined into a string. Consider simply asking the user to specify a string literal instead. For instance, LOAD 'file.txt' USING 'org.apache.pig.TextLoader' 'arg1' 'arg2';

Function Styled Pig Script

Today, a whole Pig script can be written as a single statement, such as STORE (FOREACH (LOAD 'xxx') GENERATE $0, $1) INTO 'output.txt';. While it may seem cute, it makes Pig scripts difficult to read and makes parsing more complicated. Consider disallowing this in the grammar.

Mixed Use of Single and Double Quotes

In the Pig grammar, there are cases where both single quotes and double quotes are allowed for a string literal. To be consistent and to make the grammar simpler, consider standardizing string literals on single quotes.

Confusing Bag Schema Definition

Strictly speaking, a bag is not a new data type; it describes the occurrence of a data field and is thus orthogonal to data types. While it's restricted to tuple fields today, this property applies to any data type. The current way to specify a bag field causes some confusion. For instance, the following are all valid in the current Pig syntax:

A = LOAD 'data' AS ( B : bag { T: tuple( t1:int, t2:int, t3:int ) } );
A = LOAD 'data' AS ( B : bag { T: ( t1:int, t2:int, t3:int ) } );
A = LOAD 'data' AS ( B : { T: tuple( t1:int, t2:int, t3:int ) } );

T is meaningless here, but its existence seems to suggest that there is a field with name T of type tuple, causing more confusion.

Consider simplifying the syntax, for instance: LOAD 'data' AS ( B : TUPLE( t1:int, t2:int, t3:int )[] ); or LOAD 'data' AS ( B : TUPLE( t1:int, t2:int, t3:int ) BAG );. If this were also made legal for scalar fields, we could have LOAD 'data' AS ( a:int[], b:string[] ); or LOAD 'data' AS ( a:int BAG, b:string BAG );


Unnecessary Quotes around Parameters

In a Pig script, there can be placeholders that are substituted with real values. These parameters are specified in quotes and prefixed with '$', such as '$myfile'. This is not only inconvenient but also easily confused with string literals, for which no interpretation should be performed. Consider removing the quotes to make the rule clearer.

Testing

Comprehensive testing is needed because of the scope of the changes. We need to ensure comprehensive coverage and to test with real user queries. We will gather as many user queries as possible to exercise the new parser, making sure that the desired backward compatibility is in place.

PigNewParser (last edited 2010-10-08 22:12:49 by XuefuZhang)