B A S I L Basil is a backtracking LR(1) parser generator. Basil reads a grammar file and generates data that can be used to initialize an LR(1) parser with backtracking abilities. Basil also generates the node classes that are used by the parser to construct the abstract syntax tree. Semantic actions are integrated into the parser by using navigators. Navigators are objects that can traverse the abstract syntax tree, either bottom-up as the tree is being created, or top-down. A separate utility is provided to help write navigators. A parser built with Basil is event-driven. Tokens can be passed to the parser as they are formed. This model makes it very easy to string several parsers together. CONFIGURATION Basil accepts only the name of the grammar file on the command line. Options are set in a configuration file. The name of the configuration file is controlled by the BASIL_CONFIG_FILE environment variable. The default value is "./basil.cfg". Options are set in the configuration file using keyword-value notation, one setting per line. Whitespace must be used to separate the keyword from the value. The hash character ('#') starts a comment. All remaining characters on the line are ignored. The following table lists the keywords and their default values. ---------------------------------------------------------------------- Keyword Default Value base-node-type basl_Node base-navigator-type basl_Navigator base-token-type basl_Token header-ext h source-ext cpp use-underscores-in-names no get-location no location-type basl_Loc get-node-ptr no inline-node-ctor no inline-node-methods no node-include-filename cc_nodeheaders.h parse-tables-filename cc_tables.inc token-enum-filename cc_tokens.inc states-filename cc_states.txt lr-node-type basl_LRNode lr-node-refptr-type basl_LRNodeRefPtr lr-node-refptr-header basl_LRNodeRefPtr.h lr-node-refptr-vector-type basl_LRNodeRefPtrVector lr-node-refptr-vector-const-iter-type basl_LRNodeRefPtrVectorConstIter lr-node-refptr-vector-header basl_LRNodeRefPtrVector.h ---------------------------------------------------------------------- GRAMMAR FILE The syntax of the grammar file can best be described with its own grammar. Below is the complete grammar. The rule with the left-hand side "start" is the start rule. Upper case symbols are tokens. Comments begin with the hash ('#') character. This is the same syntax accepted by Basil. # start rule start -> file # file file -> decl-seq # declaration sequence decl-seq -> decl decl-seq -> decl-seq decl # declaration decl -> rule-decl decl -> token-decl # rule declaration rule-decl -> rule-name-opt rule-symbol ARROW rule-symbol-seq-opt follow-seq-opt # rule symbol seqence opt rule-symbol-seq-opt -> rule-symbol-seq rule-symbol-seq-opt -> # rule symbol sequence rule-symbol-seq -> rule-symbol rule-symbol-seq -> rule-symbol-seq rule-symbol # rule symbol rule-symbol -> symbol attrib-seq-opt # symbol symbol -> IDENT # attribute sequence opt attrib-seq-opt -> attrib-seq attrib-seq-opt -> # attribute sequence attrib-seq -> attrib attrib-seq -> attrib-seq attrib # attributes attrib -> GT attrib -> PLUS attrib -> GT_BANG attrib -> PLUS_BANG attrib -> STAR attrib -> LT attrib -> NUMBER # rule name opt rule-name-opt -> rule-name rule-name-opt -> # rule name rule-name -> LBRACK IDENT base-name-opt attrib-name-seq-opt RBRACK rule-name -> LBRACK LPAREN IDENT RPAREN RBRACK rule-name -> LBRACK RBRACK # base name opt base-name-opt -> base-name base-name-opt -> # base name base-name -> COLON IDENT # attrib name seq attrib-name-seq-opt -> attrib-name-seq attrib-name-seq-opt -> # attrib name seq attrib-name-seq -> attrib-name attrib-name-seq -> attrib-name-seq attrib-name # attrib name attrib-name -> IDENT # follow sequence opt follow-seq-opt -> follow-seq follow-seq-opt -> # follow seq follow-seq -> LBRACE rule-symbol-seq RBRACE # token declaration token-decl -> LBRACK IDENT RBRACK token-name-seq # token name seq token-name-seq -> token-name token-name-seq -> token-name-seq token-name # token name token-name -> IDENT The tokens used in the grammar are listed in the table below. ------------------------------------------------------------ ARROW -> IDENT identifier (may contain '-') GT > GT_BANG >! PLUS + PLUS_BANG +! LT < STAR * NUMBER unsigned integer LBRACK [ RBRACK ] LPAREN ( RPAREN ) LBRACE { RBRACE } COLON : ------------------------------------------------------------ The grammar file must contain at least one start rule. As in the grammar above, the rule with the left-hand side "start" indicates a start rule. The newline character is used to terminate comments. In all other contexts the new newline character is treated as whitespace. Tokens must be in upper case. RULE NAME A rule can optionally have a name. There are three forms of a rule name: rule-name -> LBRACK IDENT base-name-opt attrib-name-seq-opt RBRACK rule-name -> LBRACK LPAREN IDENT RPAREN RBRACK rule-name -> LBRACK RBRACK The first form defines a node class. The identifier names the class. The parser will construct a new instance of the class when the rule is reduced. The instance is used to represent the rule in the abstract syntax tree. By default, the class is derived from the base node class. The value associated with the configuration keyword base-node-type names the base node class. All nodes must derive, directly or indirectly, from this class, including tokens. However, instead of deriving directly from the base node class, you can specify another class by giving a base-name. Also, you can specify additional classes to derive from by specifying an attrib-name-seq. Typically, these classes contain attributes. The second form of a rule name can be used to associate a number of rules with the same node. This can prevent writing duplicate code. The node must be defined once, as above. The other rules can refer to the node by using this form of a rule name. The last form of a rule name only has effect on rules of the form "a -> b". The rule name forces a reduction on the rule, just as if the rule had an associated node. This is needed because Basil always encodes the follow symbol as part of a reduction, so reductions of this form are normally bypassed. When a reduction is forced, the child node is not lost, it remains on the stack. If a rule has more than one symbol on its right-hand side and it does not have an associated node, the child nodes will be lost when the rule is reduced. This is how the tree can be pruned. Nodes are reference counted so memory leaks will not occur. EXPLICIT FOLLOW SEQUENCE You can give a rule an explicit follow sequence using the syntax: follow-seq -> LBRACE rule-symbol-seq RBRACE Any rule can be given a follow sequence but it is normally only used on start rules to specify a different end token (or tokens). By default, EOT is the end token. Both tokens and nonterminals can be given. If a nonterminal is given, all the tokens in its first set are included. If any nonterminal symbol in the sequence can go to null, the computed follow set is also included. Symbols in this set can have attributes. SYMBOL ATTRIBUTES Basil has a number of symbol attributes, such as GT, PLUS, GT_BANG, and PLUS_BANG, that are used to resolve conflicts. All conflicts must be resolved. That is, you must indicate an order for the parser to try the actions on a conflict. This is done by specifying priorities. On a conflict, the action that leads to parsing the token with the highest priority is tried first. If the action results later in a syntax error (possibly due to a semantic check), the parser will bactrack and try the action that leads to parsing the token with the second highest priority, and so on. For example, consider the shift/reduce conflict in the rules below. As described in more detail below, the GT attribute gives X in the first rule a priority of one. Because it has a higher priority than the X in the second rule, the parser will reduce first. 1) a -> b X > 2) b -> X 3) b -> If one of the tokens involved in a conflict has exclusive priority, only the action that leads to parsing it will be tried. In other words, the parser will not guess. If more than one token involved in a conflict has exclusive priority, the one with the highest priority is picked. SHIFT PRIORITY The attribute GT gives a symbol shift priority. One or more GT attributes can be given to a symbol. Basically, it means you want to shift the symbol. If given to a token, the token is simply given priority. However, if given to a nonterminal, all tokens derived from it are given priority. To see how this works, consider the following partially parsed rule in some state. The symbol Y here can either be a token or a nonterminal. X -> ... .Y ... Let "..." mean a sequence of zero or more symbols. The dot in this rule is just before the symbol Y. Suppose we define the following variables. o Let A be the shift priority of the rule. A is computed as described here. o Let B be the shift priority of X. o Let C be the shift priority of Y. o Let T = A + B + C. If Y is a token, then Y in this context is given a priority of T. If Y is a nonterminal, then the shift priority of every nonkernel rule in the state that has the left-hand side Y is set to T, if T is greater. This process continues over all rules in the state until the shift priorities are distributed. The shift priority of the rule remains with the rule as it is parsed (as the dot moves to the right). So, for example, the token C below has a total priority of three in the initial state. 0) start -> a 1) a -> b > 2) b -> c > 3) c > -> C If the grammar is circular, the shift priorities are correctly distributed. EXCLUSIVE SHIFT PRIORITY The attribute GT_BANG gives a symbol exclusive shift priority. Exclusive shift priority is a flag. Otherwise, it behaves like the GT attribute except it gives exclusive priority to a token. REDUCE PRIORITY The attribute PLUS gives a symbol reduce priority. One or more PLUS attributes can be given to a symbol. It has effect only on nonterminals. Basically, it means you want to reduce the symbol. It gives priority to all tokens following it. To see how this works, consider the the following partially parsed rule in some state. The symbol Y here is a nonterminal. X -> ... .Y Z1 Z2 ... ZN Again, the dot in this rule is just before the symbol Y. N symbols follow Y: Z1 through ZN. Suppose we define the following variables. o Let A be the shift priority of the rule. o Let B be the shift priority of X. o Let C be the reduce priority of Y. o Let T = A + B + C. o Let D be the reduce priority of X. o Let S = C + D. o Let FIRST(Z) be the set of tokens that begin the strings derived from symbol Z. Each token in this set has an associated priority. o Let FOLLOW(X) be the set of tokens that can appear to the right of the rule in this context. Each token in this set has an associated priority and a nonterminal symbol called the shortcut. FOLLOW(X) is computed as described here. o Let FOLLOW(Y) be the set of tokens that can appear to the right of Y in this context. Each token in this set has an associated priority and a nonterminal symbol called the shortcut. FOLLOW(Y) is computed as described here. So then, FOLLOW(Y) is computed as follows: For i = 1 to N: For each token in FIRST(Zi): Let U be the associated priority of the token. Let V = U + T. Add token with reduce priority V and shortcut Y to FOLLOW(Y). Continue only if Zi can go to null. If N is 0, or if all Z1 through ZN can go to null: If the rule is "X -> Y" and it does not have a rule name: Add FOLLOW(X) to FOLLOW(Y), adding S to each priority in FOLLOW(X). Otherwise: Add FOLLOW(X) to FOLLOW(Y), adding S to each priority in FOLLOW(X), and setting Y as the shortcut. FOLLOW(Y) is then merged with the FOLLOW set of every nonkernel rule that has the left-hand side Y. This procedure continues over all rules in the state until nothing can be added to any FOLLOW set. The FOLLOW set remains with the rule as it is parsed (as the dot moves to the right). The reduce actions in a state are formed from the FOLLOW sets of the rules with the dot after the last symbol on the right-hand side. For each token in such a FOLLOW set, a reduce action on the token is formed. The shortcut of the token becomes the follow symbol of the reduce action. The priority of the reduce action is the priority of the token plus the reduce priority of the left-hand side of the rule. So, here is an example: 0) start -> a + d >> 1) a -> X >> 2) a -> 3) d -> X In the initial state, X will have a shift priority of two and a reduce priority of three, so the parser will reduce first. EXCLUSIVE REDUCE PRIORITY The attribute PLUS_BANG gives a symbol exclusive reduce priority. Exclusive reduce priority is a flag. Otherwise, it behaves like the PLUS attribute except it gives exclusive priority to a token. ACCEPT The attribute STAR is used to indicate an accepting symbol. It has effect only on nonterminals. When the parser follows an accepting symbol all pending parses from the start of the symbol are canceled. STICKY FOLLOW The attribute LT is used to keep a follow set on a reduction. It has effect only on nonterminal symbols. Like Yacc, the most common reduction in a state is performed by default. This technique greatly reduces the size of the parse tables. The LT attribute overrides this behavior. When the attribute LT is used on a symbol, the symbol will be followed only if a valid token is next, never by default. To see how this works, take a look how rule-decl is actually written. # rule declaration rule-decl <* -> rule-name-opt rule-symbol ARROW rule-symbol-seq-opt > follow-seq-opt rule-symbol-seq-opt is given shift priority so that the parser first tries any rule-symbols after the ARROW as the right-hand side of the rule. This is necessary because here rule-decl is an accepting symbol, so when followed, or when the rule is reduced, all pending parses from the start of the symbol will be canceled. rule-decl is given the LT attribute as well so that the rule will be reduced only when a valid token is next. This will prevent the parser from including in the right-hand rule-symbol set the left-hand rule-symbol of the following rule. LEXICAL STATE The final attribute that can be given to a symbol is a number. This number represents the lexical state of the symbol. It is used to provide context dependent lexing. The lexer can query the parser for the current lexical state before it lexes the next token. Every parsing state has a corresponding lexical state. By default, the lexical state is 0. Below is an example. 1) start -> a b 2) a -> X 3) b -> Y 1 4) b -> Z Here, after the parser shifts the X, the lexical state will be 1. Basil will produce an error if there is a lexical state conflict. For example, lexical states 1 and 2 conflict in the grammar below. 1) start -> a b 2) a -> X 3) b -> Y 1 4) b -> Z 2 A lexical state can be given to nonterminals and tokens in an explicit follow set. TOKEN DECLARATION The configuration keyword base-token-type must name the base type of your tokens. By default, Basil will assume your tokens are of this type. However, a token declaration can be used to specify the exact type of one or more tokens. The tokens types are needed in order to generate the node classes. Below is an example token declaration. [XToken] X Y Z Here, tokens X, Y and Z are given the type XToken. A token declaration can also be used to introduce additional tokens. NODES Basil generates a node class for every node that is defined in the grammar file. One header file and one source file is generated per class. Consider the following grammar that accepts qualified type names. # start rule start -> type-name # type name [basl_TypeNameNode] type-name -> nested-name-spec-opt IDENT # nested name specifier opt nested-name-spec-opt -> nested-name-spec-opt -> nested-name-spec # nested name specifier [basl_NestedNameSpec1Node : basl_NestedNameSpecNode] nested-name-spec -> DCOLON [basl_NestedNameSpec2Node : basl_NestedNameSpecNode] nested-name-spec -> nested-name-spec-opt IDENT DCOLON Basil will generate four node classes from this grammar. The files it will generate are: basl_Node.h basl_Node.cpp basl_TypeNameNode.h basl_TypeNameNode.cpp basl_NestedNameSpec1Node.h basl_NestedNameSpec1Node.cpp basl_NestedNameSpec2Node.h basl_NestedNameSpec2Node.cpp Notice that the base node class is also generated. Here, it is named basl_Node, the default name. The header and source file extensions can be changed in the configuration file. The class basl_NestedNameSpecNode is not generated because Basil simply does not know anything about it. Most likely, basl_NestedNameSpecNode would contain one or more attributes or derive from one or more attribute classes. The generated file basl_TypeNameNode.h is shown below. ------------------------------------------------------------ // basl_TypeNameNode.h // // type-name -> nested-name-spec-opt IDENT // #ifndef basl_TypeNameNode_h #define basl_TypeNameNode_h #include "basl_LRNodeRefPtr.h" #include "basl_LRNodeRefPtrVector.h" #include "basl_NestedNameSpecNode.h" #include "basl_Node.h" #include "basl_Token.h" // // basl_TypeNameNode // class basl_TypeNameNode : public basl_Node { // methods public: // constructor basl_TypeNameNode (basl_LRNodeRefPtrVectorConstIter children); // destructor ~ basl_TypeNameNode (); // true if nested-name-spec-opt is not null bool hasNestedNameSpecOpt () const; // get nested-name-spec-opt basl_NestedNameSpecNode & getNestedNameSpecOpt () const; // get IDENT basl_Token & getIDENT () const; // children private: // nested-name-spec-opt basl_LRNodeRefPtr m_nestedNameSpecOpt; // IDENT basl_LRNodeRefPtr m_IDENT; // methods public: // navigate void navigate (basl_Navigator const & navigator); }; #endif ------------------------------------------------------------ This file includes the base node header file, the child node header files, and two additional files containing typedefs. The child node header files are included only if Basil can determine the actual node types of its children. Here, the first child node, nested-name-spec-opt, if not null, could be one of two types. However, both types derive from a common base. The two additional files are: o basl_LRNodeRefPtr.h. Must contain the typedef: basl_LRNodeRefPtr, a reference counted pointer to an basl_LRNode. basl_LRNode is the real base class of all nodes. o basl_LRNodeRefPtrVector.h. Must contain the two typedefs: basl_LRNodeRefPtrVector, a vector of basl_LRNodeRefPtr. basl_LRNodeRefPtrVectorConstIter, the const iterator of the vector. The filenames and typedef names can be changed in the configuration file. The rest of the header file is self explanatory. The last function, navigate, is part of the navigator mechanism. Navigators are described next. NAVIGATORS A navigator traverses the abstract syntax tree using the double dispatch mechanism. Navigators all derive from a base navigator class. The base navigator class is named basl_Navigator by default. Basil generates this class. The name can be changed in the configuration file. In the base node class, a virtual function called navigate initiates the double dispatch mechanism. navigate takes one argument, the navigator. Every generated leaf node class overrides navigate. In the body of navigate, each class calls an overloaded virtual function of the navigator passing itself (*this) as the argument. Tokens must not override navigate. Since writing a navigator by hand would be onerous, a separate utility called Navgen can be used to generate the navigator class from a specification language. Navgen generates a static interface function of the navigator called submit that creates an instance of the navigator then passes it to the navigate function of a specified node. submit can take additional arguments and return a value. Navgen is described in a separate document. A navigator can optionally be slippery. Top-down navigators tend to be slippery, while bottom-up navigators do not. By default, if a navigator is slippery, it will slip down certain kinds of nodes in the abstract syntax tree. Nodes like the following: a -> b a -> a SOME_TOKEN b a -> a b Typically, on nodes like these, the only action would be to send the navigator down to the child (nonterminal) nodes, from left to right. A slippery navigator does this by default. The default action is implemented in the base navigator class. The slippery flag is passed to the constructor of the base navigator. PARSER DATA FILES Basil generates the following data files. The filenames can be changed in the configuration file. o cc_nodeheaders.h This file contains include preprocessor directives for all the generated node header files. It is included in cc_tables.inc and in the base navigator source file. o cc_tables.inc This file contains the parser tables. This file must be included in your source file that creates the one instance of your basl_LRParserData object (see below). o cc_tokens.inc This file contains an enum enumerating all token kinds. Typically, this file is included inside your base token class, or inside a token kind struct. o cc_states.txt This file is a human-readable listing of the backtracking LR(1) parse states. CONSTRUCTING A PARSER Basil generates only the data for your parser. You must build the actual parser. You can create either an LR(1) parser or an LR(k) parser, depending upon your grammar. If your grammar is LR(1), an LR(1) parser will be sufficient. Whatever kind of parser you write, you will need to initialize your parser with a pointer to a basl_LRParserData object. It is best to write a function that returns a pointer to this object, since you might have several parsers working from the same data. Your parsers then can simply call this function in their initialization. Typically, you would write two files: "basl_GetParserData.h" and "basl_GetParserData.cpp". Your header file might look like this: ---------------------------------------------------------------------- // basl_GetParserData.h // #ifndef basl_GetParserData_h #define basl_GetParserData_h // return pointer to parser data extern class basl_LRParserData * getParserData (); // parser start states #define MAIN_START_STATE 0 // other start states // ... #endif ---------------------------------------------------------------------- The header file is also a good place to list your start states. State 0 corresponds to your first start rule, state 1 corresponds to your second start rule, and so on. Your source file might look like this: ---------------------------------------------------------------------- // GetParserData.cpp // // includes #include "basl_GetParserData.h" // header file from Basil installation #include "basl_LRParserData.h" // basil generated parser tables (must be included after // basl_LRParserData.h) #include "cc_tables.inc" // // get parser data // basl_LRParserData * getParserData () { // initialize the one and only basl_LRParserData object static LRParserData data (state_table, goto_table, lex_state_table, rule_size_table, rule_table, token_table, get_node); // return pointer to object return & data; } ---------------------------------------------------------------------- Notice that data is a static object of getParserData. This ensures that there is only one object, and calling the function is the only way to get a pointer it. The objects used to initialize data are defined in cc_tables.inc. The objects are all static const data. Since an LR(1) parser is easer to write, it is described first. Both kinds of parsers will be written as classes. As classes, you can have any number of parsers active at once. LR(1) PARSER An LR(1) parser is built from a basl_CoreLR1Parser. Its interface is described here. o basl_CoreLR1Parser::basl_CoreLR1Parser (basl_LRParserData * data, int start_state) Constructor. A basl_CoreLR1Parser must be initialized with a pointer to the basl_LRParserData object and a start state. o int basl_CoreLR1Parser::lookupNextCmd (int kind) This function returns the next parse command given the kind of the next token. The command returned will be one of the commands defined in the file basl_LRCommand.h. The commands are: LR_REDUCE LR_SHIFT LR_FAIL LR_DONE o basl_LRNodeRefPtr basl_CoreLR1Parser::reduce () This function instructs the core parser to reduce. It returns a reference counted pointer to the node created. If no node is created, a null pointer will be returned. This function should be called only if lookupNextCmd returns LR_REDUCE. After this function is called, lookupNextCmd should be called again. Eventually, lookupNextCmd will return either LR_SHIFT, LR_DONE or LR_FAIL. o void basl_CoreLR1Parser::shift (int kind, basl_LRNodeRefPtr const & token) This function instructs the core parser to shift the specified token. The token kind must also be specified. This function should be called only if lookupNextCmd returns LR_SHIFT. o basl_LRNodeRefPtr basl_CoreLR1Parser::done () This function returns a reference counted pointer to the root node of the abstract syntax tree. If the entire tree has been pruned, a null pointer will be returned. This function should be called only if lookupNextCmd returns LR_DONE. o int basl_CoreLR1Parser::getLexState () const This function returns the current lexical state. The result will always be an integer greater than or equal to zero. Zero is the default lexical state. o void basl_CoreLR1Parser::restart (int start_state) This function restarts the parser at the specified start state. Suppose basl_LR1Parser is the name of your parser. Your header file might look something like this: ---------------------------------------------------------------------- // basl_LR1Parser.h // #ifndef basl_LR1Parser_h #define basl_LR1Parser_h // includes #include "basl_CoreLR1Parser.h" // // basl_LR1Parser // class basl_LR1Parser { // methods public: // constructor basl_LR1Parser (); // parse token void parse (basl_LRNodeRefPtr const & token); // get lex state int getLexState () const; // data private: // core LR(1) parser basl_CoreLR1Parser m_core_parser; }; #endif ---------------------------------------------------------------------- The function parse parses the token passed to it. It then returns. basl_LRNodeRefPtr is a reference counted pointer to a basl_LRNode. The core parser makes no assumptions about your nodes or tokens, except of course that they are basl_LRNodes. The function getLexState returns the current lexical state. You only need this function if your lexer is context dependent. Your source file might look something like this: ---------------------------------------------------------------------- // basl_LR1Parser.cpp // // header file #include "basl_LR1Parser.h" // this file declares the getParserData function and start states #include "basl_GetParserData.h" // this file declares the LR parse commands #include "basl_LRCommand.h" // // constructor // basl_LR1Parser::basl_LR1Parser () : m_core_parser (getParserData (), MAIN_START_STATE) { } // // parse // void basl_LR1Parser::parse (basl_LRNodeRefPtr const & token) { // token kind, assume token_cast downcasts to token, and // member function getKind returns kind of token int kind = token_cast (token).getKind (); // reduction loop int code; while ((code = m_core_parser.lookupNextCmd (kind)) == LR_REDUCE) { // get node on reduction bals_LRNodeRefPtr node = m_core_parser.reduce (); // perform semantic actions on node using navigator, navigator returns // false if semantic actions should cause syntax error if (node.isSet () && ! BottomUpNavigator::submit (static_cast (* node))) { // fail parser code = LR_FAIL; break; } } // check if failed if (code == LR_FAIL) { // syntax error, signal error and throw exception or return after // setting error code // ... } // check shift if (code == LR_SHIFT) { // shift token m_core_parser.shift (kind, token); } // otherwise done else { // get root node of abstract syntax tree basl_LRNodeRefPtr node = m_core_parser.done (); // perform semantic actions top-down on tree using navigator if (node.isSet ()) { // use top down navigator TopDownNavigator::submit (static_cast (* node)); } } } // // get lex state // int basl_LR1Parser::getLexState () const { // return lex state from core parser return m_core_parser.getLexState (); } ---------------------------------------------------------------------- As you can see, the core parser does the real work. Your parser is just a wrapper around it. Navigators are used to perform semantic actions. In this example, semantic actions are performed bottom-up as the abstract syntax tree is being created, as well as top-down from the root of the tree. Only one strategy is usually needed. Note also that your parser must signal syntax errors. The core parser does not signal any errors. LR(K) PARSER An LR(k) parser is built from a basl_CoreLRkParser. The core parser does not perform the backtracking, instead your parser must do it. This makes the parser slightly more compilicated. The interface to basl_CoreLRkParser is described here. o basl_CoreLRkParser::basl_CoreLRkParser (LRParserData * data, int start_state) Constructor. A basl_CoreLRkParser must be initialized with a pointer to the basl_LRParserData object and a start state. o void basl_CoreLRkParser::lookupCmdList (int kind) This function looks up the list of commands that should be performed for the next token. The token kind must be passed to the function. Afterwards, the function getNextCmd should be called to get the next command. o int basl_CoreLRkParser::getNextCmd (bool & guess) This function returns the next parse command. The command will be one of the following: LR_REDUCE LR_ACCEPT LR_DONE LR_SHIFT LR_FAIL These commands are defined in the file basl_LRCommand.h. If there are remaining commands, guess will be set to true. o basl_LRNodeRefPtr basl_CoreLRkParser::reduce (int & num_cancel) This function instructs the core parser to reduce. It returns a reference counted pointer to the node created. If no node is created, a null pointer will be returned. This function should be called only if getNextCmd returns LR_REDUCE or LR_ACCEPT. If the command is LR_ACCEPT, num_cancel will be the number of pending trial parsers to cancel. o void basl_CoreLRkParser::shift (int kind, basl_LRNodeRefPtr const & token) This function instructs the core parser to shift the specified token. The token kind must also be specified. This function should be called only if getNextCmd returns LR_SHIFT. o basl_LRNodeRefPtr basl_CoreLRkParser::done () This function returns a reference counted pointer to the root node of the abstract syntax tree. If the entire tree has been pruned, a null pointer will be returned. This function should be called only if getNextCmd returns LR_DONE. o int basl_CoreLRkParser::getLexState () const This function returns the current lexical state. The result will always be an integer greater than or equal to zero. Zero is the default lexical state. In a backtracking parser, the current lexical state is the lexical state of the current trial parser. o void basl_CoreLRkParser::restart (int start_state) This function restarts the parser at the specified start state. Suppose basl_LRkParser is the name of your parser. Your header file might look something like this: ---------------------------------------------------------------------- // LRkParser.h // #ifndef basl_LRkParser_h #define basl_LRkParser_h // includes #include "basl_CoreLRkParser.h" // this file defines a platform independent Deque macro (deque or std::deque) #include "basl_Deque.h" // this file declares the typedef basl_LRNodeRefPtrDeque (a deque of ref counted node ptrs) #include "basl_LRNodeRefPtrDeque.h" // // basl_LRkParser // class basl_LRkParser { // methods public: // constructor basl_LRkParser (); // parse token void parse (basl_LRNodeRefPtr const & token); // get lex state int getLexState () const; // private types private: // trial parser struct TrialParser { // core LR(k) parser basl_CoreLRkParser core_parser; // current token basl_LRNodeRefPtr token; // index in queue of the next token int index; // constructor TrialParser (); }; // trial parser deque typedef basl_Deque TrialParserDeque; // data private: // token queue basl_LRNodeRefPtrDeque m_token_queue; // current trial parser TrialParser m_parser; // pending trial parsers TrialParserDeque m_pending_parser_set; }; #endif ---------------------------------------------------------------------- As you can see, an LR(k) must manage a set of pending trial parsers, as well as the current trial parser. Each trial parser can be at a different location in the source, which is why a token queue is needed. basl_CoreLRkParser is a member of TrialParser, as it must be copied as part of a trial parser. TrialParser contains other data that is specific to the trial parse, like the current token and the index in the token queue of the next token. Any semantic data that is associated with a trial parse should also be included in this struct. The semantic data can be passed to your navigators. For example, in a C++ parser, you would most likely include a scope stack (because in trial mode the scope stack needs to be modified). The source file of your LR(k) parser might look like this: ---------------------------------------------------------------------- // basl_LRkParser.cpp // // includes #include "basl_LRkParser.h" // this file declares the getParserData function and defines the start states #include "basl_GetParserData.h" // this file defines the LR parse commands #include "basl_LRCommand.h" // // trial parser constructor // basl_LRkParser::TrialParser::TrialParser () : core_parser (getParserData (), MAIN_START_STATE), index (0) { } // // LR(k) parser constructor // basl_LRkParser::basl_LRkParser () { } // // parse token // void parse (basl_LRNodeRefPtr const & token) { // push token on queue m_token_queue.push_back (token); // loop until the current trial parser needs the next token while (m_parser.index < m_token_queue.size ()) { // get next token m_parser.token = m_token_queue [m_parser.index ++]; // command code int code; // reduce loop do { // lookup command list for token, assume token_cast downcasts to token and // getKind returns token kind m_parser.core_parser.lookupCmdList (token_cast (m_parser.token).getKind ()); // backtracking loop for (;;) { // get next command bool guess; code = m_parser.core_parser.getNextCmd (guess); if (code != LR_FAIL) { // if guess is true, copy to pending parser set if (guess) { // add to front of queue m_pending_parser_set.push_front (m_parser); } // check if reduction if (code == LR_REDUCE || code == LR_ACCEPT) { // get node and number of trial parsers to cancel int num_cancel; basl_LRNodeRefPtr node = m_parser.core_parser.reduce (num_cancel); // perform semantic actions on node, assume navigator returns // false if semantic actions should cause parser to fail. don't // fail if no node created if (! node.isSet () || BottomUpNavigator::submit (static_cast (* node))) { // cancel pending trial parsers if accept if (code == LR_ACCEPT) { // erase from front of queue m_pending_parser_set.erase (m_pending_parser_set.begin (), m_pending_parser_set.begin () + num_cancel); // if no pending parsers, token queue can be erased upto index if (m_pending_parser_set.empty ()) { // erase from front m_token_queue.erase (m_token_queue.begin (), m_token_queue.begin () + m_parser.index); // next token is at the beginning of the queue m_parser.index = 0; } } // break out of backtracking loop break; } // semantic actions caused parser to backtrack } // otherwise break out of backtracking loop else { break; } } // backtrack if possible if (m_pending_parser_set.empty ()) { // syntax error, signal error and throw exception or return after // setting error code // ... } // copy trial parser from front of queue m_parser = m_pending_parser_set.front (); m_pending_parser_set.pop_front (); // end of backtracking loop } // end of reduce loop } while (code == LR_REDUCE || code == LR_ACCEPT); // check if shift command if (code == LR_SHIFT) { // shift token m_parser.core_parser.shift (token_cast (m_parser.token).getKind (), m_parser.token); } // otherwise parser is done else { // get root node basl_LRNodeRefPtr node = m_parser.core_parser.done (); // perform semantic actions on tree if (node.isSet ()) { // use top down navigator TopDownNavigator::submit (static_cast (* node)); } } } } // // get lex state // int basl_LRkParser::getLexState () const { // return lex state from current trial parser (tokens are only lexed once) return m_parser.core_parser::getLexState (); } ---------------------------------------------------------------------- Well, that's it. See the examples for more detail. To see the rules as they are reduced, define PARSER_DEBUG when you compile the core parser.