Write a recursive descent parser generator

In theory, parser generators are great tools. Yet when I have to write a parser, I now tend to steer clear of them and write one manually.


Details about the "incremental" mode are listed in section 9 of the documentation PDF[0]. The parser does not have access to the lexer.

Instead, when the parser needs the next token, it stops and returns its current state to the user. The user is then responsible for obtaining this token, typically by invoking the lexer, and for resuming the parser from that state. Assuming that semantic values are immutable, a parser state is a persistent data structure: the parser can be restarted in the middle of the buffer whenever the user edits a character.
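As a rough illustration of this inverted control flow, here is a minimal Python sketch. It is not Menhir's API: the `State` class, the `offer` function and the toy grammar (integers separated by `+`) are all invented for this example. The point is only that each call returns a fresh immutable state whose stack shares its tail with the previous one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical immutable parser state (not a Menhir type)."""
    stack: Optional[Tuple] = None  # linked list: (token, rest_of_stack)
    expect_number: bool = True     # what the parser needs next
    total: int = 0                 # semantic value computed so far


def offer(state: State, token: str) -> State:
    """Feed one token and return a NEW state; the old one stays valid."""
    if state.expect_number:
        if not token.isdecimal():
            raise SyntaxError(f"expected a number, got {token!r}")
        return State((token, state.stack), False, state.total + int(token))
    if token != "+":
        raise SyntaxError(f"expected '+', got {token!r}")
    return State((token, state.stack), True, state.total)


def parse(tokens):
    """The user, not the parser, drives the loop and fetches tokens."""
    state = State()
    for token in tokens:  # stands in for repeated calls to a lexer
        state = offer(state, token)
    if state.expect_number:
        raise SyntaxError("unexpected end of input")
    return state.total
```

Because `offer` never mutates its argument, any earlier state can be kept around and resumed later with a different token.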

Because two successive parser states share most of their data in memory, a list of n successive parser states occupies only O(n) space in memory.

I can try to explain a bit. My goal was Merlin https: Also, as of today, only incrementality and error message generation are part of the upstream version of Menhir, but the rest should come soon.


Incrementality, part I

The notion of incrementality that comes built in with Menhir is slightly weaker than what you are looking for. With Menhir, the parser state is reified and control of the parsing is given back to the user. The important point here is the departure from a Bison-like interface.

The user of the parser is handed a pure abstract object that represents the state of the parsing. In regular parsing, this means we can store a snapshot of the parsing for each token, and resume from the first token that has changed, effectively sharing the prefix.
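The snapshot-per-token idea can be sketched as follows. This is a toy, not Menhir: the `step` function (a bracket-nesting checker whose state is an immutable tuple) and the `SnapshotParser` class are hypothetical names chosen for the example. The `steps` counter is only there to show how little work a re-parse does after an edit.

```python
def step(state, token):
    """Pure transition; state is an immutable tuple of open brackets."""
    closers = {")": "(", "]": "["}
    if token in ("(", "["):
        return state + (token,)
    if token in closers:
        if not state or state[-1] != closers[token]:
            raise SyntaxError(f"unmatched {token!r}")
        return state[:-1]
    return state  # any other token leaves the nesting unchanged


class SnapshotParser:
    """Keeps one state snapshot per token so edits can resume mid-buffer."""

    def __init__(self):
        self.tokens = []
        self.snapshots = [()]  # snapshots[i] is the state before token i
        self.steps = 0         # number of transitions actually executed

    def reparse(self, new_tokens):
        # Length of the prefix shared with the previous token stream.
        k = 0
        while (k < len(self.tokens) and k < len(new_tokens)
               and self.tokens[k] == new_tokens[k]):
            k += 1
        # Drop everything after the shared prefix, then resume from there.
        del self.tokens[k:]
        del self.snapshots[k + 1:]
        state = self.snapshots[k]
        for token in new_tokens[k:]:
            state = step(state, token)
            self.tokens.append(token)
            self.snapshots.append(state)
            self.steps += 1
        return state
```

For instance, parsing `(a[b])` takes six transitions, and re-parsing after inserting a `c` before the final `)` takes only two more, since the first five snapshots are reused.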

But on the side, we can also run arbitrary analysis on a parser (for error message generation, recovery, syntactic completion, or more incrementality).

Incrementality, part II

Sharing the prefix was good enough for our use case (parsing is not a bottleneck in the pipeline).

But it turns out that a trivial extension to the parser can also solve your case. Using the token stream and the LR automaton, you can structure the tokens as a tree. In a later parse, whenever you identify a known (state number, prefix) pair, you can short-circuit the parser and directly reuse the subtree of the previous parse.

If you were to write the parser by hand, this is simply memoization done on the parsing function (which is defunctionalized to a state number by the parser generator) and the prefix of the token stream that is consumed by a call.
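For a handwritten parser, that memoization might look like the sketch below. Everything here is an assumption made for illustration: the toy grammar (atoms and parenthesized groups), the `MemoParser` class, and the naive lookup that tries every cached length. The cache is keyed on the tokens a previous call to the group rule consumed, so a later parse that meets the same tokens returns the very same subtree object.

```python
class MemoParser:
    """Toy parser for: item := ATOM | "(" item* ")". Groups are memoized."""

    def __init__(self):
        self.cache = {}       # consumed token tuple -> subtree of an old parse
        self.lengths = set()  # lengths of cached keys, for the naive lookup
        self.hits = 0

    def parse(self, tokens):
        tree, pos = self._item(tokens, 0)
        if pos != len(tokens):
            raise SyntaxError("trailing tokens")
        return tree

    def _item(self, tokens, pos):
        if pos >= len(tokens) or tokens[pos] == ")":
            raise SyntaxError("expected an atom or '('")
        if tokens[pos] != "(":
            return tokens[pos], pos + 1  # an atom is its own tree
        # Short-circuit: reuse a subtree if these tokens were seen before.
        for n in sorted(self.lengths):
            key = tuple(tokens[pos:pos + n])
            if key in self.cache:
                self.hits += 1
                return self.cache[key], pos + len(key)
        # Otherwise parse the group normally and remember the result.
        children, i = [], pos + 1
        while i < len(tokens) and tokens[i] != ")":
            child, i = self._item(tokens, i)
            children.append(child)
        if i >= len(tokens):
            raise SyntaxError("missing ')'")
        tree = tuple(children)
        key = tuple(tokens[pos:i + 1])
        self.cache[key] = tree
        self.lengths.add(len(key))
        return tree, i + 1
```

This sketch never evicts cache entries; when to forget older parses is a separate policy decision.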

In your handwritten parser, reusing the objects from the previous parsetree amounts to memoizing a single run and forgetting older parses. Here you are free to choose the strategy. So with parts I and II, you get sharing of subtrees for free. Indeed, absolutely no work from the grammar writer has been required so far. A last kind of sharing you might want is sharing the spine of the tree by mutating older objects.


It is surely possible but tricky, and I haven't investigated that at all.

Error messages

The error message generation is part of the released Menhir version. It is described in the manual and papers by F. I might be biased, but contrary to popular opinion, I think that LR grammars are well suited to error message generation.

The prefix property guarantees that the token pointed out by the parser is relevant to the error. The property means that there exist valid parses beginning with the prefix before this token. This is a property that doesn't hold for most backtracking parsers afaik, e. Knowledge of the automaton and grammar at compile time allows precise work on error messages and separation of concerns. This is not completely free, however: sometimes the grammar needs to be reengineered to carry the relevant information.

But you would have to do that anyway with a handwritten parser, and here the parser generator can help you. If you have such a parser generator, of course: Menhir is the most advanced solution I know for that, and the UX is not very polished (still better than Bison).

Until version of Parse::RecDescent, parser modules built with Precompile were dependent on Parse::RecDescent.

Future Parse::RecDescent releases with different internal implementations would break pre-existing precompiled parsers.

Recursive descent is the simplest way to build a parser, and doesn’t require using complex parser generator tools like Yacc, Bison or ANTLR. All you need is straightforward hand-written code.

Don’t be fooled by its simplicity, though. Back when I tried to learn how to write a recursive descent parser, the examples I found either ignored correct expression parsing or wrote an additional parse method for each precedence level.
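One common way to avoid a separate parse method per precedence level is precedence climbing, where a single recursive function takes the minimum precedence it is allowed to consume. The sketch below is an illustration under stated assumptions, not the approach of any particular source quoted here: the operator table `BINARY`, the `Parser` class and the choice to evaluate directly rather than build a tree are all made up for the example.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/^()])")

# Hypothetical operator table: symbol -> (precedence, associativity, function)
BINARY = {
    "+": (1, "left", operator.add),
    "-": (1, "left", operator.sub),
    "*": (2, "left", operator.mul),
    "/": (2, "left", operator.truediv),
    "^": (3, "right", operator.pow),
}


def tokenize(text):
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"bad character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expression(self, min_prec=1):
        """One method handles every binary precedence level."""
        left = self.primary()
        while True:
            op = self.peek()
            if op not in BINARY or BINARY[op][0] < min_prec:
                return left
            prec, assoc, apply = BINARY[op]
            self.advance()
            # Left-associative operators require a strictly tighter right side.
            right = self.expression(prec + 1 if assoc == "left" else prec)
            left = apply(left, right)

    def primary(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.expression()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdecimal():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

Adding a new binary operator then means adding one row to the table rather than writing another method.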

Writing a parser by hand seemed just too much work.

The parsing method that we will use is called recursive descent parsing. It is not the only possible parsing method, or the most efficient, but it is the one most suited for writing compilers by hand (rather than with the help of so-called "parser generator" programs).

So there's really nothing stopping you from implementing a lexer as a recursive descent parser or using a parser generator to write a lexer. It's just not usually as convenient as using a more specialized tool.
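To make that concrete, here is a small hand-written lexer structured the way a recursive descent parser would be, with one function per token "rule" working directly on characters. The token kinds (`NUMBER`, `IDENT`, `OP`) and function names are assumptions chosen for this sketch.

```python
def number(text, pos):
    """Rule: NUMBER := digit+"""
    start = pos
    while pos < len(text) and text[pos].isdecimal():
        pos += 1
    return ("NUMBER", text[start:pos]), pos


def identifier(text, pos):
    """Rule: IDENT := (letter | "_") (letter | digit | "_")*"""
    start = pos
    while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
        pos += 1
    return ("IDENT", text[start:pos]), pos


def lex(text):
    """Top-level rule: dispatch on the first character of each token."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
        elif ch.isdecimal():
            token, pos = number(text, pos)
            tokens.append(token)
        elif ch.isalpha() or ch == "_":
            token, pos = identifier(text, pos)
            tokens.append(token)
        elif ch in "+-*/()=":
            tokens.append(("OP", ch))
            pos += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {pos}")
    return tokens
```

It works, but a regular-expression table or a lexer generator usually expresses the same rules in fewer lines, which is the convenience argument made above.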

To be more explicit: if, after looking at "2*", you generate machine code for computing the double of something, this code is sort of ... You'll use recursive calls in your parser to build the tree in memory.
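A minimal sketch of building the tree through recursive calls is shown below, assuming a toy grammar with only `+`, `*`, integers and parentheses. The node shapes (`("num", n)` and `(op, left, right)` tuples) and the function names are invented for the example. A separate `evaluate` function then walks the finished tree, which is the simplest form of interpreter.

```python
def parse_expr(tokens, pos=0):
    """expr := term ("+" term)*"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos


def parse_term(tokens, pos):
    """term := factor ("*" factor)*"""
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_factor(tokens, pos + 1)
        node = ("*", node, right)
    return node, pos


def parse_factor(tokens, pos):
    """factor := NUMBER | "(" expr ")" """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        node, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if token.isdecimal():
        return ("num", int(token)), pos + 1
    raise SyntaxError(f"unexpected token {token!r}")


def evaluate(node):
    """Walk the tree after parsing has finished."""
    if node[0] == "num":
        return node[1]
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b
```

Because the tree exists in memory before anything is computed, the same parser could feed a code generator instead of `evaluate`.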

or something in between (an interpreter wrapped around an intermediate language). If you want an interpreter, a recursive descent parser.

parsing - Lexer and parser in C++ from EBNF - Stack Overflow