Review of Parser library begins today

Louis Tatta
Feb 19th, 2024

The review of Zach Laine’s proposed Boost.Parser library begins today and will end on February 28th.

From the introduction page of the documentation:

Boost.Parser is a parser combinator library. That is, it consists of a set of low-level primitive parsers, and operations that can be used to combine those parsers into more complicated parsers.

There are primitive parsers that parse epsilon (the empty string), chars, ints, floats, etc.

There are operations which combine parsers to create new parsers. For instance, the Kleene star operation takes an existing parser p and creates a new parser that matches zero or more occurrences of whatever p matches. Both callable objects and operator overloads are used for the combining operations. For instance, operator*() is used for Kleene star, and you can also write repeat(n)[p] to create a parser for exactly n repetitions of p.
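To make the idea of combining operations concrete, here is a minimal std-only C++ sketch of a toy combinator model. It is not Boost.Parser's implementation or API: the names lit, star, and repeat, and the convention of returning a position (or npos on failure), are assumptions made for this illustration. Boost.Parser itself spells these operations operator*() and repeat(n)[p].

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model (an assumption of this sketch): a parser is any callable taking
// (input, position) and returning the new position on success, or npos on
// failure.
const std::size_t npos = std::string::npos;

// Primitive parser: match exactly one occurrence of the character c.
inline auto lit(char c) {
    return [c](const std::string& in, std::size_t pos) -> std::size_t {
        return (pos < in.size() && in[pos] == c) ? pos + 1 : npos;
    };
}

// Kleene star: apply p zero or more times. This combinator never fails;
// matching zero occurrences simply returns the starting position.
template <typename P>
auto star(P p) {
    return [p](const std::string& in, std::size_t pos) -> std::size_t {
        for (;;) {
            const std::size_t next = p(in, pos);
            // Stop when p fails, or when p succeeds without consuming input
            // (which would otherwise loop forever).
            if (next == npos || next == pos) return pos;
            pos = next;
        }
    };
}

// Exactly n repetitions of p; fails if fewer than n can be matched.
template <typename P>
auto repeat(std::size_t n, P p) {
    return [n, p](const std::string& in, std::size_t pos) -> std::size_t {
        for (std::size_t i = 0; i < n; ++i) {
            pos = p(in, pos);
            if (pos == npos) return npos;
        }
        return pos;
    };
}
```

With these definitions, star(lit('a')) applied to "aaab" at position 0 stops at position 3, while repeat(4, lit('a')) on the same input fails because only three 'a' characters are available.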

Boost.Parser also tries to accommodate the multiple ways that people often want to get a parse result out of their parsing code. Some parsing may best be done by returning an object that represents the result of the parse. Other parsing may best be done by filling in a preexisting data structure. Yet other parsing may best be done by parsing small sections of a large document, and reporting the results of subparsers as they are finished, via callbacks. Boost.Parser accommodates all these ways of working, and even makes it possible to do callback-based or non-callback-based parsing without rewriting any code (except by changing the top-level call from parse() to callback_parse()).
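The difference between returning a result and reporting through callbacks can be sketched in plain C++. The following is an illustration only, not Boost.Parser code: parse_int_list, parse_ints, callback_parse_ints, and ParseResult are hypothetical names, and the grammar (comma-separated non-negative integers, assumed small enough not to overflow int) is chosen just to keep the example short. The point is that the parsing logic is written once and only the top-level entry point changes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Shared parsing logic: reads comma-separated non-negative integers and
// hands each one to `sink` as soon as it is complete. Returns true only if
// the whole input was consumed.
template <typename Sink>
bool parse_int_list(const std::string& in, Sink sink) {
    std::size_t pos = 0;
    for (;;) {
        if (pos >= in.size() ||
            !std::isdigit(static_cast<unsigned char>(in[pos])))
            return false;  // an integer is required here
        int value = 0;
        while (pos < in.size() &&
               std::isdigit(static_cast<unsigned char>(in[pos])))
            value = value * 10 + (in[pos++] - '0');
        sink(value);
        if (pos == in.size()) return true;   // clean end of input
        if (in[pos++] != ',') return false;  // otherwise expect a separator
    }
}

// Style 1: return an object that represents the result of the parse.
struct ParseResult {
    bool ok;
    std::vector<int> values;
};

inline ParseResult parse_ints(const std::string& in) {
    ParseResult r{false, {}};
    r.ok = parse_int_list(in, [&r](int v) { r.values.push_back(v); });
    if (!r.ok) r.values.clear();  // do not expose a partial result
    return r;
}

// Style 2: report each value through a caller-supplied callback.
template <typename Callback>
bool callback_parse_ints(const std::string& in, Callback on_int) {
    return parse_int_list(in, on_int);
}
```

Note that in the callback style, values seen before a later failure have already been reported, which is an inherent trade-off of reporting results as subparsers finish.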

All of Boost.Parser's public interfaces are sentinel- and range-friendly, just like the interfaces in std::ranges.
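As a rough illustration of what "sentinel-friendly" means, the sketch below parses from an iterator up to an end marker whose type may differ from the iterator's. It is a std-only example under stated assumptions, not Boost.Parser's interface: cstr_end and parse_uint are hypothetical names introduced here. The sentinel lets a null-terminated string be parsed without first computing its length.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel type: a char pointer "equals" it when it points at
// the terminating '\0'.
struct cstr_end {};
inline bool operator==(const char* p, cstr_end) { return *p == '\0'; }
inline bool operator!=(const char* p, cstr_end) { return *p != '\0'; }

// Parses an unsigned decimal integer from [first, last). Iter and Sentinel
// may be different types; all that is required is that they can be compared.
// On success, `first` is advanced past the digits and `out` holds the value.
// Overflow is not checked in this sketch.
template <typename Iter, typename Sentinel>
bool parse_uint(Iter& first, Sentinel last, unsigned& out) {
    if (first == last || *first < '0' || *first > '9') return false;
    unsigned value = 0;
    while (first != last && *first >= '0' && *first <= '9') {
        value = value * 10 + static_cast<unsigned>(*first - '0');
        ++first;
    }
    out = value;
    return true;
}
```

The same function template works with an ordinary iterator pair (for example two const char* values), because a pair of same-typed iterators is just the special case where the sentinel type equals the iterator type.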

Boost.Parser is Unicode-aware through and through. When you parse ranges of char, Boost.Parser does not assume any particular encoding — not Unicode or any other encoding. Parsing of inputs other than plain chars assumes that the input is Unicode. In the Unicode-aware code paths, all parsing is done by matching code points. This means that you can feed UTF-8 strings into Boost.Parser, both as input and within your parser, and the right sort of matching occurs. For instance, if your parser is trying to match repetitions of the char '\xcc' (which is a lead byte from a UTF-8 sequence, and so is malformed UTF-8 if not followed by an appropriate UTF-8 code unit), it will not match the start of "\xcc\x80" (UTF-8 for the code point U+0300). Boost.Parser knows that the matching must be whole-code-point, and so it interprets the char '\xcc' as the code point U+00CC.
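The '\xcc' example can be checked with a small std-only sketch that contrasts byte-level and code-point-level matching. This is an illustration, not Boost.Parser's decoder: decode_utf8, starts_with_code_point, and starts_with_byte are hypothetical helpers, and the decoder is deliberately minimal (it replaces malformed sequences with U+FFFD and does not reject overlong forms or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal UTF-8 decoder for this sketch. Malformed input yields U+FFFD.
inline std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = i + len <= s.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Code-point-level match: does the UTF-8 input start with code point cp?
inline bool starts_with_code_point(const std::string& utf8, std::uint32_t cp) {
    const std::vector<std::uint32_t> cps = decode_utf8(utf8);
    return !cps.empty() && cps[0] == cp;
}

// Byte-level match, shown only for contrast.
inline bool starts_with_byte(const std::string& s, char byte) {
    return !s.empty() && s[0] == byte;
}
```

A byte-level comparison finds '\xcc' at the start of "\xcc\x80", but a code-point-level comparison does not find U+00CC there, because the two bytes decode to the single code point U+0300. U+00CC itself is encoded in UTF-8 as "\xc3\x8c".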

Error reporting is important to get right, and it is important to make errors easy to understand, especially for end-users. Boost.Parser produces runtime parse error messages that are very similar to the diagnostics that you get when compiling with GCC and Clang (it even supports warnings that don't fail the parse). The exact token associated with a diagnostic can be reported to the user, with the containing line quoted, and with a marker pointing right at the token. Boost.Parser takes care of this for you; your parser does not need to include any special code to make this happen. Of course, you can also replace the error handler entirely, if it doesn't fit your needs.
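For readers unfamiliar with this style of diagnostic, the sketch below shows one way to produce a compiler-like message with the offending line quoted and a caret under the failing position. It is not Boost.Parser's error handler or its exact output format: format_diagnostic and its parameters are hypothetical, and columns are counted in bytes, so tabs and multi-byte characters would misalign the caret.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a "file:line:column: error: message" header, followed by the source
// line containing `offset` and a caret marker under the failing position.
inline std::string format_diagnostic(const std::string& filename,
                                     const std::string& text,
                                     std::size_t offset,
                                     const std::string& message) {
    assert(offset <= text.size());

    // Locate the start of the line containing `offset` and count lines.
    std::size_t line_no = 1;
    std::size_t line_start = 0;
    for (std::size_t i = 0; i < offset; ++i) {
        if (text[i] == '\n') {
            ++line_no;
            line_start = i + 1;
        }
    }

    // Locate the end of that line (exclusive of the newline).
    std::size_t line_end = text.find('\n', offset);
    if (line_end == std::string::npos) line_end = text.size();

    const std::size_t column = offset - line_start;  // zero-based, in bytes

    std::string out = filename + ":" + std::to_string(line_no) + ":" +
                      std::to_string(column + 1) + ": error: " + message + "\n";
    out += text.substr(line_start, line_end - line_start) + "\n";
    out += std::string(column, ' ') + "^\n";
    return out;
}
```

For the input "a = 1\nb = ?\nc = 3\n" with offset 10 (the '?'), this reports line 2, column 5, quotes the line "b = ?", and places the caret under the question mark.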

Debugging complex parsers can be a real nightmare. Boost.Parser makes it trivial to get a trace of your entire parse, with easy-to-read (and very verbose) indications of where each part of the trace is within the parse, the state of values produced by the parse, etc. Again, you don't need to write any code to make this happen — you just pass a parameter to parse().
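The idea of switching on a trace with a single argument can be sketched as follows. This is a toy, std-only illustration and not Boost.Parser's trace facility or output format: trace_mode and parse_statement are hypothetical names, and the grammar (one or more digits followed by a semicolon) exists only to give the trace something to report.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

enum class trace_mode { off, on };

// Parses "<digits>;" (for example "42;"). When tracing is on, every
// sub-parser reports what it matched to `log`; the parsing logic itself is
// identical in both modes.
inline bool parse_statement(const std::string& in, std::ostream& log,
                            trace_mode mode = trace_mode::off) {
    const bool tracing = (mode == trace_mode::on);
    if (tracing) log << "[statement] input=\"" << in << "\"\n";

    // Sub-parser 1: one or more digits.
    std::size_t pos = 0;
    while (pos < in.size() && in[pos] >= '0' && in[pos] <= '9') ++pos;
    const bool digits_ok = pos > 0;
    if (tracing) {
        log << "  [digits] "
            << (digits_ok ? "matched \"" + in.substr(0, pos) + "\""
                          : std::string("no match"))
            << "\n";
    }

    // Sub-parser 2: a semicolon that must also end the input.
    const bool semi_ok =
        digits_ok && pos + 1 == in.size() && in[pos] == ';';
    if (tracing) {
        log << "  [semicolon] " << (semi_ok ? "matched" : "no match") << "\n";
        log << "[statement] " << (semi_ok ? "success" : "failure") << "\n";
    }
    return semi_ok;
}
```

Calling parse_statement("42;", log) writes nothing to log, while adding trace_mode::on as the last argument produces an indented record of each step without any change to the parser itself.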