Automatic syntax error reporting and recovery in parsing expression grammars

Error recovery is an essential feature for a parser that should be plugged into Integrated Development Environments (IDEs), which must build Abstract Syntax Trees (ASTs) even for syntactically invalid programs in order to offer features such as automated refactoring and code completion. Parsing Expression Grammars (PEGs) are a formalism that naturally describes recursive top-down parsers using a restricted form of backtracking. Labeled failures are a conservative extension of PEGs that adds an error reporting mechanism for PEG parsers, and these labels can also be associated with recovery expressions to provide an error recovery mechanism. These expressions can use the full expressivity of PEGs to recover from syntactic errors. Manually annotating a large grammar with labels and recovery expressions can be difficult. In this work, we present two approaches, Standard and Unique, to automatically annotate a PEG with labels, and to build their corresponding recovery expressions. The Standard approach annotates a grammar in a way similar to manual annotation, but it may insert labels incorrectly, while the Unique approach is more conservative when annotating a grammar and does not insert labels incorrectly. We evaluate both approaches by using them to generate error recovering parsers for four programming languages: Titan, C, Pascal and Java. In our evaluation, the parsers produced using the Standard approach, after a manual intervention to remove the incorrectly added labels, gave an acceptable recovery for at least 70% of the files in each language. In turn, the acceptable recovery rate of the parsers produced via the Unique approach, without the need for manual intervention, ranged from 41% to 76%.

Introduction
Integrated Development Environments (IDEs) often require parsers that can recover from syntax errors and build syntax trees even for syntactically invalid programs, in order to conduct further analyses necessary for IDE features such as automated refactoring and code completion.
Parsing Expression Grammars (PEGs) [1] are a formalism used to describe the syntax of programming languages, as an alternative to Context-Free Grammars (CFGs). We can view a PEG as a formal description of a recursive top-down parser for the language it describes. PEGs have a concrete syntax based on the syntax of regexes, or extended regular expressions. Unlike CFGs, PEGs avoid ambiguities in the definition of the grammar's language by construction, due to the use of an ordered choice operator.
The ordered choice operator naturally maps to restricted (or local) backtracking in a recursive top-down parser. The alternatives of a choice are tried in order; when the first alternative recognizes an input prefix, no other alternative of this choice is tried, but when an alternative fails to recognize an input prefix, the parser backtracks to the input position it was at before trying this alternative and then tries the next one.
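The behavior of ordered choice can be illustrated with a few parser combinators. The following is only a minimal sketch: the names lit, seq and choice are hypothetical helpers introduced here for illustration, and they are not part of any tool discussed in this paper.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]

def lit(s: str) -> Parser:
    """Match the literal string s at the current position."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers: Parser) -> Parser:
    """Match each parser in turn; fail as soon as one of them fails."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*alternatives: Parser) -> Parser:
    """Ordered choice: every alternative is tried from the same position."""
    def parse(text: str, pos: int) -> Optional[int]:
        for alt in alternatives:
            result = alt(text, pos)   # local backtracking: restart at pos
            if result is not None:
                return result         # first success wins, the rest are not tried
        return None
    return parse

# ('a' / 'ab') 'c': the choice commits to 'a', so "abc" is rejected.
committed = seq(choice(lit("a"), lit("ab")), lit("c"))
# ('ab' / 'a') 'c': reordering the alternatives accepts both "abc" and "ac".
reordered = seq(choice(lit("ab"), lit("a")), lit("c"))
```

Because the choice never revisits an alternative once an earlier one has succeeded, the order of the alternatives is part of the meaning of the grammar, which is what rules out ambiguity.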
A naive interpretation of PEGs is problematic when dealing with inputs with syntactic errors, as a failure during parsing an input is not necessarily an error, but can be just an indication that the parser should backtrack and try another alternative. Labeled failures [2,3] are a conservative extension of PEGs that address this problem of error reporting in PEGs by using explicit error labels, which are distinct from a regular failure. We throw a label to signal an error during parsing, and each label can then be tied to a specific error message.
We can leverage the same labels to add an error recovery mechanism, by attaching a recovery expression to each label. This expression is just a regular parsing expression, and it usually skips the erroneous input until reaching a synchronization point, while producing a dummy AST node [4,5].
Labeled failures produce good error messages and error recovery, but they can add a considerable annotation burden in large grammars, as each point where we want to signal and recover from a syntactic error must be explicitly marked.
In a previous work [6], we presented Algorithm Standard, which automatically annotates a PEG with labels and builds their corresponding recovery expressions. We evaluated the use of this algorithm to build an error recovering parser for the Titan programming language.
This paper extends the previous one by also evaluating the use of Algorithm Standard to build error recovering parsers for C, Pascal and Java.
As pointed out in [6], Algorithm Standard may add some labels incorrectly, which would prevent the parser from recognizing syntactically valid programs.
In this paper we try to address this issue by proposing Algorithm Unique, which inserts labels in a more conservative way. The use of Algorithm Unique avoids the problem of adding labels incorrectly, although it inserts fewer labels than Algorithm Standard.
Overall, our experiments show that Algorithm Standard can be used to produce error recovering parsers with the help of manual intervention, which was small in the case of our Titan, C, and Pascal grammars, and more significant in the case of Java. In turn, Algorithm Unique can be used to automatically generate functional error recovering parsers, whose error recovery quality is lower than that of the parsers obtained via Algorithm Standard.
The remainder of this paper is organized as follows: Section 2 discusses error recovery in PEGs using labeled failures and recovery expressions; Section 3 shows Algorithm Standard, which automatically annotates a PEG with labels and associates a recovery expression to each label; Section 4 evaluates the use of Algorithm Standard to annotate the grammars of four programming languages: Titan, C, Pascal, and Java; Section 5 discusses conservative approaches to insert labels and presents Algorithm Unique, which inserts only correct labels; Section 6 compares the use of both algorithms to annotate Titan, C, Pascal and Java grammars; Section 7 discusses related work on error reporting and error recovery; finally, Section 8 gives some concluding remarks.

Error recovery in PEGs with labeled failures
In this section we present an introduction to labeled PEGs and discuss how to build an error recovery mechanism for PEGs by attaching a recovery expression to each labeled failure.
A labeled PEG G is a tuple (V, T, P, L, R, fail, p_S), where V is a finite set of non-terminals, T is a finite set of terminals, P is a total function from non-terminals to parsing expressions, L is a finite set of labels, R is a function from labels to parsing expressions, fail ∉ L is a failure label, and p_S is the initial parsing expression. We will use the term recovery expression when referring to the parsing expression associated with a given label. We will assume that V = V_Lex ∪ V_Syn, where V_Lex is the set of non-terminals that match lexical elements, also known as tokens, and V_Syn represents the non-terminals that match syntactical elements. When describing the PEG for a given language, we will use names in uppercase for the lexical non-terminals. From now on, unless otherwise noted, we will use PEG as a synonym for labeled PEG.
We describe the function P as a set of rules of the form A ← p, where A ∈ V and p is a parsing expression. A parsing expression p, when applied to an input string s, either succeeds or fails. When the matching of p succeeds, it consumes a prefix of the input and returns the remaining suffix, and when it fails, it produces a label, associated with an input suffix. The abstract syntax of parsing expressions is as follows, where p, p1 and p2 are parsing expressions: ε represents the empty string, a ∈ T denotes a terminal, A ∈ V represents a non-terminal, p1 p2 is a concatenation, p1 / p2 is an ordered choice, p* indicates zero or more repetitions, !p is a negative predicate, and ⇑l throws a label l ∈ L.
Fig. 1 presents the semantics of labeled PEGs with error recovery as a set of inference rules for a PEG function. The notation G[p] R xy ⇝PEG y represents a successful matching of the parsing expression p in the context of a PEG G against the subject xy with a map R from labels to recovery expressions, consuming x and leaving the suffix y. In turn, the notation G[p] R xy ⇝PEG (f, y) represents an unsuccessful match of p, where the label f ∈ L ∪ {fail} was thrown when trying to match the suffix y. We will usually use f to represent a label in the set L ∪ {fail}, and l to represent a label in L. The notation G[p] R xy ⇝PEG X indicates that the matching result can be either y or (f, y).
The semantics given here is essentially the same semantics for PEGs with labels presented in previous work [5,4], with two simplifications: we neither track the farthest failure nor keep a list of the errors that occurred during a match. We did this to make a formal discussion about the correct insertion of labels more amenable.
We can see in Fig. 1 that failing to match a terminal (rules term.2 and term.3) gives us the label fail, while a throw expression (rules throw.1 and throw.2) may give us a label different from fail. The recovery map R is simply passed along. The exceptions are the rules for the syntactic predicate and for throwing labels.
A label l ≠ fail thrown by ⇑l cannot be caught by an ordered choice or a repetition (rules ord.2 and rep.2), so it indicates an actual error during parsing, while fail is a regular failure and it indicates that the parser should backtrack. In the original formalization of PEGs [1], there is only the label fail, thus the parser always tries to backtrack after failing to match a parsing expression.
The lookahead operator ! captures any label and turns it into a success (rule not.1), while turning a success into a fail label (rule not.2). In both rules we used an empty recovery map to make sure that errors are not recovered inside the predicate. The rationale is that errors inside a syntactic predicate are expected and not actually syntactic errors in the input.
Rule throw.1 is related to error reporting, while rule throw.2 is where error recovery happens. R(l) denotes the recovery expression associated with the label l. When a label l is thrown we check if R has a recovery expression associated with it. If it does not (throw.1), the matching result is l plus the current input, and this error is propagated, so parsing finishes after reaching the first syntactical error.
If label l has a recovery expression R(l) (rule throw.2), we try to match the current input by using R(l). As R(l) is a regular parsing expression, its matching may succeed, which essentially resumes regular parsing, or may fail, which may finish the parsing or not (the parser can still recover from this second error).
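The rules discussed above can be made concrete with a small interpreter. This is a minimal sketch and not the implementation used by LPegLabel: expressions are encoded as tuples, a success is an input position, a failure is a (label, position) pair, and the helper labeled assumes the usual reading of a labeled expression as p / ⇑l. All names in the code are hypothetical.

```python
FAIL = "fail"

def failed(result):
    """A failure is a (label, position) pair; a success is an int position."""
    return isinstance(result, tuple)

def match(p, text, pos, rec):
    """Match expression p at pos; rec maps labels to recovery expressions."""
    kind = p[0]
    if kind == "any":                       # any single character
        return pos + 1 if pos < len(text) else (FAIL, pos)
    if kind == "lit":                       # terminal: failing gives label fail
        s = p[1]
        return pos + len(s) if text.startswith(s, pos) else (FAIL, pos)
    if kind == "seq":
        r = match(p[1], text, pos, rec)
        return r if failed(r) else match(p[2], text, r, rec)
    if kind == "choice":                    # only fail triggers backtracking
        r = match(p[1], text, pos, rec)
        if failed(r) and r[0] == FAIL:
            return match(p[2], text, pos, rec)
        return r
    if kind == "star":                      # a label other than fail escapes the loop
        while True:
            r = match(p[1], text, pos, rec)
            if failed(r):
                return pos if r[0] == FAIL else r
            if r == pos:                    # guard against empty iterations
                return pos
            pos = r
    if kind == "not":                       # predicate runs with an empty recovery map
        r = match(p[1], text, pos, {})
        return pos if failed(r) else (FAIL, pos)
    if kind == "throw":
        label = p[1]
        if label in rec:                    # throw.2: resume with R(l)
            return match(rec[label], text, pos, rec)
        return (label, pos)                 # throw.1: report and stop
    raise ValueError(f"unknown expression: {kind}")

def labeled(p, label):
    """Assumed encoding of a labeled expression: p / throw(label)."""
    return ("choice", p, ("throw", label))

# '(' 'x' [')']^rparwhile '{}' : a toy stand-in for the head of a while statement
while_head = ("seq", ("lit", "("), ("seq", ("lit", "x"),
              ("seq", labeled(("lit", ")"), "rparwhile"), ("lit", "{}"))))
# (!'{' .)* : skip input until the beginning of a block
skip_to_brace = ("star", ("seq", ("not", ("lit", "{")), ("any",)))
```

Without a recovery expression, match(while_head, "(x y{}", 0, {}) reports the label together with the error position; with {"rparwhile": skip_to_brace} the same call skips the unexpected input and consumes the whole subject.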
When a PEG does not throw labels via expression ⇑l, we say it is an unlabeled PEG, as the following definition states:
Definition 1 (Unlabeled PEG). A PEG G = (V, T, P, L, R, fail, p_S) is unlabeled when ∀A ∈ V we have that expression ⇑l does not appear in P(A).
In an unlabeled PEG G, the function R is not relevant, as no label different from fail is thrown and thus rules throw.1 and throw.2 will not be used. In this case, the result of a matching is more specific: matching an expression against a subject x yields either a suffix x' of x or a pair (fail, x''), where x'' is also a suffix of x.
Below, we discuss an example that illustrates how to deal with syntax errors in PEGs by using labeled failures and recovery expressions.

Handling syntax errors in PEGs
In Fig. 2 we can see a PEG for a tiny subset of Java, where lexical rules (shown in uppercase) have been elided. While simple (this PEG is almost equivalent to an LL(1) CFG), this subset is a good starting point to discuss error recovery in the context of PEGs.
To get a parser with error recovery, we first need to have a parser that correctly reports errors. One popular error reporting approach for PEGs is to report the farthest failure position [7,3], an approach that is supported by PEGs with labels [4]. However, the use of the farthest failure position makes it harder to recover from an error, as the error is only known after parsing finishes and all the parsing context at the moment of the error has been lost. Because of this, we will focus on using labeled failures for error reporting in PEGs.
We need to annotate our original PEG with labels, which indicate the points where we can signal a syntactical error. The strategy we used to annotate the grammar was to annotate every symbol (terminal or non-terminal) in the right-hand side of a production that should not fail, as a failure would just make the whole parser either fail or not consume the input entirely. For a nearly LL(1) grammar, like the one in our example, that means all symbols in the right-hand side of a production, except the first one. We apply the same strategy when the right-hand side has a choice or a repetition as a subexpression.
We can associate each label with an error message. For example, in rule whileStmt the label rparwhile is thrown when we fail to match a closing parenthesis, so we could attach an error message like "missing ')' in while" to this label. Dynamically, when the matching of RPAR fails and we throw rparwhile, we could enhance this message with information related to the input position where this error happened.
Let us consider the example Java program from Fig. 4, which has two syntax errors: a missing ')' at line 5, and a missing ';' at the end of line 7. For this program, a parser based on the labeled PEG from Fig. 3 would give us a message like:

factorial.java:5: syntax error, missing ')' in while

The second error will not be reported because the parser did not recover from the first one, since rparwhile still has no recovery expression associated with it.
The recovery expression p_r of a label l matches the input from the point where l was thrown. If p_r succeeds then regular parsing is resumed as if the label had not been thrown. Usually p_r should just skip part of the input until it is safe to resume parsing. In rule whileStmt, we can see that after the ')' we expect to match a stmt, so the recovery expression of label rparwhile could skip the input until it encounters the beginning of a statement.
In order to define a safe input position to resume parsing, we will use the classical FIRST and FOLLOW sets. A more detailed discussion about FIRST and FOLLOW sets in the context of PEGs can be found in other papers [8,9,10].
With the help of these sets, we can define the following recovery expression for rparwhile, where eatToken is a rule that matches an input token:

(!FIRST(stmt) eatToken)*
Now, when label rparwhile is thrown, its recovery expression matches the input until it finds the beginning of a statement, and then regular parsing resumes.
The parser will now also throw label semiassign and report the second error, the missing semicolon at the end of line 7. In case semiassign has an associated recovery expression, this expression will be used to try to resume regular parsing again.
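At the level of a token stream, a recovery expression of the form (!FIRST(stmt) eatToken)* amounts to skipping tokens until a synchronization token is found. The sketch below illustrates this reading only; the token kinds in FIRST_STMT are hypothetical and do not come from Fig. 2, and skip_to_sync is a name introduced here.

```python
# Hypothetical token kinds that could start a statement; the actual FIRST(stmt)
# depends on the grammar and is not reproduced here.
FIRST_STMT = {"IF", "WHILE", "LCUR", "NAME", "PRINTLN"}

def skip_to_sync(tokens, pos, sync):
    """Token-level reading of (!sync eatToken)*.

    Returns the first index >= pos whose token kind is in sync,
    or len(tokens) when no synchronization token is left.
    """
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1
    return pos

# A while statement whose ')' is missing: the error is detected at index 5.
tokens = ["WHILE", "LPAR", "NAME", "GT", "NUMBER", "LCUR", "RCUR"]
resume_at = skip_to_sync(tokens, 5, FIRST_STMT)
```

In this example the token at the error position already starts a statement, so the recovery consumes nothing and parsing resumes immediately, as if the ')' had been present.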
Even our toy grammar has 26 distinct labels, each needing a recovery expression to recover from all possible syntactic errors. While most of these expressions are trivial to write, this is still burdensome, and for real grammars the problem is compounded by the fact that they can easily need a small multiple of this number of labels. In the next section, we present an approach to automatically annotate a grammar with labels and recovery expressions in order to provide a better starting point for larger grammars.

Automatic insertion of labels and recovery expressions
The use of labeled failures trades better precision in error messages, and the possibility of having error recovery, for an increased annotation burden, as the grammar writer is responsible for annotating the grammar with the appropriate labels. In this section, we show how this process can be partially automated.
To automatically annotate a grammar, we need to determine when it is safe to signal an error: we should only throw a label after expression p fails if that failure always implies that the whole parse will fail or not consume the input entirely, so it is useless to backtrack. This is easy to determine when we have a nearly LL(1) grammar, as is the case with the PEG from Fig. 2. As we mentioned in Section 2, for an LL(1) grammar the general rule is that we should annotate every symbol (terminal or non-terminal) in the right-hand side of a production after consuming at least one token, which in general leads to annotating every symbol in the right-hand side of a production except the first one.
Although many PEGs are not LL(1), we can use this approach to annotate what would be the LL(1) parts of a non-LL(1) grammar. We will discuss some limitations of this approach in the next section, when we evaluate its application to annotate PEG-based parsers for the programming languages Titan, C, Pascal and Java.
While annotating a PEG with labels we can add an automatically generated recovery expression for each label, based on the tokens that could follow it. We assume the tokens of a grammar are described by the non-terminals A ∈ V_Lex. Moreover, we also assume that at most one non-terminal A ∈ V_Lex matches a prefix of the current input, which we call the unique token prefix property. In the definition of this property, we assume an unlabeled PEG to make sure we would not recover from an error when matching a lexical non-terminal. Alternatively, we could have considered a labeled PEG with an empty recovery function.
By assuming a grammar with the unique token prefix property we do not have to worry about which lexical non-terminal should come first in a choice (e.g., an alternative that matches "=" can come before one that matches "=="). Such a property is useful, for example, when automatically computing a choice with the tokens that a recovery expression should match. The unique token prefix property can be easily achieved with the help of predicates. For example, we could define a non-terminal to match input "=" as ATRIB ← '=' !'='.
Moreover, when a PEG G has the unique token prefix property the sequence of tokens matched for a given input is unique, as stated below, where we assumed, as previously, an unlabeled PEG to avoid recovering in case of an error:
Lemma 2 (Unique token sequence). Given an unlabeled PEG G = (V_Lex ∪ V_Syn, T, P, L, R, fail, p_S), with the unique token prefix property, and a subject w, the sequence in which the lexical non-terminals in V_Lex match w is unique.
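The effect of the predicate in ATRIB ← '=' !'=' can be checked with regular expressions, where a negative lookahead plays the role of the PEG predicate. This is only an illustrative sketch: the token name EQ and the helper matching_tokens are assumptions introduced here.

```python
import re

# Lexical rules without the predicate: both match a prefix of "==".
NAIVE = {"ATRIB": r"=", "EQ": r"=="}
# ATRIB <- '=' !'=' written with a negative lookahead.
WITH_PREDICATE = {"ATRIB": r"=(?!=)", "EQ": r"=="}

def matching_tokens(rules, text, pos=0):
    """Names of the lexical rules that match a prefix of text at pos."""
    return sorted(name for name, pattern in rules.items()
                  if re.compile(pattern).match(text, pos))
```

With NAIVE, both rules match a prefix of "== 1", so the order of the alternatives in a choice would matter. With WITH_PREDICATE, at most one rule matches at any position, which is the unique token prefix property for these two tokens.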
Proof. By contradiction. Assume the sequence is not unique. This implies that for some suffix ax of w we would have that two distinct lexical non-terminals match a prefix of ax, which contradicts the unique token prefix property.

Algorithm Standard (excerpt; the line numbers are the ones referenced in the text):

 8:    if p = A and ε ∉ FIRST(A) and seq then
 9:        return addlab(p, flw)
10:    else if p = p1 p2 then
       ...
13:        return px py
14:    else if p = p1 / p2 then
       ...
19:        if seq and ε ∉ FIRST(p1 / p2) then
20:            return addlab(px / py, flw)
21:        else
22:            return px / py
       ...
26:        return p
27:
28: function calck(p, flw)
29:    if ε ∈ FIRST(p) then
30:        return ...

The functions FIRST and FOLLOW give their results with respect to the grammar G passed to function annotate. We also assume grammar G from function annotate is available in function addlab.
Function annotate (lines 1-5) generates a new annotated grammar G' from a grammar G. It uses labexp (lines 7-26) to annotate the right-hand side, a parsing expression, of each syntactical rule of grammar G. The auxiliary function calck (lines 28-32) is used to update the FOLLOW set associated with a parsing expression. In turn, the auxiliary function addlab (lines 34-37) receives a parsing expression p to annotate and its associated FOLLOW set flw. Function addlab associates a label l with p and also builds a recovery expression for l based on flw. The expression eatToken, which matches an input token, can be generated from the lexical rules of G. We assume G has the unique token prefix property when computing eatToken automatically.
Algorithm Standard annotates every right-hand side, instead of going top-down from the root, to not be overly conservative and fail to annotate non-terminals reachable only from non-LL(1) choices but which themselves might be LL(1). We will see in Section 4 that this has the unfortunate result of sometimes changing the language being parsed, which is the major shortcoming of Algorithm Standard.
Function labexp has three parameters. The first one, p, is a parsing expression that we will try to annotate. The second parameter, seq, is a boolean value that indicates whether the current concatenation consumes at least one token before p or not. Finally, the parameter flw represents the FOLLOW set associated with p. Let us now discuss how labexp tries to annotate p.
When p is a non-terminal expression and it is part of a concatenation that already matched at least one token (lines 8-9), then we associate a new label with p. In case p represents a non-terminal but seq is not true, we will just return p itself (lines 25-26). In line 8, we also test whether A matches the empty string or not. This avoids polluting the grammar with labels which will never be thrown, since a parsing expression that matches the empty string does not fail.
In case of a concatenation p1 p2 (lines 10-13), we try to annotate p1 and p2 recursively. To annotate p1 we use an updated FOLLOW set, and to annotate p2 we set its parameter seq to true whenever seq is already true or p1 does not match the empty string.
In case of a choice p1 / p2 (lines 14-22), we annotate p2 recursively and in case the choice is disjoint we also annotate p1 recursively. In both cases, we pass the value false as the second parameter of labexp, since failing to match the first symbol of an alternative should not signal an error. When seq is true, we associate a label with the whole choice when it does not match the empty string.
In case p is a repetition p1* (lines 23-24), we can annotate p1 if we have a disjoint repetition, i.e., if there is no intersection between FIRST(p1) and flw. When annotating p1 we pass false as the second parameter of labexp because failing to match the first symbol of a repetition should not signal an error.
Our concrete implementation of Algorithm Standard also adds labels in case of repetitions of the form p1+, which should match p1 at least once, and p1?, which should match p1 at most once. As these cases are similar to the case of p1*, we will not discuss them here.
Given the PEG from Fig. 2, function annotate would give us the grammar presented in Fig. 3 (as previously, we are not taking rule prog into consideration), with the exception of the annotation [stmt]^elsestmt. Label elsestmt was not inserted at this point because token ELSE may follow the choice ELSE stmt / ε, so this choice is not disjoint (the well-known dangling else problem). In Fig. 3, we associated the label elsestmt with stmt. This indicates that an else must be associated with the nearest if statement.
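The case analysis performed by labexp can be sketched in Python as follows. This is a reading of the prose description above and not the implementation found in pegparser. Several details are assumptions and are marked as such in the comments: the tuple encoding of expressions, the exact disjointness test used for choices, the FOLLOW set passed to the body of a repetition, the definition of calck, and the label names of the form Err_NNN. The FOLLOW set of the rule being annotated is supplied by hand.

```python
EPS = "ε"

def cat(*ps):
    """Right-nested concatenation of one or more expressions."""
    return ps[0] if len(ps) == 1 else ("seq", ps[0], cat(*ps[1:]))

def first(p, g):
    """FIRST set of expression p in grammar g (assumes no left recursion)."""
    kind = p[0]
    if kind == "empty":
        return {EPS}
    if kind == "tok":                       # lexical non-terminal
        return {p[1]}
    if kind == "var":                       # syntactical non-terminal
        return first(g[p[1]], g)
    if kind == "seq":
        f1 = first(p[1], g)
        return f1 if EPS not in f1 else (f1 - {EPS}) | first(p[2], g)
    if kind == "choice":
        return first(p[1], g) | first(p[2], g)
    if kind == "star":
        return first(p[1], g) | {EPS}
    raise ValueError(kind)

def calck(p, flw, g):
    """Assumed definition: the tokens that may follow whatever precedes p."""
    f = first(p, g)
    return (f - {EPS}) | flw if EPS in f else f

def addlab(p, flw, labels):
    """Attach a fresh label to p; its recovery skips tokens until one in flw."""
    label = f"Err_{len(labels) + 1:03d}"    # hypothetical naming scheme
    labels[label] = sorted(flw)
    return ("lab", p, label)

def labexp(p, seq, flw, g, labels):
    kind = p[0]
    if kind in ("tok", "var") and seq and EPS not in first(p, g):
        return addlab(p, flw, labels)
    if kind == "seq":
        px = labexp(p[1], seq, calck(p[2], flw, g), g, labels)
        py = labexp(p[2], seq or EPS not in first(p[1], g), flw, g, labels)
        return ("seq", px, py)
    if kind == "choice":
        px = p[1]
        # Assumed disjointness test: p1 cannot start like p2 or like what follows.
        if not (first(p[1], g) & calck(p[2], flw, g)):
            px = labexp(p[1], False, flw, g, labels)
        py = labexp(p[2], False, flw, g, labels)
        if seq and EPS not in first(p, g):
            return addlab(("choice", px, py), flw, labels)
        return ("choice", px, py)
    if kind == "star" and not (first(p[1], g) & flw):
        # Assumed FOLLOW set for the body: it may be followed by itself or by flw.
        return ("star", labexp(p[1], False, first(p[1], g) | flw, g, labels))
    return p

# Toy grammar: whileStmt <- WHILE LPAR exp RPAR stmt, with simplified exp and stmt.
g = {
    "whileStmt": cat(("tok", "WHILE"), ("tok", "LPAR"), ("var", "exp"),
                     ("tok", "RPAR"), ("var", "stmt")),
    "exp": ("tok", "NAME"),
    "stmt": ("tok", "SEMI"),
}
labels = {}
annotated = labexp(g["whileStmt"], False, {"EOF"}, g, labels)
```

In this toy grammar every symbol except the leading WHILE receives a label, and the label attached to RPAR gets FIRST(stmt) as its synchronization set, which matches the recovery expression shown earlier for rparwhile.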
It is trivial to change the algorithm to leave any existing labels and recovery expressions in place, or to add recovery expressions to any labels that are already present but do not have recovery expressions.
After applying Algorithm Standard to automatically insert labels, a grammar writer can later add (or remove) labels and their associated recovery expressions. We discuss this further in the next section, where we evaluate the use of Algorithm Standard to add error recovery to the parsers of several programming languages.

Evaluating Algorithm Standard
To evaluate Algorithm Standard, we built PEG parsers for the programming languages Titan, C, Pascal and Java. To build such parsers we used LPegLabel, a tool that implements the semantics of PEGs with labeled failures, and pegparser, which automatically adds labels and recovery expressions to a PEG. When building the parsers, we focused on the syntactical rules, so we have omitted or simplified some lexical rules.
For each language, we first wrote an unlabeled version of the grammar based on some reference grammar. We have tried to follow the reference grammar syntactic structure to avoid a bias that could favor our algorithm. We used a set of syntactically valid and invalid programs to validate each parser.
Given an unlabeled grammar, we used pegparser to get an automatically annotated grammar following Algorithm Standard, with a recovery expression associated with each label. We will use the term generated when referring to this annotated grammar.
We will compare the generated grammar with a manually annotated grammar obtained from the unlabeled grammar. We used the same set of syntactically valid and invalid programs to validate the generated grammar and the manually annotated one.
In our comparison, we will check the labels of the generated grammar against the labels of the manually annotated grammar. We will discuss mainly the following items:

Equal: when the algorithm correctly inserted a label, as the manual annotation did.
Extra: when the algorithm correctly inserted a new label.
Wrong: when the algorithm incorrectly inserted a label.

Table 1 shows the result of comparing the automatically inserted labels with the manually inserted ones. Below, in Sections 4.1, 4.2, 4.3 and 4.4, we discuss the automatic insertion of labels for each language.
Ideally, we would want a generated grammar with the same labels as the manually annotated one, hopefully with a few new correct labels missed during manual annotation. To a certain extent, we do not consider failing to add some labels a serious flaw of Algorithm Standard, as long as most of the labels are correctly inserted, since failing to add labels does not lead to an incorrect parser. These (hopefully few) labels can still be manually inserted later by an expert.
A discrepancy related to item Wrong is more problematic, since it can produce a parser that does not recognize some syntactically valid programs. This limitation of our algorithm means that the output needs to be checked by the parser developer to ensure that the algorithm did not insert labels incorrectly. This checking can be done either by manual inspection of the grammar or by running the generated parser against test programs. In this latter case, when the parser fails to recognize a valid program, the parsing result will point to the incorrectly added label. Once identified, we need to remove the incorrect label from the grammar.
After analyzing how Algorithm Standard annotated the grammar of a given language, we will discuss the error recovering parser generated by it. During this discussion we will assume that we have already removed the labels that Algorithm Standard may have inserted incorrectly.
As we mentioned, Algorithm Standard associates a recovery expression to each label. To recover from a label l we add a recovery rule l to the grammar, where the right-hand side of l is its recovery expression. The generated grammar has a recovery rule associated with each label.
As pegparser automatically builds an AST when the match is successful, we will evaluate the error recovering parser obtained from a generated grammar by comparing the AST built by the parser for a syntactically invalid program with the AST of what would be an equivalent correct program. For the AST leaves associated with a syntax error, we do not require their contents to be the same, just the general type of the node, so we are comparing just the structure of the ASTs.
Based on this strategy, a recovery is excellent when it gives us an AST equal to the intended one. A good recovery gives us a reasonable AST, i.e., one that captures most information of the original program (e.g., it does not miss a whole block of commands). A poor recovery, in turn, produces an AST that loses too much program information. Finally, a recovery is rated as awful whenever it gives us an AST without any information about the program. Table 2 shows for how many programs of each language the recovery strategy we implemented was considered excellent, good, poor, or awful. Sections 4.1, 4.2, 4.3 and 4.4 discuss the results of error recovery for each language. To evaluate the manually annotated grammars, we added recovery rules based on the way Algorithm Standard generates recovery rules for labels.
To illustrate how we rated a recovery, let us consider the following syntactically invalid Titan program, where the range start of the for loop was not given at line 2:

1 sum = 0
2 for i = , 10 do
3     print(i)
4     sum = sum + i
5 end

A recovery would be excellent in case the AST has all the information associated with this program (such an AST should have a dummy node to represent the range start). A recovery would be good in case the resulting AST misses only the information about the loop range. In turn, a recovery would be rated as poor in case the resulting AST misses the statements inside the for (lines 3 and 4). Lastly, we would rate a recovery as awful in case it produced an AST only with dummy nodes.
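The structural comparison behind the excellent rating can be sketched as follows. The AST encoding, the node type names and the helper names are all hypothetical, and the sketch covers only the check for an excellent recovery; telling good, poor and awful recoveries apart relies on the judgment described above and is not automated here.

```python
def shape(node):
    """Reduce an AST to its structure: node types are kept, leaf contents dropped."""
    kind, payload = node
    if isinstance(payload, list):           # inner node: payload is a list of children
        return (kind, [shape(child) for child in payload])
    return (kind, None)                     # leaf: ignore the text it carries

def same_structure(recovered, intended):
    """True when both ASTs have the same node types in the same positions."""
    return shape(recovered) == shape(intended)

# Hypothetical ASTs for the loop in the example; "1" stands for the range
# start of an equivalent correct program.
body = ("Block", [("Call", [("Name", "print"), ("Name", "i")]),
                  ("Assign", [("Name", "sum"), ("Name", "sum + i")])])
intended = ("For", [("Name", "i"), ("Number", "1"), ("Number", "10"), body])
with_dummy = ("For", [("Name", "i"), ("Number", "<dummy>"), ("Number", "10"), body])
without_body = ("For", [("Name", "i"), ("Number", "<dummy>"), ("Number", "10"),
                        ("Block", [])])
```

Here with_dummy has the same structure as intended, since only the content of the dummy leaf differs, while without_body does not, because the statements inside the loop were lost.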
Below, based on the approach discussed previously, we evaluate the use of Algorithm Standard to generate error recovering parsers for the programming languages Titan, C, Pascal and Java.

Titan
Titan [11] is a new statically-typed programming language under development to be used as a sister language to the Lua programming language [12].
After some initial development, the Titan parser was manually annotated with labels to improve its error reporting. The original Titan parser has no error recovery: it stops parsing the input after encountering the first syntax error. Based on it, we wrote our unlabeled grammar for Titan, which has 50 syntactical rules.
The Titan grammar is not LL(1): there are non-LL(1) choices in 7 rules and non-LL(1) repetitions in 3 rules, but it has many LL(1) parts.
The manually annotated Titan grammar we got from our unlabeled grammar is equivalent to the original Titan grammar; we have just adapted the grammar syntax to be able to use the pegparser tool. The manually annotated grammar has 86 expressions that throw labels. Some labels, such as EndFunc, are thrown more than once, i.e., they are associated with more than one expression. We then applied Algorithm Standard to this unlabeled grammar and got an automatically annotated Titan grammar, with a recovery expression associated with each label. In Section 4.1.1, we compare the labels automatically inserted with the labels in the original Titan grammar. Then, in Section 4.1.2, we will discuss the error recovery mechanism of the generated Titan grammar.

Automatic insertion of labels
Algorithm Standard annotated the Titan grammar with 80 labels, which is close to the 86 labels of the original Titan grammar. A manual inspection revealed that usually the algorithm inserted labels at the same location of the original ones, as Table 1a shows. We could insert automatically around 90% of the labels inserted manually. Below we discuss the main issues related to the generated Titan grammar.
As expected, our approach did not annotate parts of the grammar where the alternatives of a choice were not disjoint, or in case of a non-disjoint repetition. This happened in 4 of the 50 grammar rules. One of these rules was castexp, which we show below:
    castexp ← simpleexp AS type / simpleexp
As we can see, both alternatives of the choice match a simpleexp, so these alternatives are not disjoint. After manual inspection, we can see it is possible to add a label to type in the first alternative, since the context where castexp appears in the rest of the grammar makes it clear that a failure on type is always a syntax error. Left-factoring the right-hand side of castexp to simpleexp (AS type / ε), or using the short form simpleexp (AS type)?, would give enough context for Algorithm Standard to correctly annotate type with a label, though.
The manually annotated Titan grammar uses an approach known as error productions [13]. As an example, the choice associated with rule statement has two extra alternatives whose only purpose is to match some common syntactically invalid statements, in order to provide a better error message. One of these alternatives is as follows: Before this alternative, the grammar has one that tries to match an assignment statement. That alternative might have failed because the programmer used an expression that is not a valid l-value in the left-hand side of the assignment. This error production guards against this case. Without the error production, the parser would still fail, but we would get an error related to not closing a function, which may be confusing for a user.
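The disjointness test that keeps castexp from being annotated can be illustrated with a small Python sketch. It is only an approximation of what Algorithm Standard does: it compares FIRST sets of the alternatives and ignores FOLLOW sets. The right-hand sides given for simpleexp and type, and the helper non-terminal castrest used for the left-factored version, are hypothetical placeholders and not the actual Titan rules.

```python
EPS = "ε"  # marker meaning "may match the empty string"

def first_of_seq(seq, first):
    """FIRST set of a sequence of symbols, given FIRST sets of the non-terminals."""
    out = set()
    for sym in seq:
        f = first[sym] if sym in first else {sym}  # unknown symbols are tokens
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)  # all symbols (or none at all) may match the empty string
    return out

def first_sets(grammar):
    """Least fixed point of FIRST; grammar maps a name to a list of alternatives."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                f = first_of_seq(alt, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

def disjoint_choice(grammar, nt):
    """True when no two alternatives of rule nt share a FIRST symbol."""
    first = first_sets(grammar)
    seen = set()
    for alt in grammar[nt]:
        f = first_of_seq(alt, first)
        if f & seen:
            return False
        seen |= f
    return True

# Hypothetical definitions for the rules referenced by castexp.
base = {"simpleexp": [["NAME"], ["NUMBER"]], "type": [["NAME"]]}
original = dict(base, castexp=[["simpleexp", "AS", "type"], ["simpleexp"]])
factored = dict(base, castexp=[["simpleexp", "castrest"]],
                castrest=[["AS", "type"], []])  # [] stands for ε

print(disjoint_choice(original, "castexp"))   # both alternatives start with simpleexp
print(disjoint_choice(factored, "castrest"))  # AS versus ε
```

In this sketch the original choice is reported as non-disjoint, while the choice obtained after left-factoring is disjoint, which is the situation where a label on type becomes safe.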
The Algorithm Standard does not add error productions, and we think they should only be added by an expert. In case of Titan, the algorithm inserted two labels incorrectly, a problem related to Item Wrong, which made the parser reject valid inputs. Although these two labels have also been added during the manual annotation, their insertion by Algorithm Standard was undue, as we will see. This issue happened in rules toplevelvar and import. Fig. 5 shows the definition of these rules, plus some rules that help to add context, in the manually annotated Titan grammar.
Non-terminals toplevelvar, import and foreign are alternatives of a non-LL(1) choice in rule program. The parser first tries to recognize toplevelvar, then import, and finally foreign. As a decl may consist of only a name, an input like "local x =" may be the beginning of any of these rules. In rule toplevelvar, the predicate !(IMPORT / FOREIGN) was added by the Titan developers to make sure the input neither matches the import nor the foreign rule, so it is safe to throw an error after this predicate in case we do not recognize an expression. The predicate !FOREIGN in rule import plays a similar role.
As Titan developers inserted these predicates solely to enable the subsequent label annotations, we judged that we would do a fairer evaluation by removing them from our unlabeled grammar.
In rule program, although alternatives toplevelvar, import, and foreign have LOCAL in their FIRST sets, the algorithm adds labels to the right-hand side of these non-terminals, because it does not take into consideration the fact that these non-terminals appear as alternatives in a non-LL(1) choice.
The outcome is that the algorithm is able to insert the same labels added by manual annotation, but without the syntactic predicates we should not throw label AssignImport in rule toplevelvar and label ImportImport in rule import.
As Algorithm Standard inserted these labels, the resulting parser will wrongfully signal errors in valid inputs such as "local x = import "foo"".
After removing these labels, our generated Titan parser successfully passed the Titan tests. We think this was less work than manually annotating the grammar, given that the parser already needs to have an extensive test suite that will catch these errors, as was the case in our evaluation.
Lastly, Algorithm Standard correctly added two new labels. It annotated RARROW in the first alternative of rule type, and FOREIGN in rule foreign.

Automatic error recovery
The test suite of Titan has 74 tests related to syntactically invalid programs. For our evaluation of automatic error recovery, we ran the Titan parser against these files and we analyzed the AST built for each of them. Since our parser will only build an AST for a successful matching, the grammar start rule should not fail. Thus, as a special case, we should annotate the expressions of the grammar start rule which may lead to a failure. In case of Titan, we should annotate EOF and add a recovery rule that consumes the rest of the input. By doing this, we will get an AST whenever we successfully match an input prefix before matching EOF. We will use this same approach for the other languages. It is not difficult to extend Algorithm Standard with this extra case involving the start rule.
We can see in Table 2a that our recovery mechanism for Titan seems promising, since more than 80% of the recoveries were considered acceptable, i.e., they were rated at least good.
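The special treatment of the start rule can be sketched in Python as follows. This is a hand-written illustration and not the code generated by the tool: the label name ErrEOF, the function names, and the toy statement parser are all assumptions made for the example.

```python
def parse_program(tokens, parse_statement):
    """Match statements, then require EOF; if EOF fails, record a label and recover."""
    ast, errors, pos = [], [], 0
    while pos < len(tokens):
        result = parse_statement(tokens, pos)
        if result is None:          # a regular failure ends the repetition
            break
        node, pos = result
        ast.append(node)
    if pos < len(tokens):           # EOF did not match: throw the (hypothetical) label
        errors.append(("ErrEOF", pos))
        pos = len(tokens)           # recovery rule: consume the rest of the input
    return ast, errors

def parse_assignment(tokens, pos):
    """Hypothetical statement parser that accepts only: NAME '=' NUMBER."""
    chunk = tokens[pos:pos + 3]
    if (len(chunk) == 3 and chunk[0].isidentifier()
            and chunk[1] == "=" and chunk[2].isdigit()):
        return ("assign", chunk[0], int(chunk[2])), pos + 3
    return None

tokens = ["x", "=", "1", "y", "=", ",", "z", "=", "2"]
ast, errors = parse_program(tokens, parse_assignment)
print(ast)     # the AST for the prefix that matched
print(errors)  # the label recorded where EOF was expected
```

Because the start rule never fails, the caller always receives an AST for the matched prefix together with the recorded error, instead of no AST at all.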
By analyzing the programs for which our parser built a poor AST, we can see that most cases (9 out of 11) are related to missing labels. Instead of throwing such labels and recovering from them using their corresponding recovery expressions, the generated parser will produce a regular failure, which either leads to the failure of a matching or makes the parser backtrack.
As an example, let us see the case of a missing label related to rule castexp, which we have shown in Section 4.1.1. In the following input there is a missing type after the keyword "as" at line 1:

    1 x = foo as
    2 return x
The manually annotated parser would have thrown an error after "as". However, as we have discussed in Section 4.1.1, Algorithm Standard did not annotate this rule. Thus, the automatically generated parser will produce a regular failure after failing to match type after "as".
This leads the first alternative of rule castexp to fail, then the second alternative matches just the input "foo". This will lead to another failure when the parser tries to match "as" as the beginning of a statement.
As Algorithm Standard was able to insert most of the labels inserted by manual annotation, the generated Titan parser was usually able to recover from a syntactic error and to build an AST with nearly all the information about a program.

C
We have developed a parser for C, without preprocessor directives, based on the reference grammar presented by Kernighan and Ritchie [14], which is essentially a grammar for ANSI C89.
To write our unlabeled grammar for C 7 we needed to remove left-recursion, as LPegLabel does not accept grammars with left-recursive rules. After this, we got an unlabeled grammar for C with 50 syntactical rules, from which 17 have non-LL(1) choices and 5 have non-LL(1) repetitions.
Due to the typedef feature, to correctly recognize the C syntax we need the help of semantic actions to determine when a name should be considered a typedef_name. As we did not implement these semantic actions, we disabled the matching of this rule to not incorrectly recognize an identifier as a typedef_name.
The manually annotated C grammar 8 has 87 expressions that throw labels. In turn, the automatically annotated C grammar 9 we got after applying Algorithm Standard has 75 labels.
In Section 4.2.1, we compare the manually annotated C grammar with the automatically annotated one. Then, in Section 4.2.2, we will discuss the error recovering C parser we got from this automatically annotated grammar.

Automatic insertion of labels
As was the case for Titan, often the Algorithm Standard inserted labels at the same location of the original ones, as Table 1b shows. The algorithm was able to insert 75% of the labels inserted manually.
As our C grammar has many rules with non-LL(1) choices (17 out of 50), and some rules with non-LL(1) repetitions too, it was not possible to automatically add some labels in these rules.
Algorithm Standard incorrectly added one new label, in rule function_def. Fig. 6 shows the definition of this rule, plus other rules that help to add context, in the generated C grammar.
The cause of the problem related to Item Wrong in the C grammar is similar to the one discussed for the Titan grammar in Section 4.1.1. In rule external_decl, we have a non-LL(1) choice, since a decl_spec may be the beginning of a function_def as well as of a decl.
When we annotate the right-hand side of the rule associated with non-terminal function_def, which appears in the first alternative of the non-LL(1) choice in rule external_decl, we may throw a label incorrectly. In this case, given an input like "int x;", we would match "int" as a decl_spec and we would throw label ErrFuncDef after failing to recognize "x;" as a function_def. After removing label ErrFuncDef, our generated C parser successfully passed the tests.
Finally, Algorithm Standard added 9 new labels correctly, which is more than the 2 new labels added for the Titan grammar. We think this may be due to the higher rate of non-disjoint expressions in our C grammar, which may have imposed a more conservative behavior during manual annotation.
Nevertheless, the manual annotation is not free of faults. For both grammars some labels were added during manual annotation and later removed when the parser failed to recognize syntactically valid programs.

Automatic error recovery
The test suite we used for our C parser has 59 syntactically invalid programs. As we did for Titan, we ran the generated C parser against these files and we analyzed the AST built for each of them. As we discussed in Section 4.1.2, we manually added labels to the grammar start rule to assure our parser will build an AST when it successfully matches an input prefix. In the case of the C grammar, we added two labels to the right-hand side of the grammar start rule.
In Table 2b we can see that for more than 70% of the syntactically invalid programs in our test set the recovery done was considered acceptable, i.e., it was rated at least good.
Similarly to Titan (see Section 4.1.2), in most cases (12 out of 16) we can associate the building of a poor AST by our parser with the absence of a label.
As our C grammar has more non-LL(1) choices, Algorithm Standard missed more labels, which makes a proper recovery more difficult and results in more poor ASTs. As an example, let us see the case of a missing label related to an if-else statement. Fig. 7 shows the definition of such statement in rule stat of the manually annotated C grammar. Other alternatives of rule stat were omitted for simplicity.
As the choice in stat is not LL(1), Algorithm Standard will not add the 5 labels to the first alternative of this choice.
Given a program such as the following one, where there is no statement associated with the else: The generated C parser will try to recognize the first alternative of the choice in rule stat. It will fail to recognize stat after "else", which will produce a regular failure. Thus, the parser backtracks, recognizes an if-statement without an else-part, and then fails to recognize another statement, as we left "else" on the input.
As we commented in Section 4.1.1, we could rewrite this choice to factor out the common prefix. After doing this, Algorithm Standard could annotate the if-statement and we would get a better recovery in this case.
Although Algorithm Standard will not annotate LPAR in the first alternative of the choice above, this will not make error recovery worse in case of a missing "(" after "if", as long as we annotate LPAR in the second alternative. The reason for this is that after failing to match LPAR via the first alternative, the parser will backtrack and eventually try to match LPAR via the second alternative. The same rationale applies to the other labels present in the common prefix of both alternatives.

Pascal
We have developed a parser for Pascal based on the grammar available in the ISO 7185:1990 standard [15]. Our unlabeled Pascal grammar 10 has 67 syntactical rules. Among these rules, 4 of them have non-LL(1) choices, and 6 of them have non-LL(1) repetitions.
The manually annotated Pascal grammar 11 has 102 expressions that throw labels. By using Algorithm Standard, from the unlabeled Pascal grammar we got a generated grammar 12 with 104 labels. Below, Section 4.3.1 compares the manually annotated grammar with the generated one, and Section 4.3.2 discusses the error recovering Pascal parser we got from this generated grammar.

Automatic insertion of labels
As Table 1c shows, Algorithm Standard annotated the Pascal grammar in a way nearly identical to manual annotation: it inserted 98% of the labels inserted manually. We think the low number of non-LL(1) choices and non-LL(1) repetitions helped the algorithm to achieve this performance.
However, three of the labels inserted by Algorithm Standard were added incorrectly. The incorrect labels were added to rules subrangeType, assignStmt and funcCall. All these rules are referenced (directly or indirectly) in the first alternative of non-LL(1) choices, where an identifier belongs to the FIRST set of both choice alternatives. Let us discuss the problem related to assignStmt, whose definition is given in Fig. 8.
We can see in this figure that there is a non-LL(1) choice in rule simpleStmt, as ID belongs to the FIRST set of both assignStmt and procStmt. Due to this, in rule assignStmt, which appears in the first alternative of this choice, we should not annotate ASSIGN, otherwise the parser will not recognize a valid procStmt such as "f(x)", as ":=" does not follow the identifier "f".
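The effect of an incorrect label on ASSIGN can be reproduced with a few parser combinators. The Python sketch below is a toy model of PEGs with labeled failures, not LPegLabel itself: the combinator names, the label name AssignErr, and the reduced forms of assignStmt and procStmt are assumptions for illustration. A thrown label is modeled as an exception that ordered choice does not catch, so it prevents backtracking.

```python
class LabelError(Exception):
    """A thrown label: unlike a regular failure, ordered choice does not catch it."""

def tok(kind):
    def parse(ts, i):
        return i + 1 if i < len(ts) and ts[i][0] == kind else None
    return parse

def seq(*ps):
    def parse(ts, i):
        for p in ps:
            i = p(ts, i)
            if i is None:       # regular failure of the whole sequence
                return None
        return i
    return parse

def choice(*ps):
    def parse(ts, i):
        for p in ps:
            j = p(ts, i)        # each alternative restarts at position i
            if j is not None:
                return j
        return None
    return parse

def label(p, name):
    def parse(ts, i):
        j = p(ts, i)
        if j is None:
            raise LabelError(name)
        return j
    return parse

def simple_stmt(annotate_assign):
    """simpleStmt <- assignStmt / procStmt, with exp reduced to a single ID."""
    exp = tok("ID")
    assign = tok("ASSIGN")
    if annotate_assign:
        assign = label(assign, "AssignErr")  # hypothetical label name
    assign_stmt = seq(tok("ID"), assign, exp)
    proc_stmt = seq(tok("ID"), tok("LPAR"), exp, tok("RPAR"))
    return choice(assign_stmt, proc_stmt)

tokens = [("ID", "f"), ("LPAR", "("), ("ID", "x"), ("RPAR", ")")]  # f(x)
print(simple_stmt(False)(tokens, 0))  # position reached through procStmt
try:
    simple_stmt(True)(tokens, 0)
except LabelError as err:
    print("valid input rejected with label", err)
```

Without the label, the first alternative fails normally and the parser backtracks to procStmt. With the label, the failure of ASSIGN after "f" is reported as a syntax error before procStmt is ever tried.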
After removing the incorrect labels in rules subrangeType, assignStmt and funcCall, our generated Pascal parser successfully passed the tests.
Lastly, Algorithm Standard also added 2 new labels correctly.

Automatic error recovery
Our test suite for Pascal has 101 syntactically invalid programs. We can see in Table 2c that for more than 90% of the syntactically invalid programs in our test set the recovery done was considered acceptable, i.e., it was rated at least good.
Differently from the analysis we did for the Titan and the C error recovering parsers, in case of the Pascal parser we cannot associate the poor ASTs with the absence of labels. A manual inspection indicates that most of the poor ASTs built were due to synchronizing the input too early (instead of discarding one more token). This issue may be fixed by adjusting the recovery expression used. Our approach allows this tuning to be done manually for a given recovery expression.
Overall, a recovery strategy may show a better performance after it is tuned to match features of a given language.

Java
We have developed a parser for Java 8 following the parser available at the Mouse site. 13 Our unlabeled Java grammar 14 has 147 syntactical rules, where there are 35 rules with a non-LL(1) choice and 15 rules with a non-LL(1) repetition. A rule may have a non-LL(1) choice and also a non-LL(1) repetition, but this occurs in only 2 rules. Overall, one third of the grammar rules has an LL(1) conflict. The manually annotated Java grammar 15 has 175 expressions that throw labels.
From the unlabeled Java grammar, we used Algorithm Standard to get a generated grammar 16 with 181 labels. In Section 4.4.1 we compare the manually annotated grammar with the generated one, and in Section 4.4.2 we discuss our error recovering parser for Java.

Automatic insertion of labels
We can see in Table 1d that Algorithm Standard annotated the Java grammar with 181 labels, from which 139 were also inserted during the manual annotation. This seems a good amount, given that many rules of the grammar have an LL(1) conflict.
The LL(1) conflicts also make it difficult to add labels correctly. As a consequence, a significant part (18%) of the labels added by Algorithm Standard were inserted incorrectly. The cases where these labels were inserted are similar to the cases of incorrect labels we have already discussed for the other languages, so we will not present them here.
The significant number of incorrect labels somewhat limits the usefulness of Algorithm Standard to annotate our unlabeled Java grammar, since it is necessary to manually remove several labels later. Although this removal is not hard, the usual process requires running the tests once for each incorrect label, and then removing such label after failing to pass the tests.
Finally, Algorithm Standard also correctly added 10 new labels.

Automatic error recovery
Our test suite for Java has 175 syntactically invalid programs. Table 2d shows that for almost 80% of these programs the recovery done was considered acceptable, i.e., it was rated at least good.
About half of the cases where our generated parser built a poor AST are related to a missing label. We could get a better result in these cases by rewriting non-disjoint choices, as we have shown for Titan and C, so Algorithm Standard could insert more labels and their corresponding recovery rules.
In roughly the other half of the cases, we got a poor AST because of an intersection between the tokens that could follow a symbol in the right-hand side of a rule A and the tokens that could follow A itself. To improve these ASTs we usually need either to manually add labels to the grammar or to manually tune the recovery rules.

Conservative insertion of labels
As we have discussed previously, Algorithm Standard annotates a grammar with labels, but it may add labels incorrectly, which leads to a parser that rejects some valid inputs. To avoid this shortcoming, we will discuss conservative approaches, which address the problem related to Item Wrong.
Fig. 9. Label DotDotErr incorrectly added in rule subrangeType.

Non-terminals banning
Our first approach to not insert labels incorrectly is based on the idea of banning a non-terminal A that is used in a non-disjoint choice or a non-disjoint repetition. When A is banned, we do not annotate its right-hand side. To properly avoid the wrong insertion of labels, this approach should be recursive, i.e., when banning A we should also ban the non-terminals in the right-hand side of A.
To illustrate this point, let us consider Fig. 9, which shows an excerpt from the Pascal grammar. In rule ordinalType there is a non-disjoint choice, where ID belongs to the FIRST set of both alternatives of the choice. Because of this, we should ban the non-terminal newOrdinalType, so we will not annotate its right-hand side.
In case the banning process is not recursive, as in rule newOrdinalType there is no conflict, we will not ban the non-terminals in its right-hand side. This approach leads to incorrectly adding label DotDotErr in rule subrangeType. We should not throw DotDotErr because in rule ordinalType, when matching the first alternative of newOrdinalType / ID, the parser could recognize an ID as the beginning of a subrangeType, then fail to recognize DOTDOT, backtrack and finally match the second alternative. Thus, we should apply a recursive banning approach to avoid adding labels incorrectly.
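The difference between non-recursive and recursive banning amounts to a transitive closure over the non-terminals reachable from a banned one. The Python sketch below illustrates this under loud assumptions: the rule bodies are a hypothetical flattening loosely based on the description of Fig. 9 (only the symbols used in each right-hand side are listed), and the function name ban_recursively is ours.

```python
def ban_recursively(rules, seeds):
    """Ban each seed non-terminal and every non-terminal reachable from it."""
    banned, stack = set(), list(seeds)
    while stack:
        nt = stack.pop()
        if nt in banned or nt not in rules:  # tokens have no rule and are skipped
            continue
        banned.add(nt)
        stack.extend(rules[nt])              # also ban what nt refers to
    return banned

# Hypothetical excerpt: each rule maps to the symbols used in its right-hand side.
rules = {
    "ordinalType": ["newOrdinalType", "ID"],
    "newOrdinalType": ["enumType", "subrangeType"],
    "enumType": ["LPAR", "ID", "RPAR"],
    "subrangeType": ["const", "DOTDOT", "const"],
    "const": ["NUMBER", "ID"],
}

# newOrdinalType is the first alternative of the non-disjoint choice in ordinalType.
non_recursive = {"newOrdinalType"}
recursive = ban_recursively(rules, ["newOrdinalType"])

print("subrangeType" in non_recursive)  # not banned, so DotDotErr would be added
print(sorted(recursive))                # subrangeType is banned as well
```

With the recursive version, subrangeType is banned together with newOrdinalType, so no label is added after the ID that may start either alternative of ordinalType.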
Applying such an approach leads to the insertion of few labels, or even none. When there are conflicts in the top-level grammar rules, the recursive banning strategy bans almost all non-terminals. For the C and Java grammars, after banning the non-terminals related to a non-disjoint choice or repetition, we could not add a single label. In case of Titan, we could add 12 labels, while for Pascal, which has few non-disjointness conflicts, we had the best result and could add 36 labels, which corresponds to 35% of the labels we have inserted manually.
Although the recursive banning approach has added only correct labels, its usefulness seems quite limited. Therefore we will use this strategy only as a complementary one. Below, we discuss a more effective approach, based on the idea of unique non-terminals, to conservatively insert only correct labels.

Unique non-terminals
In Section 3 we saw that the main challenge when adding labels is to determine statically when failing to match an expression p indicates that the parser has no other viable option to recognize the input.
In order to identify these safe places where we can insert labels, we will introduce the concept of unique lexical non-terminals. The following definition says that a lexical non-terminal A is unique when it appears in the right-hand side of only one syntactical rule, and just once:
Definition (Unique lexical non-terminal). Given a PEG $G = (V_{Lex} \cup V_{Syn}, T, P, L, R, \mathit{fail}, p_S)$, $A \in V_{Lex}$ is unique iff $\exists B \in V_{Syn}$ such that $A$ is used only once in $P(B)$ and $\forall C \in V_{Syn}$, where $C \neq B$, we have that $A$ is not used in $P(C)$.
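The definition above can be checked mechanically by counting usages. The following Python sketch assumes a simplified representation in which each syntactical rule is flattened to the list of symbols used in its right-hand side. The two rules shown are a hypothetical flattening of the Titan rules import and foreign, and the function name unique_lexical is ours.

```python
from collections import Counter

def unique_lexical(rules, lexical):
    """Lexical non-terminals used exactly once across all syntactical rules."""
    uses = Counter(sym for rhs in rules.values() for sym in rhs if sym in lexical)
    return {a for a in lexical if uses[a] == 1}

# Hypothetical flattened right-hand sides of two syntactical rules.
rules = {
    "import": ["LOCAL", "NAME", "ASSIGN", "IMPORT", "STRING"],
    "foreign": ["LOCAL", "NAME", "ASSIGN", "FOREIGN", "IMPORT", "STRING"],
}
lexical = {"LOCAL", "NAME", "ASSIGN", "IMPORT", "FOREIGN", "STRING"}

print(unique_lexical(rules, lexical))  # only FOREIGN is used exactly once
```

A total count of one is equivalent to the two conditions of the definition: one usage in some rule B and no usage in any other rule C.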
When we have a grammar G with the unique token prefix property, and A is a unique lexical non-terminal of G, once A has matched, failing to match the expression that follows A leads to the failure of the whole matching, as the following lemma states:

Lemma 3. Let $G$ be an unlabeled PEG with the unique token prefix property, and let $w$ be a subject. Let $A\,p_2$ be a subexpression of $P(B)$, where $A$ is a unique lexical non-terminal and $B \in V_{Syn}$, and let $axy$ be a suffix of $w$. If $G[A]\ R\ axy \stackrel{PEG}{\leadsto} y$ and $G[p$ …
Proof. The proof uses Lemma 2 and the fact that G has the unique token prefix property and A is a unique lexical non-terminal.
When the matching of $p_2$ fails, either we backtrack to a previous choice and try to match a different alternative, or we do not backtrack.
In the former case, by Lemma 2 we know that after backtracking the grammar will match the same sequence of tokens, thus we will need to match axy again. As G has the unique token prefix property, only A matches axy, and given that A is a unique lexical non-terminal, A is not used anywhere else in G. Therefore, once more A would match prefix ax and $p_2$ would fail to match y, leading to the failure of the whole matching.
The proof of the latter case, when there is no backtracking, is straightforward given the previous discussion. □
As a result of Lemma 3, we know that after matching a unique lexical non-terminal A we start a kind of unique path, and failing to match an expression that follows A indicates that the input is invalid. Therefore, we can safely annotate the expression $p_2$ that follows A.
Based on this, we present Algorithm Unique, which automatically annotates a PEG $G = (V, T, P, L, R, \mathit{fail}, p_S)$. In comparison with Algorithm Standard, function labexp now receives an extra parameter, afterU, which indicates whether we have already matched a unique lexical non-terminal, and function matchUni, which determines whether a parsing expression p matches at least one unique lexical non-terminal, is new. Below we discuss these functions in more detail. Functions annotate, calck and addlab remain the same and their definitions were omitted. 17 We assume the unique lexical non-terminals have already been computed. Given a non-terminal A, function isUniLex returns true in case A is a unique lexical non-terminal, and false otherwise.
    11:         p_y ← labexp(p_2, false, afterU, flw)
    12:         if seq and ε ∉ FIRST(p_1 / p_2) and afterU then
    13:             return addlab(p_x / p_y, flw)
    14:         else
    15:             return p_x / p_y
    16:     else if p = p_1* then
    17:
When p is a non-terminal that does not match the empty string and both seq and afterU are true (lines 2-3), then we associate a new label with p. When p is a non-terminal but these conditions do not hold, we will just return p itself (lines 19-20).

Algorithm Unique
In case of a concatenation $p_1\,p_2$ (lines 4-7), the main difference to Algorithm Standard is the handling of parameter afterU when annotating $p_2$ (line 6). In this case, we supply a true value for afterU when it is already true or when $p_1$ consumes at least one unique lexical non-terminal.
When p is a choice $p_1 / p_2$ (lines 8-15), a main difference to Algorithm Standard is that we call labexp recursively even when the choice is not disjoint. In this case, we set afterU to false when annotating $p_1$ (line 10). The rationale is that it is not safe to throw a label after failing to match $p_1$ in such a case, since the parser can still backtrack and consume the input via $p_2$. We will only add labels to $p_1$ in case an expression of $p_1$ matches a unique lexical non-terminal. When annotating $p_2$, we pass the current value of afterU, since there is no other alternative and thus it is safe to annotate $p_2$.
Footnote 17: Actually, now function annotate provides a false value to afterU when calling labexp.
In case p is a repetition $p_1^*$ (lines 16-18), differently from Algorithm Standard and similarly to the case we discussed before, we also call labexp recursively when the repetition is not disjoint, providing a false value in this case.
After applying Algorithm Unique, we could see we added, as expected, only correct labels to the grammars we have been discussing so far. In case of Titan, for example, we added 42 labels, while in case of Java we added 51 labels. To increase the number of labels inserted, we will do some extra analysis to determine whether, when matching a given expression p, we are in a unique path (and thus we can annotate p) or not.
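The handling of afterU in the symbol, concatenation and choice cases described above can be sketched in Python. This is a deliberately reduced model and not the algorithm as published: every symbol is assumed to consume input, so FIRST of a symbol is the symbol itself; FOLLOW sets, calck, recovery expressions and repetitions are left out; and treating a choice as matching a unique non-terminal only when both alternatives do is our assumption. Expressions are tuples, and ("lab", p) marks an expression that received a label.

```python
def first(p):
    """FIRST set in the reduced model, where no symbol matches the empty string."""
    if p[0] == "sym":
        return {p[1]}
    if p[0] == "seq":
        return first(p[1])
    return first(p[1]) | first(p[2])      # choice

def match_uni(p, unique):
    """Does p always match at least one unique lexical non-terminal?"""
    if p[0] == "sym":
        return p[1] in unique
    if p[0] == "seq":
        return match_uni(p[1], unique) or match_uni(p[2], unique)
    return match_uni(p[1], unique) and match_uni(p[2], unique)  # assumption

def labexp(p, seq, after_u, unique):
    if p[0] == "sym":
        # Label a symbol only inside a sequence and on a unique path.
        return ("lab", p) if seq and after_u else p
    if p[0] == "seq":
        px = labexp(p[1], seq, after_u, unique)
        # p2 is on a unique path if we already were, or if p1 matched a unique token.
        py = labexp(p[2], True, after_u or match_uni(p[1], unique), unique)
        return ("seq", px, py)
    # Choice: a non-disjoint first alternative restarts with after_u = False.
    disjoint = not (first(p[1]) & first(p[2]))
    px = labexp(p[1], False, after_u and disjoint, unique)
    py = labexp(p[2], False, after_u, unique)
    q = ("choice", px, py)
    return ("lab", q) if seq and after_u else q

def sym(name):
    return ("sym", name)

# castexp <- simpleexp AS type / simpleexp, with AS as the only unique symbol.
castexp = ("choice",
           ("seq", sym("simpleexp"), ("seq", sym("AS"), sym("type"))),
           sym("simpleexp"))
annotated = labexp(castexp, False, False, {"AS"})
print(annotated)
```

In this reduced model only type, the expression that follows the unique symbol AS, receives a label, even though the choice is not disjoint.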
Below, we discuss some analyses we did to compute this unique path. When evaluating Algorithm Unique, in Section 6, we assume this extra analysis was performed:
• Unique Syntactical Non-Terminal: When a syntactical non-terminal A is only used after we have already matched a unique lexical non-terminal, then we can also mark A as unique and annotate its right-hand side. Both lexical and syntactical non-terminals can be marked as unique now; the main difference is that in case of a unique syntactical non-terminal this implies providing a true value for parameter afterU when calling labexp to annotate the right-hand side of A.
• Unique Context: If the lexical non-terminal A is used more than once in grammar G but the set S of tokens that may occur immediately before a usage of A is unique, i.e., ∀s ∈ S we have that s may not occur immediately before the other usages of A, then we can mark this instance of A preceded by S as unique.
In the next section, we compare the number of labels inserted by Algorithm Unique with the number of labels inserted via manual annotation and by using Algorithm Standard, as well as the resulting error recovering parsers obtained via each approach. Table 3 shows the number of labels inserted for the Titan, C, Pascal and Java grammars when we used an automatic approach and when we used manual annotation.
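The Unique Context analysis can be illustrated with a small Python sketch. It assumes a simplified setting where each right-hand side is a flat sequence of tokens, so the set S of preceding tokens for a usage is a single token. The rule bodies are a hypothetical flattening, and the function name unique_context_usages is ours.

```python
from collections import Counter

def unique_context_usages(rules, target):
    """Usages (rule, index) of target whose preceding token precedes no other usage."""
    before = {}  # (rule, index) -> token immediately before this usage of target
    for name, rhs in rules.items():
        for i, symbol in enumerate(rhs):
            if symbol == target and i > 0:
                before[(name, i)] = rhs[i - 1]
    counts = Counter(before.values())
    return {use for use, prev in before.items() if counts[prev] == 1}

# Hypothetical flattened rules: IMPORT is used twice, after different tokens.
rules = {
    "import": ["LOCAL", "NAME", "ASSIGN", "IMPORT", "STRING"],
    "foreign": ["LOCAL", "NAME", "ASSIGN", "FOREIGN", "IMPORT", "STRING"],
}
print(unique_context_usages(rules, "IMPORT"))

# Two usages preceded by the same token share their context, so none is unique.
same_context = {"a": ["LPAR", "IF"], "b": ["LPAR", "IF"]}
print(unique_context_usages(same_context, "IF"))
```

In the first grammar each usage of IMPORT has its own preceding token, so both usages can be marked as unique. In the second grammar the contexts coincide and no usage qualifies.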

Evaluating the conservative insertion of labels
We can see that the manual approach is the one that adds more labels for all the grammars we evaluated, then comes Algorithm Standard, and finally Algorithm Unique, which, as expected, did not insert labels incorrectly.
Overall, the number of labels added by Algorithm Unique, when compared with manual annotation, ranged from 55%, in case of Java, to 78%, in case of Pascal. In turn, when compared with Algorithm Standard, Algorithm Unique was able to insert between 64%, in case of Java, and 81%, in case of Titan, of the labels inserted by it.
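The extremes of these ranges can be reproduced from the label counts reported in the text. The Python sketch below relies on one assumption that is not stated explicitly: the comparison with Algorithm Standard counts only its correctly inserted labels, i.e., the total minus the labels reported as incorrect. For Pascal, the count for Algorithm Unique is the 80 labels of the parser that also includes the labels from the banning approach.

```python
# Label counts as reported in the text for each grammar.
manual = {"Titan": 86, "C": 87, "Pascal": 102, "Java": 175}
standard = {"Titan": 80, "C": 75, "Pascal": 104, "Java": 181}
standard_incorrect = {"Titan": 2, "C": 1, "Pascal": 3, "Java": 32}
unique = {"Titan": 63, "C": 50, "Pascal": 80, "Java": 96}

def pct(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)

vs_manual = {lang: pct(unique[lang], manual[lang]) for lang in manual}
# Assumption: only the correct labels of Algorithm Standard are counted.
vs_standard = {
    lang: pct(unique[lang], standard[lang] - standard_incorrect[lang])
    for lang in manual
}

print(min(vs_manual.values()), max(vs_manual.values()))      # Java and Pascal
print(min(vs_standard.values()), max(vs_standard.values()))  # Java and Titan
```

Under that assumption, the computed minimum and maximum agree with the ranges given above for both comparisons.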
In Table 4, we can see that the smaller number of labels, and thus of recovery rules, inserted by Algorithm Unique leads to a parser that performs a poorer recovery when compared to the error recovering parsers based on manual annotation and on Algorithm Standard. In the best scenario, the Pascal grammar, Algorithm Unique gives us a parser that usually (in 84% of the cases) performs an acceptable recovery when the other two approaches do. In the worst scenario, the Java grammar, the parser produced by Algorithm Unique only performs an acceptable recovery in around half of the cases the other approaches do.
Below, in Sections 6.1, 6.2, 6.3 and 6.4, we discuss in more detail the use of Algorithm Unique to annotate the grammar of each language.

Titan
As we have mentioned in Section 4.1, the Titan grammar has 7 rules with non-LL(1) choices and 3 rules with non-LL(1) repetitions. After applying Algorithm Unique, we got a generated grammar 18 with 63 labels (around 75% of the labels added by manual annotation).
Algorithm Unique initially identified 44 unique lexical elements in the Titan grammar. Since Algorithm Unique can annotate the first alternative of a non-disjoint choice when this alternative has a unique non-terminal that consumes input, we could add label CastMissingType in the rule below, where AS is a unique lexical non-terminal:
Fig. 10 shows an excerpt of the Titan grammar that we discussed in Section 4.1, without the predicates added by manual annotation. Non-terminal FOREIGN is unique, thus we can annotate the symbols that follow it. In Fig. 10, we represented as $l_1$ the labels added due to the uniqueness of FOREIGN.
To get a successful match, the start non-terminal must succeed, so we mark program as unique and thus we can annotate its right-hand side. In case of a repetition $p_1^*$, expression $p_1$ should match at least one token before we can annotate it. As $p_1$ is a choice, where the same rationale for $p_1^*$ applies, we cannot add labels to the alternatives in rule program.
The syntactical non-terminals toplevelvar, import and foreign are only used in rule program, so we could also mark them as unique and annotate their right-hand side. However, we will only annotate the right-hand side of foreign, because it is the last alternative of the non-disjoint choice in rule program involving these non-terminals. By marking foreign as unique, we can add the labels represented as $l_2$ in Fig. 10.
Finally, we can see in Fig. 10 that the lexical non-terminal IMPORT is used twice, so it is not unique. However, each use of IMPORT is preceded by a different context. In rule import, ASSIGN comes before IMPORT, while in rule foreign it is FOREIGN that precedes IMPORT. As we have different contexts, in rule import we can add the labels represented as $l_3$.
The labels represented as lab were added by Algorithm Standard but not by Algorithm Unique. As we discussed in Section 4.1, labels ExpVarDec, in rule toplevelvar, and ImportImport, in rule import, should not have been added by Algorithm Standard. As expected, Algorithm Unique did not insert these labels incorrectly. We should notice that non-terminal COLON is used in other rules of the grammar, which were omitted here, so it is not unique. The error recovering parser generated by Algorithm Unique did an acceptable recovery for 75% of the test programs, while by manually annotating the grammar we could get an acceptable recovery for 95% of them.

C
Our unlabeled C grammar has 50 syntactical rules, from which 17 have non-LL(1) choices and 5 have non-LL(1) repetitions. Algorithm Unique was able to generate an error recovering parser 19 with 50 labels.
In case of our C grammar, Algorithm Unique added only 58% of the number of labels inserted manually, while for Titan it could add 73% of this number. The higher occurrence of non-disjoint expressions in the C grammar makes it more difficult to mark symbols as unique and also to propagate a unique path after we have seen a unique symbol. In such grammars, to insert more labels it seems we need to do a more sophisticated analysis when computing the unique non-terminals.
Below, we revisit the if-else statement presented in Fig. 7 and discuss an extra analysis we did to mark a usage of IF as unique, which helped us to reach 50 labels. In Fig. 11, we used lab to represent a label added by manual annotation but not by Algorithm Unique: Initially, the only unique non-terminal is ELSE, which allows us to add just label Stat 1 . Non-terminal IF was not considered unique at first because it is used twice, and both uses are preceded by the same context. To be able to annotate the second alternative of a non-disjoint choice such as this, we check if the two usages of a non-terminal A with a context in common occur in the same right-hand side. If this is the case, we mark the last usage as unique. After doing this, we could add the labels represented as $l_2$ in Fig. 11.
We can see in Table 4b that the parser generated by Algorithm Unique performed an acceptable recovery for 58% of the test files, while by manually annotating the grammar we got a 94% rate of acceptable recovery for the same test files.

Pascal
As mentioned in Section 4.3, only 10 syntactical rules, out of 67, of the Pascal grammar have either a non-disjoint choice or a non-disjoint repetition. Because of this low number of non-disjoint expressions, the recursive banning approach discussed in Section 5.1 can annotate the Pascal grammar with 36 labels. In turn, Algorithm Unique was able to add 72 labels.
As there are eight labels which were only added by the banning approach, in the case of Pascal we automatically generated an error recovering parser 20 which joins the labels added by these two approaches and thus has 80 labels.
We can see in Table 3c that Algorithm Unique only inserted correct labels and was able to insert around 80% of the labels added manually. In Table 4c we can see that the resulting error recovering parser performs an acceptable recovery for 76% of the test files, while the parsers based on the other approaches perform such recovery for 91% of the test files.
A manual inspection revealed that the parser generated by Algorithm Unique built an AST with less information than the parser generated by Algorithm Standard for the files related to a label inserted only by Algorithm Standard, which shows we got a poorer recovery in these cases due to the missing labels.

Java
In the case of our unlabeled Java grammar, where one third of the 147 grammar rules contain a non-disjoint expression, Algorithm Unique generated a grammar 21 with 96 labels.
As was the case for our C grammar (Section 6.2), the higher occurrence of non-disjoint expressions in the Java grammar makes it more difficult to annotate. In the case of Algorithm Standard, this led to 32 labels being added incorrectly, while in the case of the recursive banning approach discussed in Section 5.1 it resulted in not a single label being added to the Java grammar. As Table 3d shows, Algorithm Unique was able to add 55% of the number of labels added manually, without inserting labels incorrectly.
From Table 4d, we can see that the error recovering parser generated by Algorithm Unique only performed an acceptable recovery for 40% of the test files. This result was somewhat expected, since the algorithm failed to add many labels that were inserted during the manual annotation.

Related work
In this section, we discuss some error reporting and recovery approaches described in the literature or implemented by parser generators. Overall, a distinctive feature of our approach is that our error recovery mechanism is integrated with the recognizing formalism (PEGs, in our case).
Swierstra [16] shows a sophisticated implementation of parser combinators for error recovery. The recovery strategy uses information about the tails of the pending rules in the parser stack. When the parser fails to match a given symbol, it may insert this symbol or remove the current input symbol.
Our approach cannot simulate this recovery strategy, as it relies on the path that the parser dynamically took to reach the point of the error, while our recovery expressions are statically determined from the label. In Swierstra's approach, if the right-hand sides of the rules are not in some normal form, the parser may have a high memory consumption.
A popular error reporting approach applied to bottom-up parsing is based on associating an error message with a parse state and a lookahead token [17]. To determine the error associated with a parse state, it is first necessary to manually provide a sequence of tokens that leads the parser to that failure state. We can simulate this technique with the use of labels. By using labels we do not need to provide a sample invalid program for each label, but we need to annotate the grammar properly.
The error recovery approach for predictive top-down parsers proposed by Wirth [18] was a major influence for several tools. In Wirth's approach, when there is an error during the matching of a non-terminal A, we try to synchronize by using the symbols that can follow A plus the symbols that can follow any non-terminal B that we are currently trying to match (the procedure associated with B is on the stack). Moreover, the tokens which indicate the beginning of a structured element (e.g., while, if) or the beginning of a declaration (e.g., var, function) are used to synchronize with the input.
Our approach can simulate this recovery strategy only partially, because, similarly to [19], it relies on information that is available only during parsing. We can define a recovery expression for a non-terminal A according to Wirth's idea; however, as we do not know statically what the stack will look like when trying to match A, the recovery expression of A would use the FOLLOW sets of all non-terminals whose right-hand sides contain A and that could possibly be on the stack.
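A minimal Python sketch of this static approximation follows. It assumes that the FOLLOW sets are already computed and that each rule is summarized by the set of non-terminals in its right-hand side; the function name static_sync_set and the three-rule toy grammar are hypothetical and serve only as an illustration.

```python
def static_sync_set(target, rhs_nonterminals, follow):
    """Approximate, statically, the tokens Wirth's strategy could sync on.

    rhs_nonterminals: dict mapping each non-terminal to the set of
        non-terminals that occur in its right-hand side (assumed input).
    follow: dict mapping each non-terminal to its FOLLOW set (assumed input).
    """
    # Collect every non-terminal that may be on the stack while `target`
    # is being matched, that is, every direct or indirect user of `target`.
    users = set()
    work = [target]
    while work:
        current = work.pop()
        for nt, rhs in rhs_nonterminals.items():
            if current in rhs and nt not in users:
                users.add(nt)
                work.append(nt)

    # Without the run-time stack we must take the union over all users.
    sync = set(follow[target])
    for nt in users:
        sync |= follow[nt]
    return sync

# Toy grammar: prog uses stmt, stmt uses exp.
rhs_nonterminals = {"prog": {"stmt"}, "stmt": {"exp"}, "exp": set()}
follow = {"prog": {"EOF"}, "stmt": {";"}, "exp": {")", ";"}}
exp_sync = static_sync_set("exp", rhs_nonterminals, follow)
```

The resulting set is an over-approximation of what a parser with access to the actual stack would use, which is why the simulation is only partial.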
Coco/R [20] is a tool that generates predictive LL(k) parsers. As the parsers based on Coco/R do not backtrack, an error is signaled whenever a failure occurs. In case of PEGs, as a failure may not indicate an error, but the need to backtrack, in our approach we need to annotate a grammar with labels, a task we tried to make more automatic.
In Coco/R, in case of an error the parser reports it and continues until reaching a synchronization point, which can be specified in the grammar by the user through the keyword SYNC. Usually, the beginning of a statement or a semicolon is a good synchronization point. Another complementary mechanism used by Coco/R for error recovery is weak tokens, which can be defined by a user through the WEAK keyword. A weak token is one that is often mistyped or missing, such as a comma in a parameter list, which is frequently mistyped as a semicolon. When the parser fails to recognize a weak token, it tries to resume parsing based also on tokens that can follow the weak one.
Labeled failures plus recovery expressions can simulate the SYNC and WEAK keywords of Coco/R. Each use of the SYNC keyword would correspond to a recovery expression that advances the input to that point, and this recovery expression would be used for all labels in the parsing extent of this synchronization point. A weak token can have a recovery expression that also tries to synchronize on its FOLLOW set. Coco/R avoids spurious error messages during synchronization by only reporting an error if at least two tokens have been recognized correctly since the last error. This is easily done in labeled PEG parsers through a separate post-processing step.
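Such a post-processing step could look like the following Python sketch. It assumes that the parser records each error as a pair formed by the number of tokens recognized so far and the label thrown; this record format, the function name filter_spurious, and the sample labels are assumptions made for illustration.

```python
def filter_spurious(errors, min_gap=2):
    """Drop errors raised too soon after the previous one.

    errors: list of (tokens_recognized_so_far, label) pairs, in the order
        in which the parser recorded them (assumed format).
    min_gap: minimum number of tokens that must have been recognized
        since the last error for a new error to be reported.
    """
    reported = []
    last = None
    for position, label in errors:
        if last is None or position - last >= min_gap:
            reported.append((position, label))
        # The gap is measured from the last error detected,
        # whether or not that error was reported.
        last = position
    return reported

# Hypothetical error log: the second error comes one token after the first.
recorded = [(3, "ExpVarDec"), (4, "Semicolon"), (9, "Stat")]
reported = filter_spurious(recorded)
```

Here the error at position 4 is suppressed because only one token was recognized since the error at position 3, while the error at position 9 is reported.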
ANTLR [21,22] is a popular tool for generating top-down parsers. ANTLR automatically generates from a grammar description a parser with error reporting and recovery mechanisms, so the user does not need to annotate the grammar. After an error, ANTLR parses the entire input again to determine the error, which can lead to poor performance when compared to our approach [4].
As its default recovery strategy, ANTLR attempts single token insertion and deletion to synchronize with the input. In case the remaining input cannot be matched by any production of the current non-terminal, the parser consumes the input "until it finds a token that could reasonably follow the current non-terminal" [23]. ANTLR allows the default error recovery approach to be modified. However, it does not seem to encourage the definition of a recovery strategy for a particular error, and the same recovery approach is commonly used for the whole grammar.
A common way to implement error recovery in PEG parsers is to add an alternative to a failing expression, where this new alternative works as a fallback. Semantic actions are used for logging the error. This strategy is mentioned in the manual of Mouse [24] and also by users of LPeg. 22 These fallback expressions with semantic actions for error logging are similar to our recovery expressions and labels, but in an ad-hoc, implementation-specific way.
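The fallback technique can be illustrated with a small hand-written PEG-style recognizer in Python. This is a generic sketch, not the API of Mouse or LPeg: the combinators token, seq, choice and fallback, as well as the toy rule for assignments, are hypothetical names introduced here.

```python
import re

def token(pattern):
    """Match `pattern` after optional whitespace; return new position or None."""
    rx = re.compile(r"\s*(?:" + pattern + r")")
    def parse(text, pos):
        m = rx.match(text, pos)
        return m.end() if m else None
    return parse

def seq(*parsers):
    """Match all parsers in order; fail as soon as one of them fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    """Ordered choice: return the result of the first alternative that matches."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def fallback(message, sync_pattern, errors):
    """Alternative that never fails: log the error, then skip to a sync point."""
    rx = re.compile(sync_pattern)
    def parse(text, pos):
        errors.append((pos, message))      # plays the role of the semantic action
        m = rx.search(text, pos)
        return m.start() if m else len(text)
    return parse

errors = []
# assign <- NAME '=' (NUMBER / fallback) ';'
assign = seq(
    token(r"[a-z]+"),
    token(r"="),
    choice(token(r"\d+"), fallback("expected a number", r";", errors)),
    token(r";"),
)
end = assign("x = ;", 0)
```

In this sketch the missing number is logged and parsing resumes at the semicolon, so the rule still consumes the whole input. The logging and the skipping are both written by hand inside the grammar, which is the ad-hoc aspect that labels and recovery expressions aim to factor out.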
Several PEG implementations, such as Parboiled, 23 Tatsu, 24 and PEGTL, provide features that facilitate error recovery. The previous version of Parboiled used an error recovery strategy based on ANTLR's, and required parsing the input two or three times in case of an error. Similar to ANTLR, the strategy used by Parboiled was fully automated, and required neither manual intervention nor annotations in the grammar. Unlike ANTLR, it was not possible to modify the default error strategy. The current version of Parboiled 26 does not have an error recovery mechanism.
Tatsu uses the fallback alternative technique for error recovery, with the addition of a skip expression, which is syntactic sugar for defining a pattern that consumes the input until the skip expression succeeds. PEGTL (version 3) makes a distinction between a local failure (a regular failure) and a global failure, which is equivalent to throwing a label. The PEGTL user can use a function raise to produce a global failure, which is similar to annotating a grammar with labels.
Mizushima et al. [25] proposed the use of a cut operator, borrowed from Prolog, to avoid unnecessary backtracking in PEG parsers, and proposed an automatic way to insert this operator in a grammar. Unlike the throw operator, which leads to a global failure when there is no recovery rule, the cut operator just discards the next alternative of the current choice, which makes it difficult to use cut operators to signal an error, as the parser can still backtrack. The algorithm proposed by [25] to insert the cut operator is similar to Algorithm Standard. However, the former algorithm seems to do fewer insertions, as it does not annotate the second alternative of a choice, since there is no local backtracking to discard in this case.
Rüfenacht [26] proposes a local error handling strategy for PEGs. This strategy uses the farthest failure position and a record of the parser state to identify an error. Based on the information about an error, an appropriate recovery set is used. This set is formed by parsing expressions that match the input at or after the error location, and it is used to determine how to repair the input.
The approach proposed by Rüfenacht is also similar to the use of a recovery expression after an error, but more limited in the kind of recovery that it can do. When testing his approach in the context of a JSON grammar, which is simpler than the grammars we analyzed, Rüfenacht noticed long running test cases and mentioned the need to improve memory use and other performance issues.
The evaluation of our error recovery technique was based on Pennello and DeRemer's [27] strategy, which evaluates the quality of an error recovery approach based on the similarity of the program obtained after recovery with the intended program (without syntax errors). This quality measure was used to evaluate several strategies [28][29][30], although it is arguably subjective [30].
Unlike Pennello and DeRemer's approach, we did not compare program texts. Instead, we compared the AST from an erroneous program after recovery with the AST of what would be an equivalent correct program.

Conclusion
We proposed algorithms to automate the process of adding error reporting and error recovery to parsers based on Parsing Expression Grammars. These algorithms annotate a PEG with error labels and associate recovery expressions with these labels.
We evaluated these algorithms on the grammars of four programming languages: Titan, C, Pascal and Java. For all these languages, we built a test suite for both valid and erroneous input.
Algorithm Standard could add to these grammars at least 75% of the labels added manually. The error recovering parser we got produced an acceptable recovery for at least 70% of the syntactically invalid files of each language.
The major limitation of Algorithm Standard is that it can annotate the right-hand side of a non-terminal A that is used either in a non-LL(1) choice or in a non-LL(1) repetition. This may prevent the parser from backtracking and recognizing a valid input, thus changing the language defined by the grammar.
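The following Python sketch illustrates this problem on a toy non-LL(1) choice, S <- 'a' 'b' / 'a' 'c'. The combinators and the label name ExpB are hypothetical; the sketch only assumes the usual semantics of labeled failures, in which a thrown label is not caught by the ordered choice.

```python
class LabeledFailure(Exception):
    """Raised when a labeled expression fails; ordered choice does not catch it."""
    def __init__(self, label, pos):
        super().__init__(f"{label} at position {pos}")
        self.label = label
        self.pos = pos

def lit(s):
    """Match the literal string `s`; return the new position or None."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)   # a regular failure (None) triggers backtracking
            if result is not None:
                return result
        return None
    return parse

def labeled(p, label):
    """Turn a regular failure of `p` into a thrown label."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            raise LabeledFailure(label, pos)
        return result
    return parse

# Unlabeled grammar: S <- 'a' 'b' / 'a' 'c'
plain = choice(seq(lit("a"), lit("b")), seq(lit("a"), lit("c")))
# Incorrectly annotated grammar: the label on 'b' blocks the second alternative.
annotated = choice(seq(lit("a"), labeled(lit("b"), "ExpB")),
                   seq(lit("a"), lit("c")))
```

With these definitions, plain accepts both "ab" and "ac", while annotated throws ExpB on the valid input "ac" instead of trying the second alternative.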
To address this issue, we proposed Algorithm Unique, which uses a more conservative approach, based on the idea of unique non-terminals. By using it, we inserted only correct labels and got an acceptable recovery rate that ranged from 41% to 76%.
We have also discussed how the rewriting of some grammar rules could lead both algorithms to produce a better result. The automatic insertion of labels provides a good generic error reporting mechanism. To get more specific error messages, the parser developer just needs to associate an error message with each inserted label.
It is easy to adapt our algorithms to use a different error recovery strategy, which can also be defined after inserting the labels. It is also possible to adapt them to work on grammars that have already been partially annotated, either with just labels or with labels and recovery expressions, and to let the parser developer mark the parts of the grammar that the algorithm should ignore and that will be annotated by hand.
To generate a more robust error recovering parser, the approach based on unique tokens should insert more labels. One way to achieve this is by using the derivative of a PEG [31,32] to automatically generate valid inputs based on a grammar without annotations. After applying Algorithm Unique, we could repeatedly try to insert a label added only by Algorithm Standard and use the valid input generated through derivatives to determine whether the insertion of this label is correct or not.
As future work, we should also explore other grammar analyses that may lead Algorithm Unique to insert more correct labels.
Moreover, we may investigate the use of some normal form when writing a PEG grammar to help our algorithms produce a better result, without imposing too many restrictions on a grammar writer.
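The loop suggested above might be sketched in Python as follows. The function extend_labels and the callback accepts, which stands in for running a parser annotated with a given set of labels, are hypothetical, and the generation of valid inputs through derivatives is not modeled.

```python
def extend_labels(unique_labels, standard_labels, valid_inputs, accepts):
    """Try to add, one at a time, labels found only by Algorithm Standard.

    accepts(labels, text) is an assumed callback that returns True when a
    parser annotated with `labels` still recognizes the valid input `text`.
    """
    accepted = set(unique_labels)
    for label in sorted(set(standard_labels) - accepted):
        candidate = accepted | {label}
        # Keep the label only if no valid input is rejected because of it.
        if all(accepts(candidate, text) for text in valid_inputs):
            accepted = candidate
    return accepted

# Toy stand-in for a parser: each label lists the valid inputs it would break.
breaks = {"ExpB": {"ac"}}

def toy_accepts(labels, text):
    return not any(text in breaks.get(label, set()) for label in labels)

result = extend_labels(
    unique_labels={"Stat"},
    standard_labels={"Stat", "ExpB", "ExpSemi"},
    valid_inputs=["ab", "ac"],
    accepts=toy_accepts,
)
```

A check of this kind can only reject a label when some generated input exposes the problem, so its outcome depends on how well the generated inputs cover the grammar.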
Finally, as the use of labeled failures may avoid unnecessary backtracking, we should also analyze the performance of the generated parsers.