This article is rated B-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||||||||||||
|
I'm not convinced that the grammar listed for is correct.
A matches any number of a's followed by the same number of b's. B matches any number of b's followed by the same number of c's.
&(A !b) ensures that we have any number of a's followed by the exactly the same number of b's. a* consumes all the a's, and then B !c matches any number of b's followed by the exactly the same number of c's.
In other words, A and B overlap: they match the same string of b's. They are more symmetrical that the rule suggests.
The problem is that the A and B rules also match the empty string, so this grammar accepts for n=0, which per the prose, it shouldn't. (Also, since we know there should be at least one a, we can use a+ instead of a*).
I'm going to change the A and B rules not to match the null string, the S rule to use a+, and use math markup for the equation (and n >= 1 rather than n > 0, both for consistency with Context-sensitive grammar).
Malcolm rowe 13:06, 26 September 2005 (UTC)
The solution given by Malcolm rowe above is correct, but as of this edit, the example in the article is broken. The rule as written:
S ← &A 'a'+ B A ← 'a' A? 'b' B ← 'b' B? 'c'
actually matches .
I'm changing this to:
S ← (&A !'b') 'a'+ B !'c' A ← 'a' A? 'b' B ← 'b' B? 'c'
Mike Segal (talk) 01:31, 15 June 2015 (UTC)
(Removing my own earlier comment.) QTJ 05:23, 11 October 2006 (UTC)
What does REBOL have to do with parsing expression grammars? Using them or providing a mechanism to use them doesn't count.
The definition of a parsing expression grammar says it consists of
a finite set N of nonterminal symbols a finite set Σ of terminal symbols that is disjoint from N a finite set P of parsing rules
Please provide clearly and explicitly in all of the examples, what all three of these are.
S <- if C then S else S / if C then S
"if" "then" and "else" are in the finite set of terminal symbols? If so, then say so. And so forth for the rest of the examples.
thank you. --LAM 66.28.244.66 17:04, 15 May 2006 (UTC)
^^ ++! Additionally it does not help to say "if/then/else" would be standard C-style, because in C you don't have "then", this has been confusing.
Another problem that should IMHO be mentioned in the article manifests itself here: The order in which one specifies the choices is very important. E.g. the following rule would not be correct:
S <- if C then S / if C then S else S
--92.202.58.200 (talk) 18:10, 16 December 2008 (UTC)
The first example (PEG for mathematical formulas that use the basic four operations) is (was, now) incorrect. It allows expressions such as 3.2 + ..2..3..000
Only a single period should be allowed. —Preceding unsigned comment added by HappyDog (talk • contribs)
Value ← [0-9]+ / '(' Expr ')' Product ← Value (('*' / '/') Value)* Expr ← Product (('+' / '-') Product)*
Because the `Sum' in the existing example isn't really a sum (it contains both sum and product), but later assigning it to `Expr' is a redundancy. Does it make sense? — Preceding unsigned comment added by 62.219.142.171 (talk) 15:20, 25 July 2011 (UTC)
I suggest that the packrat parser article stub be merged into this article. Since I am a parsing technology author myself, I recuse myself from the discussion beyond this suggestion, to avoid conflicts-of-interest. My rational is that packrat parsing is not widely enough adopted to get a full stub. Packrat parser could then be a redirect to this page, instead.
-- QTJ 00:40, 19 October 2006 (UTC)
I second QTJ's suggestion.
--Junkblocker 00:39, 20 January 2007 (UTC)
I propose that the two are separated again. PEGs are a formalism whilst the packrat parser is an algorithm. I found the merge confusing when reading here. Are the two formally equivalent? If not, they deserve separate articles. Even if they are, I think this is a good idea. We would not combine, e.g. the article on regular expressions with that on finite state automata. The definition of the packrat parser given here is thin. It should be expanded, but probably not within the PEG page.
Your thoughts? --winterstein (talk) 07:39, 11 May 2009 (UTC)
For (* comments *) the article claims:
but wouldn't N* eat everything including the closing comment bracket? Shouldn't this be:
to avoid eating the closing comment backet? The only reason to not allow N ← (* is to generate an error if unbalanced. 59.112.63.188 18:09, 15 April 2007 (UTC)
We say this:
Parsing expression grammars look similar to regular expressions or context-free grammars (CFG) in Backus-Naur form (BNF) notation, but have a different interpretation.
It's not clear to me how they have a different interpretation from regular expressions. Seems to me that's what they are, with the possible exception of the & and ! lookahead functionality. Can someone explain how they are different? --Doradus 03:33, 13 August 2007 (UTC)
Under disadvantages it is mentioned that PEG's cannot handle left recursive rules. I believe this statement is correct, but it is not implied by the current definition. ie. The current definition of a peg is deficient.
—Preceding unsigned comment added by 202.37.96.11 (talk) 22:54, 24 October 2007 (UTC)
expr = term (('+'/'-') tetm)*;
I think the statement that left recursion is not supported by packrat parsers should be removed, as a paper *mentioned in the article* provides a way to parse left recursive PEGs. --Tetha (talk) 12:06, 10 December 2008 (UTC) Tetha
The definition of a peg doesn't mention grouping of expressions (e), but the example uses it. Which is right?
I suspect the answer is the definition needs to be extended to match the example.
—Preceding unsigned comment added by 202.37.96.11 (talk) 22:47, 24 October 2007 (UTC)
Note that ordered choice interpretation of arbitrary context-free grammars is not new. One example of this is the TXL programming language (http://www.txl.ca/), which has existed since the mid 80s. Add that regular expressions can be de-sugared to context free grammars and it is hard to see PEGs as a new kind of formalism. —Preceding unsigned comment added by Thurston51 (talk • contribs) 23:40, 27 November 2007 (UTC)
How do PEGs compare to CFGs in expressive power? Rp (talk) 10:42, 30 November 2007 (UTC)
This is wrong. As the wikipedia-article here demonstrates, PEGs are able to describe certain context sensitive languages, so CFG cannot be a superset of PEG. However, I am still wondering if PEG are more powerful than CFG or of PEG and CFG cannot be compared using subset relations --Tetha (talk) 15:39, 11 December 2008 (UTC)
I think, the following is imprecise:
The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.
The "a" in "a b" will not be matched, though the "a" is not "the first in an arbitrarily-long sequence of a's followed by a b", because the " arbitrarily-long sequence of a's" is empty in this case, so it does not have the first element. I'd propose to change it to "the first in a non-empty sequence of a's followed by b". I won't do the change myself, because I'm a newbie. --Mikon (talk) 09:30, 26 January 2009 (UTC)
I read this months ago and it was bad enough I had to correct it:
Unlike in a context-free grammar or other generative grammars, in a parsing expression grammar there must be exactly one rule in the grammar having a given nonterminal on its left-hand side. [...]
Now, somebody felt it prudent to put this text back in. This is wrong. This difference is a purely syntactic issue that offers no meaningful insight into the differences between PEGs and CFGs. It's just that instead of writing alternate possibilities as, say
A -> () A -> (A) A -> A A
You write this grammar like this instead:
A -> () | (A) | A A
To further illustrate my argument, I would point to the first page of Shriram Krishnamurthi's book, Programming Languages: Application and Interpretation, which is electronically available under the Creative Commons, and I quote as follows:
The first insignificant attribute is the syntax. Syntaxes are highly sensitive topics, but in the end, they don’t tell us very much about a program’s behavior. For instance, consider the following four fragments:
1. a [25] 2. (vector-ref a 25) 3. a [25] 4. a [25]Which two are most alike? The first and second, obviously! Why? Because the first is in Java and the second is in Scheme, both of which signal an error if the vector associated with a has fewer than 25 entries; the third, in C, blithely ignores the vector’s size, leading to unspecified behavior, even though its syntax is exactly the same as that of the Java code. The fourth, in ML or Haskell, is an application of a to the list containing just one element, 25: that is, it’s not an array dereference at all, it’s a function appliction!
http://www.cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04-26/
Obscuranym (talk) 04:01, 21 April 2009 (UTC)
A -> B
, A -> C
are not the same as the ordered choice A -> B/C
. The following grammer is ambiguous:S -> A S -> a A -> a
It is not obvious, but ordered choice is very different from unordered choice when you get into the theoretical details. Best way to understand this is probably to work through Section 4 of Ford's POPL'04 paper: how to eliminate lookahead constraints. You can rewrite any PEG with lookahead constraints into an equivalent (but larger) PEG not using lookahead constraints, by exploiting the ordered aspect of / in sneaky ways. 88.129.117.158 (talk) 13:15, 25 December 2023 (UTC)
The example given for a language that cannot be recognized by PEGs is: S ← 'x' S 'x' | 'x'. Is this not recognizable through the following PEG? S ← 'x' ('x' 'x')* —Preceding unsigned comment added by 69.251.13.250 (talk) 11:21, 29 December 2009 (UTC)
S ← 'x' ( S 'x' | ε)
Steamerandy (talk) 13:13, 30 October 2015 (UTC)
I think this article is remiss in not citing the original paper that introduced PEGs. The full citation is Bryan Ford, "Parsing Expression Grammars: A Recognition Based Syntactic Foundation", Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2004. The online home is: http://portal.acm.org/citation.cfm?id=964001.964011
Since this recent paper introduced the concept of PEGs, I think both that it should be cited and referenced as the source of the idea, but the discussion of the concept itself should probably be derived from this work. 128.173.236.168 (talk) 15:43, 17 March 2010 (UTC)
"Compared to regular expressions, PEGs are strictly more powerful" - Unlike PEG, regexes have back-references, so how PEG can be "strictly more powerful"? Enerjazzer (talk) 12:58, 7 July 2010 (UTC)
Isn’t this just non-strict LL(1)? It works like LL(1), but conflicts get ignored, we say “first rule has precedence”. Otherwise I would like to have an example for great stuff Packrat can parse. --Chricho (talk) 12:48, 11 September 2010 (UTC)
I think the anti-LL(1) example works for Packrat, too:
S ← A|B A ← a A b | ε B ← a B b b | ε
or did I get anything wrong with the & operator and it is possible to express this language with a PEG? --Chricho (talk) 00:11, 15 September 2010 (UTC)
Does:
S ← &(a b) a b | &(a c) a c
accept the input ac?
S ← a b | a c
would certainly not accept it. Or wait: “The choice operator e1 / e2 first invokes e1, and if e1 succeeds, returns its result immediately. Otherwise, if e1 fails, then the choice operator backtracks to the original input position at which it invoked e1, but then calls e2 instead, returning e2's result.” – that cannot be true. I think it makes the decision based on the first-set, it will not backtrack. --Chricho (talk) 00:17, 15 September 2010 (UTC)
The linked topic doesn't have anything relevant. Either that's an invented term, inapt (caching comes to mind as an alternative), or the linked topic needs some sort of disambiguation to support the link TEDickey (talk) 11:36, 23 November 2010 (UTC)
I couldn't find one, but there are mentions of real parsers in the article. Would anyone post a link to a program / library where they are used? Thanks 109.64.130.34 (talk) 17:18, 25 August 2013 (UTC)
The statement under the heading "Subtle grammatical errors" is pushing the definition of "disadvantage" a little. It's a pretty ridiculous leap of logic to say that PEGs are flawed because it takes effort to translate ambiguities from a lesser formalism. Those ambiguities are a flaw in the original grammar, not in the PEG translation. In fact, I'm going to remove this silly and bizarre statement, since it's completely flawed reasoning and adds nothing insightful to the article anyway. — Preceding unsigned comment added by 82.9.176.129 (talk) 14:46, 24 September 2013 (UTC)
The claim that the running time of the Bellman-Ford algorithm of is quadratic (in ) is true only if the graph is sparse (i.e. ). A dense graph in which would yield the same asymptotic running time as the Floyd-Warshall algorithm (which is cubic, not quadratic). Newtsjm (talk) 06:39, 22 August 2015 (UTC)
Here is the deal. This isn't a parser but a programming language for writing compilers that generates code directly as written. The current method is that one uses a parser generator that usually generates a table for an already written parser. But being a programming language the programmer is writing the parser in a grammar specifically designed for the purpose. This was published in an ACM paper in 1970. It is close to a GTDPL (Generalized Top Down Parsing Language). Probably in it's own category an ETDPL (Extended Top Down Parsing Language). Extended because it explicitly control backtracking and may back track any number of rules. It also provides general look ahead features.
After looking over the PEG topic here I was thinking it to be a PEG. Because CWIC[1] was developed and an ACM paper published in 1970 doesn't mean that it was ever classified as other then a metacompiler. Most of the parser classification (types) were not named at that time. So what if PEG is not a new idea? But in a way I an not so sure that CWIC is a PEG. These meta compilers were said to have recursive functions. One of the problems I have with categorizing some parsers as recursive decent. Is that there may not be any recursion. It may be that I am just old school but in my day recursion is a function calling it's self. Such as the recursive definition of a factorial. To me allowing recursion and using recursion are not the same thing. For example using the grammar we define an arithmetic expression as:
expr = term $(('+'|'-') term); // This rule looks like EBNF. I know. But in CWIC it would not produce any output of any kind except success or failure.
That specific rule is not recursive. At some point recursion would be used for a parenthesized expression '(' expr ')'. In this language that rule does not produce output or actions. If to the above rule we add control for building lists and tree structures, recursion can be used to control the form of a tree.
expr = term $(('+':ADD|'-':SUB) term!2); // builds a binary tree of the form ADD[<something>,<something>] or SUB[<something>,<something>]
Given the expression a+b-c we get first a tree of the form ADD[a, b] and loop to make a SUB[ADD[a,b],c] A left handed tree. using looping. Redefining expr in a recursion form:
expr = term (('+':ADD|'-':SUB) expr!2|.EMPTY);
We get a right handed tree ADD[a,SUB[b,c]] because the subtraction tree was built by calling expr recursively and then the ADD tree is built upon return.
Again if we redefine expr to make a list and only mark substraction:
expr = +[ term $('+' term | '-' term:MNS!1) ]+;
We get a list: [a, b, MNS[c]]
In this parser recursion is controlled by the programmer. Recursion is necessary for nested language constructs such as parenthesized expression (logical or arithmetic), nested block structures etc. The language is compiled into executable code. Each rule is a function returning success or failure. It specifically controls backtracking using a different alternate operator. limited backtracking is automatic when parsing strings and tokens. Only the input stream is backtracked. Controlled backtracking resets the state, undoing trees, nodes and lists. And may be nested. Controlled backtracking may bail out of any number of nested rules.
These examples are based on a compiler compiler that was working in 1970. It is a top down analytic parser no question. But is it really recursive decent? The example below is from an updated version defined to be more like c or c++.
Below 'program' is top driver rule. A program is zero or more declarations.
program = $ declarations; declarations = '#' directive | comment | global DECLAR(*1) |(id (grammar PARSER(*1) | sequencer GENERATOR(*1) | optimizer ISO(*1) | pseudo_op PRODUCTION(*1) | emitter_op MACHOP(*1))) || ERRORX("!Grammer error") garbol); grammar = ( ":" class :CLASS // parse character class rule | ".." token :TOKEN // parse token rule | "==" syntax :BCKTRAK // parse a ankered grammar rule | "=" syntax :GRAMMAR // this grammar can long fail. )";"!2 // Combine name and rule tree $(-break -"/*" .any);
A declarations can be a directive which is recognized by the '#' character. Or a declarations could simply be a comment.
comment = "//" $(-.break .any) | "/*" $(-"*/" .any) "*/"); // Defines a c or c++ stile comment such as this comment.
Or declarations could be a global declaration. Getting to declarations that start with an identifier. An id may be followed by a grammar, sequencer, optimizer, pseudo_op, or an emitter_op. It is the grammar I am interested on discerning the classification of here. The double bar alternate is a backtrack alternative that catches an error, Calls ERRORX and garbol. garbol simply looks for a ; and returns when matched. This is a case were backtracking can bail any number of stacked rules. A garbol rule recovers from an error and skips to the end of a rule:
garbol = $ (-';' .any) ';'; // zero or more, match not ';' loop not matching ';' (skipping .any characters not a ';') and then match ';'. skips to end of rule in error.
Below are some character class and token making rule examples:
bin: '0'|'1'; // Binary character class can be a 0 or 1. oct: bin | '2' | '3' | '4' | '5' | '6' | '7'; // the oct class references the bin class dgt: oct | '8' | '9'; // previous defined classes can be referenced. hex: dgt | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' // '<character>' is a single character string. | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'; chr_const .. "''''" , "'" | ''' any ''' MAKSTR(); // match a single character string. str_const .. '"' $(~'"' | """""" ,'"') '"' MAKSTR(); // matches a string of characters. Double "" are matched, single " is placed in string
The class defining rule is:
class = +[ character $('|' character) ]+ ';'; // id ':' class :CLASS!2 PARSER(*1); CLASS{<id>,[<character list>]] character = $comment (chr_const | integer | id classcheck()) $comment;
MAKESTR[] is a function that intercedes the making of a symbol and creates a string object. The sequence "id classcheck()" checks that id is defined as a class. It can not be the current class as it's id is not yet defined as a class.
The '..' token defining rules and ':' character classes are definitely not recursive. TOKEN rules could only use string constants and character classes. Character classes are not functions. In the implementation character classes are used in generating a class map table that was indexed by the character code. Bits at the indexed location indicated the characters classes memberships. Optimizing testing class membership with a single instruction on many computers.
token = chr_pattern (maker:INTERCEPT!2 ('|' token:ALT!2 |--) |--); chr_pattern = +[basic $ basic]+; basic == '$' basic :$!1 |'(' tokenxpr ')' | str_const | char_match | insert; tokenxpr = chr_pattern ('|' (+"--" | tokenxpr) :ALT!2 | -- ); char_match = ('+':RETAIN|'-':NEG|'~':NOT) (str_const | chr_const)!1 | char_test; char_test == +"any" | +"ANY" | (chr_const | id) ('*' :$!1 | --); insert = ',':INSERT (chr_const | str_const)!1 maker == procid '(' +[ (m_arg $(',' m_arg)) |-- ]+ ')' :CALL!2; m_arg = str_const | integer | character | id;
The above is thesyntax of the token defining rule.
Character class rules and token rules can not be recursive. Character classes can reference only preexisting character classes.
Token rules can only reference character classes and match stings. and call interceding functions as the last thing of an outermost alternate change. The integer rule of the compiler compiler is an example that parses an integer.
bin: '0'|'1'; oct: bin | '2' | '3' | '4' | '5' | '6' | '7'; dgt: oct | '8' | '9'; hex: dgt | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'; integer .. ("0b"|"0B") bin bin* MAKEBIN() // binary integer |("0o"|"0O") oct oct* MAKEOCT() // octal integer |("0x"|"0X"|"0h"|"0H") hex hex* MAKEHEX() // hex integer | dgt dgt* MAKEINT(); // decmal integer
I am looking for opinions for classifying the beast. It could produce textual or binary machine code as output. The list or tree constructs were passed by calling function from the parser rules. A complete compiler could be written in this language. Except for run-time libraries for programs produced by compilers written in this compiler compiler. The examples are a new syntax that more resembles current languages. CWIC used / and \ for alternates. It's character set was limited to punch card codes available. The token rules are extended to allow multiple forms of the same type token. Integer in different bases as above. CWIC would have used a separate rule for each base and an integer rule
There is an example of a simple interpreter written in CWIC over on the metacompiler talk page.
Hope this helps.
But even if this is ultimately decided to be a PEG language it shouldn't nullify Ford's work. Most of the compiler terminology of today was not invented when CWIC was developed.
--Steamerandy (talk) 09:59, 27 October 2014 (UTC)
References
The claim has been flagged with [citation needed] since 2011, and is false. Many tools for generating LR parsers expect the input to be output from a tokenizer, but that isn't a requirement. It's just a matter of defining the alphabet of terminals to be bytes or Unicode codepoints or whatever instead of tokens. ~ LukeShu (talk) 03:57, 13 November 2015 (UTC)
Under "Implementing parsers from parsing expression grammars", it says "It is possible to obtain better performance for any parsing expression grammar by converting its recursive descent parser into a packrat parser". "packrat parser". redirects to that same section. If someone wants to fix that somehow, idk. — Preceding unsigned comment added by 70.131.44.130 (talk) 02:07, 7 March 2019 (UTC)
I am going to argue for reviving the citations that were removed in revision 812879723, see https://en.wikipedia.org/w/index.php?title=Parsing_expression_grammar&type=revision&diff=812879723&oldid=812750522. The reason for removal by User:Melcous was legitimate (linking to own website) but the article has since been published (https://doi.org/10.1145/2814251.2814265) and is available from an independent open access provider (https://arxiv.org/pdf/1509.02439v1.pdf). Both Medeiros' and Laurent's contributions improve on the existing citation of Warth et al., and both were instrumental for me in adding left-recursion support to Pegged. I am therefore considering to bring both citations back in, and maybe add a sentence about their individual contributions. Please comment if there are objections. Bastiaan Veelo (talk) 22:12, 19 April 2019 (UTC)
The grammar given in the "Indirect left recursion" section is incorrect. It would parse "1+2*3+4" as ((1+2)*3)+4.
Since this edit, the article has likended the lookahead constraints ! and & of PEGs to those of boolean grammars. That's a weird point of reference, considering that boolean grammars (i) appears to be a more recent development (main publication 2010, as compared to 2004 for PEG) and (ii) are more complicated to interpret (require solving a system of functional equations). If one wishes to compare the PEG lookaheads to anything, it would make more sense to reference those of advanced regular expressions — (?=…) and (?!…) as described for example in [2] — but of course any comparison with regular expressions opens up the Pandora's box of readers getting confused by not-at-all regular features available in major "regular expression" engines. Therefore I'll just undo that edit, moving the link instead to the "See also" section.
To elaborate on the regular expression option: that comparison would be apt, in that adding or removing lookaheads as features will not change the underlying class of languages — regular languages and parsing expression languages are still the same. On the PEG side Ford shows this in his paper, by the transformation to eliminate predicates. On the regular side it's probably easiest to do in terms of automata: take the cartesian product of the automaton for the lookahead expression and the automaton for the main expression (minus the lookahead), then modify set of accepting states so both have to accept (there are some additional fine details to deal with, which depend on whether you do it in terms of DFAs or NFAs, but mainly you runs multiple automata in parallel).
Moreover, I'm somewhat skeptical towards the original claim. The lookaheads in regular expressions and PEGs are notably not constrained to a particular subexpression — that is essential for how they are used in the grammar for — but boolean grammars rather seem like all patterns involved in the definition of a nonterminal are matched against the same range of input (though that article is still pretty much a stub, so I could be getting it wrong). Thus even if both kinds of grammar might have "lookahead constraints" that can be eliminated without changing the class of languages, it is likely that those constraints do not behave in the same way. 88.129.117.158 (talk) 15:48, 25 December 2023 (UTC)
The start of the article introduces PEGs using abstract syntax, e.g. for ordered choice, but later the exposition switches to concrete syntax, e.g. E1 / E2
and 'x' E1
, without comment or explanation. That is likely confusing if you're not already familiar with these grammars, and should be clarified. 88.129.117.158 (talk) 14:50, 28 December 2023 (UTC)
The article gives Aho–Sethi–Ullman (1986) as one source for the claim that left recursion can be eliminated. It seems dubious that the techniques in this book would suffice for PEGs; the CFG technique of repeatedly expanding an initial nonterminal works because if the initial item is not a nonterminal then it is a terminal and thus not zero length, but in a PEG there are zero-length items which don't just disappear. Consider
S <- !(.'x') S / 'wx'
88.129.117.158 (talk) 12:30, 5 January 2024 (UTC)
Confirming that the techniques in Aho–Sethi–Ullman — relevant pages are 176–178, plus some exercises — are specifically for context-free grammars (and even if that could be lifted, for generative grammars). They rely heavily on order of choices being irrelevant. 37.197.59.228 (talk) 11:30, 13 January 2024 (UTC)