Wow, this is surprising to me. Why doesn’t this work?

HelloNurse · on June 5, 2024

See the nearby comment by adwn: aggressive "packrat" caching fixates on long matches starting at intermediate positions, without considering any shortest matches at the same position.

Correct parsing would require discarding these long matches because they cause higher level rule expansions to fail, but smarter algorithms with an increase of runtime due to complete backtracking and/or an increase of memory usage to distinguish and remember different matches for the same rule at the same location would be required.

EDIT: two nearby comments by adwn.

kragen · on June 5, 2024

this is not specific to the packrat parsing algorithm; any algorithm for parsing pegs has the same behavior as packrat. your understanding of peg semantics is incorrect

adwn · on June 5, 2024

(I'll use position 1...5 to refer to the characters in "aaaaa", and first, second, third level to refer to the initial and then the recursive calls to the Str rule)

1. When the PEG parser enters the third level of Str, its first choice will successfully match on positions 3...5, and return to the second level of Str.

2. The first choice on the second level will fail, however, because it expects an "a" at position 6, which doesn't exist. Therefore, the first choice is discarded.

3. The second choice on the second level will succeed by matching "a" on position 2, successfully returning to the first level.

4. The first level finishes matching the first choice by consuming the "a" on position 3, successfully returning the parse tree Str = ("a" ("a") "a")

5. The remaining "aa" on positions 4...5 weren't consumed.