Errors #15

char0n · 2020-12-14T20:23:08Z

I've been playing with how Errors and tree-sitter recovery mechanism work in this grammar and found some interesting cases that I'm not sure how to handle.

---
openapi: 3.1.0
x-top-level: value
info:
   title: Sample API
  unknownFixedField: value
  description: Optional multiline or single-line description in [CommonMark](http://commonmark.org/help/)
    or HTML.
  summary: example summary
  termsOfService: Terms of service
  version::: 0.1.9
  x-version: 0.1.9-beta
  license:
    name: Apache License 2.0
    x-fullName: Apache License 2.0
    identifier::: Apache License 2.0
    url: https://www.apache.org/licenses/LICENSE-2.0

produces

ERROR [0, 0] - [4, 20])
  block_mapping_pair [1, 0] - [1, 14])
    key: flow_node [1, 0] - [1, 7])
      plain_scalar [1, 0] - [1, 7])
        string_scalar [1, 0] - [1, 7])
    value: flow_node [1, 9] - [1, 14])
      plain_scalar [1, 9] - [1, 14])
        string_scalar [1, 9] - [1, 14])
  block_mapping_pair [2, 0] - [2, 18])
    key: flow_node [2, 0] - [2, 11])
      plain_scalar [2, 0] - [2, 11])
        string_scalar [2, 0] - [2, 11])
    value: flow_node [2, 13] - [2, 18])
      plain_scalar [2, 13] - [2, 18])
        string_scalar [2, 13] - [2, 18])
  flow_node [3, 0] - [3, 4])
    plain_scalar [3, 0] - [3, 4])
      string_scalar [3, 0] - [3, 4])
  block_mapping_pair [4, 3] - [4, 20])
    key: flow_node [4, 3] - [4, 8])
      plain_scalar [4, 3] - [4, 8])
        string_scalar [4, 3] - [4, 8])
    value: flow_node [4, 10] - [4, 20])
      plain_scalar [4, 10] - [4, 20])
        string_scalar [4, 10] - [4, 20])

What is happening here is that parser doesn't parse after title: Sample API. This makes it pretty unusable in for example Editor cases. Only part of the source string is parsed and the rest is either ignored or consumed by parser. I would expect CST to contain at least another Error object with additional children nodes.

Did you ever bumped into this issue?

Thanks for any answer

The text was updated successfully, but these errors were encountered:

char0n · 2020-12-14T21:15:03Z

Here is an example how tree-sitter-json produces it's CST.

Given

{a*

The resulting CST would be:

   Error
     Literal - {
     Error - a*

As we can see tree-sitter-json parses entire source string and creates CST of Errors and Literals covering entire source string.

ikatyang · 2021-01-01T10:54:33Z

The issue here is that I did not use the internal lexer, and I only produces valid tokens, so that there is no token produced if it encountered invalid state. And to increase the parsing speed, I intentionally wrote the lexer with avoiding forking the tree in mind, which in practice combines multiple tokens into ones, that's the reason why error recovery does not work here. I guess this issue can be fixed by producing tokens more granularly and loosely though I'm not sure what the performance impact will be, I'll see what I can do.

char0n · 2021-01-01T15:24:07Z

@ikatyang thanks for describing the problem. That explains a lot.

Error recovery and parsing the entire source string are two most important things that we were looking for with tree-sitter. I do have couple of additional questions, if you don't mind

1. Is there any chance this grammar will choose to use internal tree sitter lexer in future?
2. How complicated it would be to fork this grammar and use internal tree sitter lexer?

I'm not sure what the performance impact will be

I think people use tree-sitter for two reasons - unmatched error recovery ability and performance. Unfortunately this grammar is killing error recovery in favor of performance.

I'll see what I can do.

Thank you very much. We already have a very ugly workaround for this by creating surrogate ERROR node with unparsed source string. If in the first iteration we can make this grammar parse entire string it would be a grand start.

ikatyang · 2021-01-02T04:00:01Z

Is there any chance this grammar will choose to use internal tree sitter lexer in future?

The external lexer must be involved since there some tokens that are not possible to be described by the grammar.js, e.g., the conceptual token (i.e., indent/dedent/etc.) and the lookahead token (i.e., (?=x)/(?!x) in regex). The only question is the strategy of using internal/external lexer, this repo is currently using 100% external lexer because of the better controllability, but it does not seem to be necessary after second thought, I'll probably switch to only use external lexer if necessary to take advantage of features provided by the internal lexer, i.e., better error recovery and machine-generated tokenizer.

How complicated it would be to fork this grammar and use internal tree sitter lexer?

It'd be basically a new project with only reusable test cases IMO.

Unfortunately thing grammar is killing error recovery in favor of performance.

Sorry, I forgot to confirm if it works well in invalid state, I only checked if it shows errors in error state since I originally want to use it as an AST parser 😅.

If in the first iteration we can make this grammar parse entire string it would be a grand start.

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

char0n · 2021-01-02T07:46:44Z

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

No worries, we already have a quick & dirty workaround (creating surrogate errors). No need for another one. We'll wait for some systematic solution on master branch.

char0n · 2021-03-09T13:44:04Z

Hi @ikatyang,

Just pinging if there is any plan for this issue in some foreseeable feature

ikatyang · 2021-03-10T00:54:24Z

It's still on my TODO list but I do not have time to work on it recently so I cannot give you an ETA for it.

char0n · 2024-11-12T09:39:25Z

Closing in favor of tree-sitter-grammars#14

char0n mentioned this issue Dec 14, 2020

Make error recovery in YAML adapters better swagger-api/apidom#194

Open

ikatyang added the enhancement New feature or request label Jan 1, 2021

LuisPatino92 mentioned this issue Feb 12, 2024

Change official YAML parser tree-sitter/tree-sitter#3005

Closed

char0n closed this as completed Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Errors #15

Errors #15

char0n commented Dec 14, 2020 •

edited

Loading

char0n commented Dec 14, 2020

ikatyang commented Jan 1, 2021

char0n commented Jan 1, 2021 •

edited

Loading

ikatyang commented Jan 2, 2021

char0n commented Jan 2, 2021

char0n commented Mar 9, 2021

ikatyang commented Mar 10, 2021

char0n commented Nov 12, 2024

Errors #15

Errors #15

Comments

char0n commented Dec 14, 2020 • edited Loading

char0n commented Dec 14, 2020

ikatyang commented Jan 1, 2021

char0n commented Jan 1, 2021 • edited Loading

ikatyang commented Jan 2, 2021

char0n commented Jan 2, 2021

char0n commented Mar 9, 2021

ikatyang commented Mar 10, 2021

char0n commented Nov 12, 2024

char0n commented Dec 14, 2020 •

edited

Loading

char0n commented Jan 1, 2021 •

edited

Loading