Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors #15

Closed
char0n opened this issue Dec 14, 2020 · 8 comments
Closed

Errors #15

char0n opened this issue Dec 14, 2020 · 8 comments
Labels
enhancement New feature or request

Comments

@char0n
Copy link

char0n commented Dec 14, 2020

Hi @ikatyang,

I've been playing with how Errors and tree-sitter recovery mechanism work in this grammar and found some interesting cases that I'm not sure how to handle.

---
openapi: 3.1.0
x-top-level: value
info:
   title: Sample API
  unknownFixedField: value
  description: Optional multiline or single-line description in [CommonMark](http://commonmark.org/help/)
    or HTML.
  summary: example summary
  termsOfService: Terms of service
  version::: 0.1.9
  x-version: 0.1.9-beta
  license:
    name: Apache License 2.0
    x-fullName: Apache License 2.0
    identifier::: Apache License 2.0
    url: https://www.apache.org/licenses/LICENSE-2.0 

produces

ERROR [0, 0] - [4, 20])
  block_mapping_pair [1, 0] - [1, 14])
    key: flow_node [1, 0] - [1, 7])
      plain_scalar [1, 0] - [1, 7])
        string_scalar [1, 0] - [1, 7])
    value: flow_node [1, 9] - [1, 14])
      plain_scalar [1, 9] - [1, 14])
        string_scalar [1, 9] - [1, 14])
  block_mapping_pair [2, 0] - [2, 18])
    key: flow_node [2, 0] - [2, 11])
      plain_scalar [2, 0] - [2, 11])
        string_scalar [2, 0] - [2, 11])
    value: flow_node [2, 13] - [2, 18])
      plain_scalar [2, 13] - [2, 18])
        string_scalar [2, 13] - [2, 18])
  flow_node [3, 0] - [3, 4])
    plain_scalar [3, 0] - [3, 4])
      string_scalar [3, 0] - [3, 4])
  block_mapping_pair [4, 3] - [4, 20])
    key: flow_node [4, 3] - [4, 8])
      plain_scalar [4, 3] - [4, 8])
        string_scalar [4, 3] - [4, 8])
    value: flow_node [4, 10] - [4, 20])
      plain_scalar [4, 10] - [4, 20])
        string_scalar [4, 10] - [4, 20])

What is happening here is that parser doesn't parse after title: Sample API. This makes it pretty unusable in for example Editor cases. Only part of the source string is parsed and the rest is either ignored or consumed by parser. I would expect CST to contain at least another Error object with additional children nodes.

Did you ever bumped into this issue?

Thanks for any answer

@char0n
Copy link
Author

char0n commented Dec 14, 2020

Here is an example how tree-sitter-json produces it's CST.

Given

{a*

The resulting CST would be:

   Error
     Literal - {
     Error - a*

As we can see tree-sitter-json parses entire source string and creates CST of Errors and Literals covering entire source string.

@ikatyang
Copy link
Owner

ikatyang commented Jan 1, 2021

The issue here is that I did not use the internal lexer, and I only produces valid tokens, so that there is no token produced if it encountered invalid state. And to increase the parsing speed, I intentionally wrote the lexer with avoiding forking the tree in mind, which in practice combines multiple tokens into ones, that's the reason why error recovery does not work here. I guess this issue can be fixed by producing tokens more granularly and loosely though I'm not sure what the performance impact will be, I'll see what I can do.

@ikatyang ikatyang added the enhancement New feature or request label Jan 1, 2021
@char0n
Copy link
Author

char0n commented Jan 1, 2021

@ikatyang thanks for describing the problem. That explains a lot.

Error recovery and parsing the entire source string are two most important things that we were looking for with tree-sitter. I do have couple of additional questions, if you don't mind

1. Is there any chance this grammar will choose to use internal tree sitter lexer in future?
2. How complicated it would be to fork this grammar and use internal tree sitter lexer?

I'm not sure what the performance impact will be

I think people use tree-sitter for two reasons - unmatched error recovery ability and performance. Unfortunately this grammar is killing error recovery in favor of performance.

I'll see what I can do.

Thank you very much. We already have a very ugly workaround for this by creating surrogate ERROR node with unparsed source string. If in the first iteration we can make this grammar parse entire string it would be a grand start.

@ikatyang
Copy link
Owner

ikatyang commented Jan 2, 2021

  1. Is there any chance this grammar will choose to use internal tree sitter lexer in future?

The external lexer must be involved since there some tokens that are not possible to be described by the grammar.js, e.g., the conceptual token (i.e., indent/dedent/etc.) and the lookahead token (i.e., (?=x)/(?!x) in regex). The only question is the strategy of using internal/external lexer, this repo is currently using 100% external lexer because of the better controllability, but it does not seem to be necessary after second thought, I'll probably switch to only use external lexer if necessary to take advantage of features provided by the internal lexer, i.e., better error recovery and machine-generated tokenizer.

  1. How complicated it would be to fork this grammar and use internal tree sitter lexer?

It'd be basically a new project with only reusable test cases IMO.

Unfortunately thing grammar is killing error recovery in favor of performance.

Sorry, I forgot to confirm if it works well in invalid state, I only checked if it shows errors in error state since I originally want to use it as an AST parser 😅.

If in the first iteration we can make this grammar parse entire string it would be a grand start.

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

@char0n
Copy link
Author

char0n commented Jan 2, 2021

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

No worries, we already have a quick & dirty workaround (creating surrogate errors). No need for another one. We'll wait for some systematic solution on master branch.

@char0n
Copy link
Author

char0n commented Mar 9, 2021

Hi @ikatyang,

Just pinging if there is any plan for this issue in some foreseeable feature

@ikatyang
Copy link
Owner

It's still on my TODO list but I do not have time to work on it recently so I cannot give you an ETA for it.

@char0n
Copy link
Author

char0n commented Nov 12, 2024

Closing in favor of tree-sitter-grammars#14

@char0n char0n closed this as completed Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants