Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax highlighting engine quirks #2326

Closed
Thom1729 opened this issue May 23, 2018 · 15 comments
Closed

Syntax highlighting engine quirks #2326

Thom1729 opened this issue May 23, 2018 · 15 comments

Comments

@Thom1729
Copy link

I've run into some really obscure issues with the syntax highlighting engine.


When you pop a context with a meta_content_scope, that meta_content_scope is not applied to the characters matched by the rule. When you set, the meta_content_scope is applied to those characters. I would expect set to work like pop in this case.

When you push a context with clear_scopes, scopes are cleared for the matched characters. When you set, scopes are not cleared for the matched characters. I would expect set to work like push in this case.

In general, it would be nice if this were refactored a bit so that pop could coexist with push. Then, set could be implemented as a push and pop. This might also make it easy to allow an embed rule with a pop, a common request.


When a rule pushes multiple scopes containing both clear_scopes and meta_scope, when scoping the characters matched by the rule, meta scopes are applied in the wrong order. First, all of the clear_scopes values are applied, then all of the meta_scope values. After the characters are matched, the meta_scope values are removed, then the clear_scopes values are undone. Then the contexts are properly pushed onto the stack and the rules applied in the expected order.

Instead, I would expect that for each context being pushed, first clear_scopes should be applied, and then meta_scope should be applied, and then the next context should be handled. After the characters are matched, this should be undone in reverse order.


When a rule has captures, but some of those captures do not apply any scopes, then the tokenization behavior varies. If the rule uses push, pop, or set, then tokens are always broken on capture group boundaries, even when there is no corresponding scope for the capture rule. If the rule does not use any of those, then capture groups that do not apply a scope are ignored for tokenization. (I have not tested this with embed, so I don't know the behavior there.)

I would expect that capture groups that don't apply a scope either always cause token breaks, or never, regardless of the type of rule. (Never is probably preferable.)


When a rule has multiple captures, and those captures overlap in various ways, the engine will produce surprising results:

  • If a capture with a higher index occurs at an earlier position than a capture with a lower index in the match, the capture with the higher index will be ignored.
  • If two captures begin at the same position, but the capture with the higher index ends at a later position than the capture with the lower index, then both captures' scopes will be applied until that later position, even though one of them should have ended first.

These are odd situations to begin with, and any cleaner solution might have performance implications, but it might be helpful to document the behavior.


If a context is pushed or set from any rule in the prototype, that context will implicitly have meta_include_prototype, as will any context pushed or set from that context, and so on. This applies even if such a context is used elsewhere. This is potentially very surprising, as the behavior of a context varies depending on whether it is transitively included by prototype.

@FichteFoll
Copy link
Collaborator

Your first observation has been discussed before in some place (probably the Packages repo). Iirc the final verdict was that it's weird but cannot be changed because of how much current syntax definitions have to rely or work around this and it would most likely be backwards-incompatible.

I do would like to see this get some consistency because I stumble upon this issue quite frequently when working on a syntax definition. The second issue is probably very related.

I don't think I've run into any of the other issues you describe, but they should be addressed imo.

@Thom1729
Copy link
Author

Having heard some perspectives about set/meta_content_scope, I think that it kind of makes sense, because it allows you to set from one context to another with the same meta_content_scope without leaving a gap.

Given this, factoring out pop would make the most sense. That way, set and push/pop could coexist, treating meta_content_scope differently.

@wbond
Copy link
Member

wbond commented Jan 2, 2019

This might also make it easy to allow an embed rule with a pop, a common request.

Currently it should be possible to pop out of an embed, if I recall correctly.

@wbond
Copy link
Member

wbond commented Jan 2, 2019

When a rule has multiple captures, and those captures overlap in various ways, the engine will produce surprising results:

  • If a capture with a higher index occurs at an earlier position than a capture with a lower index in the match, the capture with the higher index will be ignored.
  • If two captures begin at the same position, but the capture with the higher index ends at a later position than the capture with the lower index, then both captures' scopes will be applied until that later position, even though one of them should have ended first.

These are odd situations to begin with, and any cleaner solution might have performance implications, but it might be helpful to document the behavior.

Could you provide examples of these? I'm not understanding index vs position. In my mind earlier position == lower index, but I must be misunderstanding.

@Thom1729
Copy link
Author

Thom1729 commented Jan 2, 2019

Currently it should be possible to pop out of an embed, if I recall correctly.

I mean that it would be helpful for a single rule to have both an embed and a pop. The engine should pop, then embed. As it stands, when the inner syntax pops, it will effectively pass control to the context that embedded it. This means that this context must do double duty, handling the state before the embed and the state after. These states could be incompatible, leading to bad behavior in corner cases. Ordinarily, you would use set in such a situation, but there is no analogue to set for embed.

An example is HTML.sublime-syntax at line 363 (the script-javascript context). That context handles both the embed and the close tag, but those really should be handled separately. If we could pop from the rule that has embed, then the logic could be simplified.

In addition, there is a weird bug. The following is parsed wrong:

<script></script >
const x = 1;

The </script > is a perfectly valid closing tag, but the script-close-tag:match is not exactly the same as the one in script-javascript:escape. Because script-javascript is doing double duty, the failure mode is to re-embed the JavaScript syntax, which is very bad. If the logic were simplified with the help of embed+pop, the failure mode would have been much less bad. I think that this is a general difficulty: the lack of embed+pop can lead to weird code, which can lead to weird bugs or strange failure modes.

I can also imagine situations in which embed/pop would be absolutely required for correct behavior, although I don't have a real-life example in mind.

Could you provide examples of these? I'm not understanding index vs position. In my mind earlier position == lower index, but I must be misunderstanding.

This can occur with captures in lookaheads:

    - match: (?=foo(bar))(foo)bar
      captures:
        1: region.redish
        2: region.bluish

Given the text foobar, bar will be scoped but foo will not.

It can also occur with capture groups within quantifiers:

    - match: (?:(x)|(y))+
      captures:
        1: region.redish
        2: region.bluish

In a run of x and y characters, this should highlight the last x and the last y, but the last y is only highlighted if it comes after the last x. xy passes, but yx fails.

Admittedly, these are weird situations. (I only ran into them when writing a reference implementation of the sublime-syntax parsing algorithm.) Fixing the capture behavior in the general case would basically require a sort, which could be slow. If a fix is impractical, then merely documenting the current behavior is probably fine.

@FichteFoll
Copy link
Collaborator

The engine should pop, then embed.

What's the argument for this specific order as opposed to the opposite?

I also didn't know capture groups work within look-aheads, tbh.

@wbond
Copy link
Member

wbond commented Jan 3, 2019

I also didn't know capture groups work within look-aheads, tbh.

At one point I tried to remove that functionality from sregex as an optimization to prevent the need for capturing in look-aheads, but I believe it was being used in the default packages somewhere.

@Thom1729
Copy link
Author

Thom1729 commented Jan 4, 2019

What's the argument for this specific order as opposed to the opposite?

If an embed is conceptually a special kind of push, then logically embed-pop should be a no-op, whereas pop-embed should be like a set.

@keith-hall
Copy link
Collaborator

for reference, capture groups within lookarounds were also discussed before at #1796 (comment)

@wbond
Copy link
Member

wbond commented Jun 18, 2020

Fixing the first issue causes 882 assertions to fail in the default syntaxes. Fixing the second issue causes 193 assertions to fail in the default syntaxes.

Based on this, I think most, if not all, of these changes will have to be gated behind a version flag. Even though we can control and fix the default syntaxes, there are most certainly third party syntaxes relying on these.

@deathaxe
Copy link
Collaborator

Agree. This is definitly an issue with a propably wide range of effects with regards of backwards compatibility.

I'd guess most of the failing test cases might be meta. ... scope related though?

What I can say about those "issues" is, that it heavily depends on the strategy how a syntax is designed whether they may be considdered "quirks" or intended behavior or even less relevant.

They have a less impact when using multi-push or multi-set statements.

One can work with those behavior in the one situation while it seems complicating things in others.

But finally it might propably be most consistent to have [meta|clear]_content_scope to be applied to everything within (excluding) the boundaries and [meta|clear]_scope to be applied to everything including the boundaries in all situations (push, set, pop) in a future version of a syntax engine?

@wbond
Copy link
Member

wbond commented Jun 18, 2020

But finally it might propably be most consistent to have [meta|clear]_content_scope to be applied to everything within (excluding) the boundaries and [meta|clear]_scope to be applied to everything including the boundaries in all situations (push, set, pop) in a future version of a syntax engine?

In regards to this general area: set is funky since it is a pop and push, and thus should technically get the meta_scope from both. Unfortunately I think this will make it generally less useful and require more context contortions. My gut instinct here it to treat it as if the pop is a lookahead and the meta_scope should match what a push does.

@deathaxe
Copy link
Collaborator

deathaxe commented Jun 18, 2020

Honestly I am uncertain how to tackle these kinds of issue best, but my feeling is the same as yours in that case. As we are using lookaheads to avoid overlapping meta.scopes I think your proposal is propably the most practical one. It means both, the meta_scope and meta_content_scope of a context are not applied to the matching characters which set another context. With regards to how we handle function definitions' parameter lists for instance this would make sense.

No more lookahead required to avoid overlapping metas.

contexts:
  main:
    - include: function

  function:
    - match: def
      scope:
        meta.function 
        keyword.declaration.function
      push: 
        - meta_content_scope: meta.function.identifier
        - match: \(
          scope: punctuation.section.parens.begin
          set: 
            - meta_scope: meta.function.parameters
            - match: \)
              scope: punctuation.section.parens.end
              pop: 1              

@FichteFoll
Copy link
Collaborator

Note also that not applying a scope can be easily worked around by re-specifying the scope in the pattern's scope but the other way around isn't as easy, so this seems like a good compromise.

@wbond
Copy link
Member

wbond commented Jul 10, 2020

As of build 4075: parts 1, 2, 3 and 5 now act as expected when a .sublime-syntax contains version: 2 as a root level key/value pair. Part 4 is fixed for all syntaxes.

The request for pop and push to exist together was also implemented, although meta scopes are applied differently than set. The meta scopes from any contexts popped off of the stack will not be applied - it is as if the match was a lookahead in regards to the pop.

While pop: 1, push: foo can accomplish the same thing as set: foo, the meta scope will be different, so for backwards compatibility, both actions still exist.

It is also possible to combine pop with embed and branch, and you can specify pop to be true or a number.

@wbond wbond closed this as completed Jul 10, 2020
@wbond wbond added the R: fixed label Jul 10, 2020
@wbond wbond added this to the Build 4075 milestone Jul 10, 2020
@wbond wbond removed their assignment Aug 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants