Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JavaScript] Improve parsing of class names. #1614

Merged
merged 5 commits into from
Jun 18, 2018

Conversation

Thom1729
Copy link
Collaborator

@Thom1729 Thom1729 commented May 24, 2018

This duplicates #1581, but an additional change was needed to fix another corner case from #1600. In effect, this is an extension of the approach from #1581.

null|true|false
){{identifier_break}}

non_reserved_identifier: (?:(?!{{reserved_word}}){{identifier}})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering whether we can find solutions without such sophisticated lookaheads as they together with heavy usage of multi-push/set can have a significant impact on parsing performance. Such constructs are nice in order to reduce duplicate rules, but can even simpler lookaheads can easily slow down parsing by 20%.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware that there could be a performance penalty for lookahead. My impression was that the custom regexp engine compiled regexps to state machines, which shouldn't care how complex the expression is. Is there more information on this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not from official sources. I just realized in while writing my CNC package and working on fixes/changes on other syntax definitions.

It's somehow a bad habit of mine to always falsify/verify the way I am going. Therefore I always try different ways to solve a problem to be able to choose the best one. So I did while learning sublime-syntax grammer.

With a little look behind in history to times of tmLanguage, the only way to tokenize text was by a flat list of regexp rules. Each rule needed to uniquely identify a piece of text as whatever it is meant for. With sublime-syntax a powerful feature - the contexts - was introduced, which enables splitting a file into different pieces which can be parsed with different subsets of regexp rules.

The first approaches of writing syntaxes therefore looked as follows with the only goal to just limit the number of rules being applied in certain situations. Neither order nor amount of certain keywords following was of any interest.

main: 
   - match: \{
     scope: punctuation.new-subset.begin
     push:
         - match: \w+
            scope: punctuation.new-subset.end
            pop: true
         - match: \w+
            scope: variable
         - match: '-+*/'
            scope: keyword.operator

By growing knowledge about another feature - "pushing/setting multiple contexts after each other" - syntax definitions try more and more to strictly enforce the order or number of allowed expressions/keywords. In other words they try to do static code analysis and linting with heavy invalid-scoping - or just tend to split the code in too small and strict pieces. The result of such enforced rules is syntax highlighting to break much more likely as

  1. the complex context structure works for complete and valid statements only
  2. the complexity of contexts leads to a high risk of bugs triggered by unusual but valid expressions/statements/coding styles.
  3. the complexity and strictness also increases the number of edge cases

Even though pushing/setting multiple contexts to the stack after each other might be required to achieve robust syntax highlighting in some situations and it helps defining reusable general purpose contexts, it should IMHO not be over-used, as each context in the chain needs to be popped off before the next can be set/pushed to stack. Each context often matches one real expression which is then consumed by the regexp engine, but at least one look-ahead like (?=\S) which need to be evaluated and thrown away afterwards - only one char in this case. But with more complex look-aheads the engine might need to evaluate the whole code twice or three times in the background.

Most statements can be parsed using simple push-set-set... chains with best performance results.

To illustrate what I mean, I created a little repo. It contains something I am working on at the moment. You'll find two ways to do the same thing. While the DOCTYPE1 uses simple push-set chains, the DOCTYPE2 uses multi-push with lookaheads. The later one is about 20 to 25% slower. To benchmark such things I started to write syntax_test_perf files with about 100k of lines of code of certain constructs to clearly see differences. One can ask itself whether it is useful or whether certain statements will occur in such numbers in the real life. If not it may not be worth thinking about, but to achieve a good overall performance everything counts. I just have all the complaints about poor performance with large files in mind then.

https://github.com/deathaxe/sublime-syntax-benchs

We need to keep in mind:

  1. We don't write a compiler
  2. We don't write a linter
  3. We don't write a code analysis tool
  4. We want syntax highlighting to work as seamlessly as possible while writing code, which is by nature invalid while writing it!

What many syntax_tests lack of are tests for incomplete or partly valid statements and expressions to evaluate how the syntax definition handles them or the valid following statements.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with the benchmarks, and it looks like more than all of the slowdown is explained by the negative lookahead. When you factor that out into another rule, the multi-push approach is very slightly faster than the push-set-set approach (low single percentages, but consistent).

Removing the capture groups from the expression made both definitions faster, but it also eliminated the majority of the slowdown from the negative lookahead. Simplifying the expression further nearly eliminated the difference (low single percentages, somewhat inconsistent).

It seems that if there is a performance penalty to lookahead assertions, it is only significant when the expression is complicated. I tested 50,000 class expressions, and the lookahead didn't seem to make any measurable difference.

I haven't seen the internals of Sublime's regexp engine, but it is generally presumed to be implemented using state machines. The textbook compilation process is regular expression → nondeterministic finite automaton (by Thompson's construction) → deterministic finite automaton (by the powerset construction). This process does not account for captures; various tagged automata can be used to implement them. Lookaheads are slightly more complicated. The most obvious solution is to compile regular expression → tagged alternating finite automaton (via modified Thompson's) → tagged NFA (via modified powerset) → tagged DFA (via modified powerset). I'm not aware of any formal description of tagged alternating finite automata, but they would seem to follow straightforwardly from the various tagging approaches.

The powerset construction has worst-case exponential blowup, and this approach uses two of them, but the average-case blowup is much smaller. (In some cases, I've seen the number of states shrink after the powerset construction.) As a rule of thumb, the blowup from the powerset construction when merging two regexps with the | operator, or when merging two lookaheads or a lookahead with a non-lookahead, should be from the sum of the regexps' individual complexities to their product. In other words, the blowup should be capped by a factor of the complexity of the simpler operand. In the case of the doctype example, the lookahead and the base expression are both complicated, leading to large blowup. When the base expression is simplified, the blowup disappears. When matching JavaScript class names, the lookahead is complicated but the rest of the match is very simple, so the blowup is negligible.

Although I can't be sure without seeing the internals of Sublime's regexp engine, this model seems to explain the observed behavior.

On a mostly-unrelated note, lookaheads can result in multiple rules scanning the same text if the lookahead extends past the base expression. Obviously, this is duplicated work, and there must be a performance cost. When the lookahead extends only one character past the base expression, this cost is very small. If it could extend very far, the cost could be greater. (The line lookahead limitation does put an upper bound on this.) When I see large lookaheads in syntax definitions, it's usually because there is no alternative. The JavaScript syntax contains many examples, such as identifying arrow functions and method definitions. Although a cover grammar could parse such constructs, it couldn't properly highlight them. (In many cases, we see bugs where newlines interfere with the lookaheads, and we use cover-grammar-like constructs as the recovery behavior.)

Nondeterministic parsing would allow us to eliminate most extended lookahead. This is no free lunch -- the implicit reparsing of the lookaheads is made explicit. However, the implementation I proposed would actually avoid reparsing in the common case. The cost is that in the uncommon case, Sublime would have to rewind the token stream.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the stack use, my impression is that pushing and popping contexts is cheap, and that the expensive operations are running the regexps and (maybe) emitting tokens. The only concern I would have is about running "dummy" regexps to auto-pop; however, anecdotally the custom regexp engine seems to have very little overhead (at runtime), so matching '' or (?=\S) is extremely cheap.

I have generally found that converting a syntax to use the stack more results in a noticeable speedup. This proved true for the JavaScript syntax. I would tend to attribute this to smaller contexts with simpler regexps.

Personally, I think that this approach makes error handling easier too. When you expect a bunch of things in order, you can push a context for each thing. If a thing is missing, that context should pop. An illustrative example is #1606: the pop was missing, and adding it made the construct fail gracefully. I find this more difficult to do when using set.

Copy link
Collaborator

@deathaxe deathaxe May 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except the fact I did not yet study the internals of regexp engines or the related NFA, I can confirm the very most of your statements.

  • negative lookaheads are expensive
  • eliminating captures improves speed
  • push/set is cheap
  • matching '' or (?=\S) is cheap
  • simpler regexp and smaller contexts are much better to maintain
  • line endings cause bugs with lookaheads of statements which can span multiple lines
  • using set is difficult

The set should indeed be used in deterministic chains only. It can make things much more difficult to handle and always requires to be very careful where to include such rules. One always needs to care about how it breaks the meta scopes of owning contexts. But it's a bit faster than pushing.

In the example of the indeed quite complicated qualified_tag_name I found the performance not to differ when replacing it by multiple smaller matches, even though my general experience covers yours. Pushing/setting smaller contexts with simpler rules which avoid captures results in better parsing performance.

I can't confirm your statement about multiple push/sets to be slightly faster. In all case studies I did they were a bit slower, if they really do the same thing as the push/set/set chains. In most cases they were faster, I forgot something.

But in general I really like this approach because of the ability to write reusable contexts. Maybe this heavily depends on the structure of required regexps or I still doing something wrong. But directly setting the next context rather than popping off each one to set the next seems faster.

To proof it, I committed a modification to my example, which makes both variants absolutely equal except the way the contexts are pushed to stack after each other. The reason why the push/set/set was slower during your tests was the missing (?=\S) I added to the DOCTYPE 1 example now.

EDIT: A possible performance boost might be achieved by smarter parsing strategies when working with multiple pushs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a small bug in the revised benchmark. In XML DOCTYPE 1 (push-set-set), the contexts set on (?=\S). In XML DOCTYPE 2 (multi-push), those contexts include: tag-end-else-pop, which means that they pop on (?=\s<) as well.

To compare them, I used a single pop or set expression:

pop on push-set-set multi-push
(?=\S) 187 ms 189 ms
(?=\S|\s<) 188 ms 190 ms

For reference, the original multi-push (using two separate pop rules) took 193 ms.

We can see that in these examples push and set are nearly indistinguishable. However, there is a tiny but consistent gap. On the face of it, this doesn't make sense. If a set is equivalent to a push and a pop, then the two examples should be doing exactly the same operations.

However, particular, set rules do not apply clear_scopes for the matched characters, and they interact differently with meta_content_scope. More details here. The time savings we see using set may be the time it takes to check the scopes for clear_scopes rules. If so, the performance difference would vanish when that bug is fixed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Can reproduce your results. After all the performance gap is really negligible, while code on the other hand can be much more modular and cleaner. I rewrote some of my XML stuff to use multiple push today. By removing several captures the overall performance was even a little bit better in the end. Cool.

You are right with set and meta_content_scope. It's a nasty thing. It was discussed already and decided to keep it as is because changing the behavior would break a lot of default syntaxes, which already worked around this bug.

I know about the prototype is not to be applied to the main context if a syntax is included in another one. Always tried to avoid it as well as the clear_scope. So I can't say much about.

But I guess from all the recent issues about the sanity limits, the engine needs some tweaks anyway to let ST keep working well in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in the process of implementing the Sublime parser in JavaScript. (I still have to implement cross-syntax includes, embed/escape, and with_prototype, but otherwise it should be an exact replica, including tokenization behavior.)

When it's done, I hope to be able to implement suggested fixes that Sublime could then adopt. The first item on that list is combining embed/with_prototype to solve context recursion. I'd also like to implement nondeterminism and parameterized contexts.

The code is messy, undocumented, and in the middle of several different refactors, but you can see some of the odd quirks in Sublime's parsing algorithm. For instance:

  • set rules don't apply clear_scopes to the matched text (as noted above).
  • push rules apply clear_scopes and meta_scopes to the matched text in the wrong order.
  • Tokens are broken at the boundary of capture groups with no associated scope for set, push, and pop rules, but not for plain match rules.
  • Capture groups are sorted in a way that fails in some corner cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a couple of easy to reproduce examples added to your CoreIssue would help the devs much to identify the main glitches. Maybe some syntax_test snippets with what you'd expect and what happens. Something they could directly use to check their modified implementation to.

Copy link
Contributor

@djspiewak djspiewak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't negative lookahead push this into Oniguruma territory? Is the JS mode already using Oniguruma, or would this represent a deoptimization?

@Thom1729
Copy link
Collaborator Author

Negative lookahead is fine; lookbehind isn't.

@wbond wbond merged commit 636a6d9 into sublimehq:master Jun 18, 2018
@Thom1729 Thom1729 deleted the javascript-class-names branch June 18, 2018 18:25
charlievieth pushed a commit to charlievieth/Packages that referenced this pull request Jul 25, 2018
deathaxe pushed a commit to deathaxe/sublime-packages that referenced this pull request Jun 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants