[POC] Auto generate optimized regex to make .sublime-syntax more maintainable #2044

jfcherng opened this issue Aug 4, 2019 · 20 comments


jfcherng commented Aug 4, 2019


The original discussion is at #1740 (comment).

As a tl;dr, please see this diff. It's really hard to just add new tokens into old rules because those regexes are "optimized" at the cost of readability (and therefore maintainability).

Proposed Solution

We store those tokens in a new file, say parser-token.txt, which looks like:

... (lots of lines)

We write a .sublime-syntax template, say parser-token.tpl.yaml, for it.

%YAML 1.2
---
name: PHP {#tokens_file#}
hidden: true
scope: source.php

contexts:
  main:
    - match: (\\)?\b{#regex_optimized#}\b
      scope: support.constant.parser-token.php
      captures:
        1: punctuation.separator.namespace.php

Then we write a Python script to do the following jobs.

  1. load tokens from parser-token.txt
  2. load its corresponding template parser-token.tpl.yaml
  3. generate optimized regex by the content of parser-token.txt
  4. generate compiled syntax file parser-token.sublime-syntax with the template and the regex
    %YAML 1.2
    ---
    name: PHP parser-token
    hidden: true
    scope: source.php
    contexts:
      main:
        - match: (\\)?\b(?:T_(?:A(?:BSTRACT|ND_EQUAL|RRAY(?:_CAST|)|S)|...)\b
          scope: support.constant.parser-token.php
          captures:
            1: punctuation.separator.namespace.php
  5. change PHP Source.sublime-syntax to include the compiled syntax file
    -    - match: (\\)?\bT_(RE(TURN|QUIRE(_ONCE)?)|G(OTO|LOBAL)|...)\b
    -      scope: support.constant.parser-token.php
    -      captures:
    -       1: punctuation.separator.namespace.php
    +    - include: 'Packages/PHP/Tokens/compiled/parser-token.sublime-syntax'

Now, we maintain parser-token.txt rather than the unreadable optimized regex.
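
A minimal sketch of steps 1 to 4 above, not the actual POC script. It factors shared prefixes with a trie and substitutes the result into the template. The function names (build_trie, trie_to_regex, render_template) are hypothetical; only the {#tokens_file#} and {#regex_optimized#} placeholders come from the template above. It writes optional suffixes as (?:...)? where the compiled example above uses an empty alternative.

```python
import re


def build_trie(tokens):
    """Build a nested-dict trie; the key "" marks the end of a token."""
    root = {}
    for token in tokens:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def trie_to_regex(node):
    """Turn a trie node into a regex fragment with shared prefixes factored out."""
    branches = []
    optional = False
    for ch in sorted(node):
        if ch == "":
            optional = True  # some token ends here, so any continuation is optional
        else:
            branches.append(re.escape(ch) + trie_to_regex(node[ch]))
    if not branches:
        return ""
    if len(branches) == 1 and not optional:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    return group + "?" if optional else group


def render_template(template, tokens_name, tokens):
    """Fill the two placeholders used by the template in this proposal."""
    regex = trie_to_regex(build_trie(tokens))
    return (template
            .replace("{#tokens_file#}", tokens_name)
            .replace("{#regex_optimized#}", regex))


# Three of the parser tokens that appear in the compiled example above.
tokens = ["T_ARRAY", "T_ARRAY_CAST", "T_AS"]
regex = trie_to_regex(build_trie(tokens))
rendered = render_template(r"- match: (\\)?\b{#regex_optimized#}\b", "parser-token", tokens)
```

Because the tokens are plain text, re.escape is enough to make each character safe. A real script would also need to read the token file and write the compiled syntax to disk.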

Proof of Concept

Related Scripts

I might have misunderstood the performance tests Briles did that are linked from #1740, but I thought the outcome was that "compacted" regex keywords were actually slower than human-readable ones.

Collaborator Author

jfcherng commented Aug 4, 2019

Since that PR was merged, I assume that @wbond agreed to expand those compact regexes. If the performance becomes even better after expanding them, then the only gain of a compact one is probably the cache file size, which is mentioned in #1740 (comment) by @deathaxe.

The question then becomes: does the cache file size matter? What stops us from expanding them?

FichteFoll commented Aug 4, 2019

What stops us from expanding them?

I believe the only reason that stops us from doing that currently is that nobody's taken the time to do it (and written a script that automates the process). Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.

Thom1729 commented Aug 4, 2019

Has anyone replicated the original performance tests using non-capturing groups?

Collaborator Author

jfcherng commented Aug 4, 2019

I believe the only reason that stops us from doing that currently is that nobody's taken the time to do it (and written a script that automates the process).

can be used to expand those compacted regexes.
But beware of dashes in the tokens: \b(?:A|A-B)\b matches only A in the string A-B.
Sorting the components with the following compare function seems to work well.

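import re

# Used to prevent the following situation (for example, in the CSS syntax):
# if there is a dash in a token, "\b(?:A|A-B)\b" matches only "A" in the string "A-B".
# Returns a negative, zero, or positive int, as a cmp-style comparator.
def token_sort(a: str, b: str) -> int:
    def has_punctuation(s: str) -> bool:
        # True if s contains anything other than word characters
        return not re.match(r"^[0-9a-zA-Z_]+$", s)

    if a == b:
        return 0

    # a longer token that extends b with punctuation must come first
    if a.startswith(b) and has_punctuation(a):
        return -1

    if b.startswith(a) and has_punctuation(b):
        return 1

    # otherwise fall back to plain lexicographic order
    if a > b:
        return 1

    return -1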
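
A minimal usage sketch for the comparator above, assuming Python's re module, whose alternation is also ordered. The comparator is repeated so that the snippet runs on its own, and the three CSS property names are taken from the list later in this thread. functools.cmp_to_key adapts a cmp-style function for sorted.

```python
import re
from functools import cmp_to_key


def has_punctuation(s):
    return re.fullmatch(r"[0-9a-zA-Z_]+", s) is None


def token_sort(a, b):
    # Same ordering as the comparator above: a token that extends another
    # token with punctuation sorts first.
    if a == b:
        return 0
    if a.startswith(b) and has_punctuation(a):
        return -1
    if b.startswith(a) and has_punctuation(b):
        return 1
    return 1 if a > b else -1


tokens = ["border", "border-top", "border-top-color"]
ordered_tokens = sorted(tokens, key=cmp_to_key(token_sort))

naive = r"\b(?:" + "|".join(tokens) + r")\b"
ordered = r"\b(?:" + "|".join(ordered_tokens) + r")\b"

naive_match = re.search(naive, "border-top-color").group()
ordered_match = re.search(ordered, "border-top-color").group()
```

With the naive order, the first alternative wins because \b is satisfied between "r" and "-". After sorting, the longest dashed token is tried first.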

Note that some human discretion is required to declare a pattern as "human readable" and that human-readable and compacted regexes aren't exclusive.

I think we can do it just like that PR: each token should be plain text (no regex is allowed, not even parentheses).

Has anyone replicated the original performance tests using non-capturing groups?

Interesting. There are indeed quite a lot of unused capture groups in the original comparison test.

Collaborator Author

jfcherng commented Aug 5, 2019

Has anyone replicated the original performance tests using non-capturing groups?

I used the test case provided in #349 (comment).

The average time it takes to parse the test case over 10 runs on my machine:

  • Current master: 29X ms

    | top|opacity|cursor|background-image|right|visibility|box-sizing
    | user-select|left|float|margin-left|margin-top|line-height
    | padding-left|z-index|margin-bottom|margin-right|margin
    | vertical-align|padding-top|white-space|border-radius|padding-bottom
    | padding-right|padding|bottom|clear|max-width|box-shadow|content
    | border-color|min-height|min-width|font-style|border-width
    | border-collapse|background-size|text-overflow|max-height|text-transform
    | text-shadow|text-indent|border-style|overflow-y|list-style-type
    | word-wrap|border-spacing|appearance|zoom|overflow-x|border-top-left-radius
    | border-bottom-left-radius|border-top-color|pointer-events
    | border-bottom-color|align-items|justify-content|letter-spacing
    | border-top-right-radius|border-bottom-right-radius|border-right-width
    | font-smoothing|border-bottom-width|border-right-color|direction
    | border-top-width|src|border-left-color|border-left-width
    | tap-highlight-color|table-layout|background-clip|word-break
    | transform-origin|resize|filter|backface-visibility|text-rendering
    | box-orient|transition-property|transition-duration|word-spacing
    | quotes|outline-offset|animation-timing-function|animation-duration
    | animation-name|transition-timing-function|border-bottom-style
    | border-bottom|transition-delay|transition|unicode-bidi|border-top-style
    | border-top|unicode-range|list-style-position|orphans|outline-width
    | line-clamp|order|flex-direction|box-pack|animation-fill-mode
    | outline-color|list-style-image|list-style|touch-action|flex-grow
    | border-left-style|border-left|animation-iteration-count
    | page-break-inside|box-flex|box-align|page-break-after|animation-delay
    | widows|border-right-style|border-right|flex-align|outline-style
    | outline|background-origin|animation-direction|fill-opacity
    | background-attachment|flex-wrap|transform-style|counter-increment
    | overflow-wrap|counter-reset|animation-play-state|animation
    | will-change|box-ordinal-group|image-rendering|mask-image|flex-flow
    | background-position-y|stroke-width|background-position-x|background-position
    | background-blend-mode|flex-shrink|flex-basis|flex-order|flex-item-align
    | flex-line-pack|flex-negative|flex-pack|flex-positive|flex-preferred-size
    | flex|user-drag|font-stretch|column-count|empty-cells|align-self
    | caption-side|mask-size|column-gap|mask-repeat|box-direction
    | font-feature-settings|mask-position|align-content|object-fit
    | columns|text-fill-color|clip-path|stop-color|font-kerning
    | page-break-before|stroke-dasharray|size|fill-rule|border-image-slice
    | column-width|break-inside|column-break-before|border-image-width
    | stroke-dashoffset|border-image-repeat|border-image-outset|line-break
    | stroke-linejoin|stroke-linecap|stroke-miterlimit|stroke-opacity
    | stroke|shape-rendering|border-image-source|border-image|border
    | tab-size|writing-mode|perspective-origin-y|perspective-origin-x
    | perspective-origin|perspective|text-align-last|text-align|clip-rule
    | clip|text-anchor|column-rule-color|box-decoration-break|column-fill
    | fill|column-rule-style|mix-blend-mode|text-emphasis-color
    | baseline-shift|dominant-baseline|page|alignment-baseline
    | column-rule-width|column-rule|break-after|font-variant-ligatures
    | transform-origin-y|transform-origin-x|transform|object-position
    | break-before|column-span|isolation|shape-outside|all
    | color-interpolation-filters|marker|marker-end|marker-start
    | marker-mid|color-rendering|color-interpolation|background-repeat-x
    | background-repeat-y|background-repeat|background|mask-type
    | flood-color|flood-opacity|text-orientation|mask-composite
    | text-emphasis-style|paint-order|lighting-color|shape-margin
    | text-emphasis-position|text-emphasis|shape-image-threshold
    | mask-clip|mask-origin|mask|font-variant-caps|font-variant-alternates
    | font-variant-east-asian|font-variant-numeric|font-variant-position
    | font-variant|font-size-adjust|font-size|font-language-override
    | font-display|font-synthesis|font|line-box-contain|text-justify
    | text-decoration-color|text-decoration-style|text-decoration-line
    | text-decoration|text-underline-position|grid-template-rows
    | grid-template-columns|grid-template-areas|grid-template|rotate|scale
    | translate|scroll-behavior|grid-column-start|grid-column-end
    | grid-column-gap|grid-row-start|grid-row-end|grid-auto-rows
    | grid-area|grid-auto-flow|grid-auto-columns|image-orientation
    | hyphens|overflow-scrolling|overflow|color-profile|kerning
    | nbsp-mode|color|image-resolution|grid-row-gap|grid-row|grid-column
    | blend-mode|azimuth|pause-after|pause-before|pause|pitch-range|pitch
    | text-height|system|negative|prefix|suffix|range|pad|fallback
    | additive-symbols|symbols|speak-as|speak|grid-gap
  • Current master + Undo the PR + Max Compacted: 29X ms


    Generated using this POC.


All cases perform about the same.

In @Briles' benchmark results, the best case (current master) is 348.58 ms and the worst case, Max Compacted, is 477.68 ms. I cannot reproduce that difference. In my test, the Max Compacted case shows no noticeable difference from the Usage Sorted one (the current master branch, untouched).

Here is the fully max-compacted version of the CSS syntax, modified from the current master branch. But, again, there is no noticeable performance difference: still 29X ms.

The size of the cache file Sublime Text 3/Cache/CSS/CSS.sublime-syntax.rcache:

  • Current master: 146 KB
  • After undoing the PR: 145 KB
  • fully max compacted: 123 KB

Thom1729 commented Aug 5, 2019

I've always been under the impression that Sublime compiled all of the regexps from a context into a single finite state machine. Once this is done, matching is incredibly simple, and performance is guaranteed linear in the number of characters examined:

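#include <stdbool.h>

// Table-driven DFA matcher: one table lookup per input character.
bool match(
    int initialState,
    const int* doesStateAccept,
    int** transitionTable,
    const char* string,
    int begin,
    int end
) {
    int state = initialState;
    for (int i = begin; i < end; i++) {
        // cast so that bytes >= 0x80 do not produce a negative index
        state = transitionTable[state][(unsigned char)string[i]];
        if (doesStateAccept[state]) return true;
    }
    return false;
}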

If two regexps are equivalent (match the same set of strings), then they should produce identical FSMs, no matter how thoroughly golfed or “optimized” the expressions may be. This is in contrast to slower-but-more-powerful backtracking-based engines like Oniguruma, in which the structure of the expression can greatly affect performance. (This is why Oniguruma has atomic groups, but FSMs don't need them.)

The above does not account for captures, because I don't know how Sublime implements them. Clearly, tracking capture groups requires more work than not tracking them requires. And we know that Sublime doesn't optimize away unnecessary capture groups, because that would visibly affect tokenization. In #1614, we concluded that these extra captures do measurably affect performance, although perhaps the difference is not noticeable at smaller scales.

If rearranging a capture-less, non-Oniguruma regexp without changing its meaning has any consistent measurable effect on parsing performance, then this calls my understanding of Sublime's regexp engine into serious question, and I thought I had a pretty solid handle on this. Perhaps @wbond or someone from Sublime HQ might be able to confirm or refute my assumptions. If it would affect the way we should write the core syntaxes, then it would be nice to get an authoritative statement on this.

Collaborator Author

jfcherng commented Aug 5, 2019

There are quite a lot of compacted regexes in the PHP syntax.

Now I have a fully expanded PHP syntax file.


The benchmarked file is syntax_test_php.php.

  • Before expanded (current master): 16.X ms on avg over 10 runs
  • After expanded (as above gist): 16.X ms on avg over 10 runs

No noticeable difference.

PHP.sublime-syntax.rcache sizes

  • Before expanded (current master): 3688 KB
  • After expanded (as above gist): 3788 KB


I ran another test with this file to see what happens when there are LOTS of functions in a PHP file. As the test case, I collected all the functions from the expanded regexes into a single PHP file.

  • Before expanded (current master): ~40 ms on avg over 10 runs
  • After expanded (as above gist): ~50 ms on avg over 10 runs

Hmm... 25% extra cost in this case.

Thom1729 commented Aug 5, 2019

Those two syntaxes don't seem to be equivalent. For instance, the support.constant.std.php expression differs: the expanded list contains __COMPILER_HALT_OFFSET__, whereas the compacted version does not seem to match that identifier.
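
This kind of mismatch can be caught mechanically. The sketch below assumes Python's re module behaves like Sublime's engine for plain literals and groups, and it uses made-up data: the compact expression shown is NOT the real one from the PHP syntax, and compare_regex_to_tokens is a hypothetical helper.

```python
import re


def compare_regex_to_tokens(compact_regex, expanded_tokens, probes=()):
    """Return (missing, unexpected) relative to the expanded token list."""
    pattern = re.compile(compact_regex)
    expected = set(expanded_tokens)
    # Tokens from the expanded list that the compact regex fails to match.
    missing = [t for t in expanded_tokens if pattern.fullmatch(t) is None]
    # Probe strings that the compact regex matches but the list does not contain.
    unexpected = [p for p in probes if p not in expected and pattern.fullmatch(p)]
    return missing, unexpected


# Illustrative data only.
expanded = ["__CLASS__", "__FILE__", "__COMPILER_HALT_OFFSET__"]
compact = r"__(?:CLASS|FILE)__"
missing, unexpected = compare_regex_to_tokens(compact, expanded, probes=["__LINE__"])
```

The check in the other direction is only as good as the probe strings supplied, since a regex that matches extra identifiers cannot be detected from the token list alone.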

I ran my own benchmark to zero in on the regexp format question. The test syntaxes are all of the following form:

%YAML 1.2
---
name: PHP Test X
scope: source.php
variables:
  name: # expression
contexts:
  main:
    - match: '{{name}}'
      scope: region.redish
      # set: main

The expression is some representation of the support.constant.ext.php expression from the core PHP syntax: either the compact representation as it is in the syntax, or that version with the capturing groups replaced by noncapturing groups, or @jfcherng's fully expanded version. I removed the punctuation match. The compact version with captures:


The expanded version:

  # snip

The sample file looked like this:

// SYNTAX TEST "Packages/PHPTest/PHP Test Compact Capturing.sublime-syntax"


It contained each of the 841 identifiers matched by the regexp. The entire list of identifiers was repeated 300 times, for a total of 252,302 lines including the syntax test declaration and subsequent blank line.
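
A sketch of how such a sample file could be generated, assuming one identifier per line (which is what the line count implies: 841 × 300 + 2 = 252,302). The identifiers below are placeholders, not the real list, and build_sample is a hypothetical helper.

```python
def build_sample(identifiers, repeats, syntax_path):
    """Return the lines of a syntax test file: header, blank line, then the tokens."""
    lines = ['// SYNTAX TEST "' + syntax_path + '"', ""]
    for _ in range(repeats):
        lines.extend(identifiers)
    return lines


# Placeholder identifiers standing in for the 841 real ones.
identifiers = ["IDENT_%d" % i for i in range(841)]
lines = build_sample(
    identifiers,
    300,
    "Packages/PHPTest/PHP Test Compact Capturing.sublime-syntax",
)
line_count = len(lines)
```

Writing "\n".join(lines) to a file would give one sample per test syntax once the header path is swapped.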

For each of the three expressions, I ran one test with set: main in the match rule and one without. I ran each test three times and took the median, rounded to the nearest millisecond. The results were as follows:

Time    Test
167 ms  Compact, Capturing
167 ms  Compact, Noncapturing
169 ms  Expanded
427 ms  Set, Compact, Capturing
170 ms  Set, Compact, Noncapturing
171 ms  Set, Expanded

My conclusions:

  • Capturing groups are expensive. However, because Sublime ignores unused capture groups for rules that do not modify the stack (see "Syntax highlighting engine quirks", sublime_text#2326), in many cases the difference may be optimized away. (It's worth noting that unless they are optimized away, Sublime will break tokens at capture group boundaries, so unnecessary capture groups will affect hashed color scheme rules.)
  • The expanded regexp does seem to be slightly but measurably slower. This could be because the cache file is larger (which in turn is presumably because it preserves the original expression text somewhere), or it could be because I did those tests later and my CPU was a bit warmer. In any case, the difference seems to be trivial.
  • Stack operations are cheap — or, at least, set is cheap when there are no meta scopes or cleared scopes to deal with. Adding the set rule had only a trivial impact on performance — except in the case of the capturing expression, because it stopped Sublime from ignoring the unused capture groups. Presumably the reason for this is so that pushed contexts can refer to the captures with faux-backreferences.

tl;dr: Regexp formatting doesn't matter, but don't use unnecessary capture groups.

If that is true, just use whatever stemming seems natural?

  CURL_HTTP_VERSION_(?:1_0|1_1|NONE)                                                   # Sure.
  |(?:XML_TEXT|CDATA|ENTITY(?:_REF)?)_NODE                                             # Maybe?

deathaxe commented Aug 5, 2019

I confirm all your conclusions. This is exactly what I've experienced so far.

Maybe one addition:

  • Negative lookaheads are about 5 to 10% more expensive than positive lookaheads with the same functionality.

For example, take the rule

- match: \<(?=[^<])

Replace the lookahead with (?!\<) and run the syntax tests against a file with 100k lines containing

print <angular quoted text>;

positive lookahead: 373ms.
negative lookahead: 401ms. (+7.5%)
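
A small check of the two lookahead forms, assuming Python's re module as a stand-in. It is a backtracking engine, so it says nothing about Sublime's timings; it only shows where the two patterns agree. They match at the same position on the sample line, and they differ only when "<" is the very last character of the input, which should not arise if every line handed to the engine ends with a newline.

```python
import re

# Unescaped "<" is used here; the rule above writes it as "\<".
positive = re.compile(r"<(?=[^<])")
negative = re.compile(r"<(?!<)")

line = "print <angular quoted text>;\n"
pos_start = positive.search(line).start()
neg_start = negative.search(line).start()

# Edge case: a "<" that is the last character of the input.
pos_at_end = positive.search("a <")   # needs a following character
neg_at_end = negative.search("a <")   # only forbids a following "<"

# Relative slowdown reported above: 373 ms -> 401 ms.
slowdown_percent = round((401 - 373) / 373 * 100, 1)
```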

Thom1729 commented Aug 5, 2019

That's interesting, because both expressions should produce the same FSM — at least, assuming that lookahead is implemented with AFAs and then turned into an NFA via the powerset construction, which is admittedly mere speculation. Maybe there's some sort of special lookahead implementation.

deathaxe commented Aug 5, 2019

Somehow the implementation must differ, as negative lookaheads work when nested in another lookahead, but positive ones don't: (?=(?!blabla)) is OK, but in (?=bla(?=bla)) the nested one is ignored.

Thom1729 commented Aug 5, 2019

Yeah, I just noticed that when trying to see if I could speed up the JavaScript syntax by rewriting {{identifier_break}} as a positive lookahead. This seems like a rather serious bug.

Thom1729 commented Aug 5, 2019

I dug into this and it's even weirder. See referenced issue.

wbond commented Aug 5, 2019

This seems like a rather serious bug.

A serious bug that more or less no one has run into since… people don't like reading regexes like that? :-D

Thom1729 commented Aug 5, 2019

Having plumbed the depths of the bug in the course of writing #2918, my estimation of its severity has lessened, mostly because it only applies in fairly specific circumstances. If it were really as general as it seemed at first, I imagine that someone would have noticed it before now.

Do you have any insight into how capture groups and lookaheads are implemented in the internal engine? Those are the only supported features that don't have a standard, no-brainer FSM implementation, and the evidence seems to indicate that they're not simply compiled into a slightly-fancier FSM.

wbond commented Aug 5, 2019

I don't know off of the top of my head, but I do believe at least some of those details are "proprietary". 🙂

@jfcherng As the original topic seems to be moot, should this issue be closed?

Collaborator Author

jfcherng commented Oct 24, 2019

My major concern about the PHP syntax has most likely been more or less solved in #2134, hence closing.

@jfcherng jfcherng mentioned this issue Jan 2, 2025
