[POC] Auto generate optimized regex to make .sublime-syntax more maintainable #2044
I might have misunderstood the performance tests Briles did that are linked from #1740, but I thought the outcome was that "compacted" regex keywords were actually slower than human-readable ones.
Since that PR was merged, I assume that @wbond agreed to expand those compact regexes. If the performance becomes even better after expanding them, then the only gain from a compact one is probably the cached file size, which is mentioned in #1740 (comment) by @deathaxe. The question then becomes: does the cache file size matter? What stops us from expanding them?
I believe the only thing currently stopping us is that nobody's taken the time to do it (and written a script that automates the process). Note that some human discretion is required to declare a pattern "human readable", and that human-readable and compacted regexes aren't mutually exclusive.
Has anyone replicated the original performance tests using non-capturing groups?
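For reference, a minimal Python illustration (my own, not from the thread) of what swapping capturing groups for non-capturing ones means:

```python
import re

# A capturing group retains the matched text; a non-capturing group only groups.
print(re.match(r"(foo|bar)baz", "foobaz").groups())    # ('foo',)
print(re.match(r"(?:foo|bar)baz", "foobaz").groups())  # ()
```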
https://github.com/asciimoo/exrex can be used to expand those compacted regexes.

```python
import re

# Comparator used to prevent the following situation (for example, in the CSS
# syntax): if there is a dash in the regex, "\b(?:A|A-B)\b" matches only "A"
# in the string "A-B", so longer punctuated tokens must sort first.
def token_sort(a: str, b: str) -> int:
    def has_punctuation(s: str) -> bool:
        return not re.match(r"^[0-9a-zA-Z_]+$", s)

    if a == b:
        return 0
    if a.startswith(b) and has_punctuation(a):
        return -1
    if b.startswith(a) and has_punctuation(b):
        return 1
    return 1 if a > b else -1
```
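As a sketch of how this could fit together (my own example; the `compacted` pattern is made up, and `token_sort` is the comparator above):

```python
from functools import cmp_to_key

import exrex  # pip install exrex

compacted = r"A|A\-B|AB"  # hypothetical compacted alternation
# Enumerate every string the (finite) pattern matches, then order them so
# punctuated tokens precede their prefixes.
tokens = sorted(set(exrex.generate(compacted)), key=cmp_to_key(token_sort))
print(tokens)  # ['A-B', 'A', 'AB']
readable = r"\b(?:" + "|".join(tokens) + r")\b"
```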
I think we could do it just like that PR: https://github.com/sublimehq/Packages/pull/349/files.
Interesting. There are indeed quite a lot of unused capture groups in the original comparison test.
I used the test case provided in #349 (comment). The average time it took to parse the test case over 10 runs on my machine:
All cases are quite the same. In @Briles' benchmark results, the best case (current master) is 348.58 ms; the worst-case figure, the full timing table, and the cached file sizes are not reproduced here.
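For anyone wanting to reproduce this kind of comparison, a rough sketch using Python's re module (my own, not the benchmark from this thread; Python's engine is backtracking-based, unlike Sublime's, so the numbers are only illustrative):

```python
import re
import timeit

# Two expressions matching the same set of words: one "golfed", one expanded.
compacted = re.compile(r"\b(?:ab(?:s|br)|a(?:dd|nd))\b")
expanded = re.compile(r"\b(?:abs|abbr|add|and)\b")

text = " ".join(["abbr", "and", "zzz"] * 10_000)

for name, rx in (("compacted", compacted), ("expanded", expanded)):
    elapsed = timeit.timeit(lambda: rx.findall(text), number=10)
    print(f"{name}: {elapsed:.3f}s")
```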
I've always been under the impression that Sublime compiled all of the regexps from a context into a single finite state machine. Once this is done, matching is incredibly simple, and performance is guaranteed linear in the number of characters examined:

```c
#include <stdbool.h>

bool match(
    int initialState,
    int* doesStateAccept,
    int** transitionTable,
    char* string,
    int begin,
    int end
) {
    int state = initialState;
    for (int i = begin; i < end; i++) {
        // cast so that bytes >= 0x80 don't index the table negatively
        state = transitionTable[state][(unsigned char)string[i]];
        if (doesStateAccept[state]) return true;
    }
    return false;
}
```

If two regexps are equivalent (match the same set of strings), then they should produce identical FSMs, no matter how thoroughly golfed or “optimized” the expressions may be. This is in contrast to slower-but-more-powerful backtracking-based engines like Oniguruma, in which the structure of the expression can greatly affect performance. (This is why Oniguruma has atomic groups, but FSMs don't need them.)

The above does not account for captures, because I don't know how Sublime implements them. Clearly, tracking capture groups requires more work than not tracking them. And we know that Sublime doesn't optimize away unnecessary capture groups, because that would visibly affect tokenization. In #1614, we concluded that these extra captures do measurably affect performance, although perhaps the difference is not noticeable at smaller scales.

If rearranging a capture-less, non-Oniguruma regexp without changing its meaning has any consistent, measurable effect on parsing performance, then this calls my understanding of Sublime's regexp engine into serious question, and I thought I had a pretty solid handle on this. Perhaps @wbond or someone from Sublime HQ might be able to confirm or refute my assumptions. If it would affect the way we should write the core syntaxes, it would be nice to get an authoritative statement on this.
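To make the table-driven idea concrete, here is a toy Python transliteration of that matcher (my own sketch) with a hand-built table for `ab|ac`; the same table serves the equivalent form `a(?:b|c)`, which is exactly the point about equivalent expressions:

```python
# DFA for ab|ac (equivalently a(?:b|c)). State 0 = start, 2 = accepting,
# 3 = dead sink; transitions not listed fall into the dead state.
TABLE = {
    (0, "a"): 1,
    (1, "b"): 2,
    (1, "c"): 2,
}
ACCEPTING = {2}

def match(string: str, begin: int, end: int) -> bool:
    state = 0
    for i in range(begin, end):
        state = TABLE.get((state, string[i]), 3)
        if state in ACCEPTING:
            return True
    return False

assert match("ab", 0, 2) and match("ac", 0, 2) and not match("ad", 0, 2)
```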
There are quite a lot of compacted regexes in PHP's syntax. Now I have a fully expanded PHP syntax file.

Performance

The benchmarked file is …
No noticeable difference.
Those two syntaxes don't seem to be equivalent. For instance, the …

I ran my own benchmark to zero in on the regexp format question. The test syntaxes are all of the following form:
The expression is some representation of the
The expanded version:
The sample file looked like this:
It contained each of the 841 identifiers matched by the regexp. The entire list of identifiers was repeated 300 times, for a total of 252,302 lines including the syntax test declaration and subsequent blank line. For each of the three expressions, I ran one test with
My conclusions:
tl;dr: Regexp formatting doesn't matter, but don't use unnecessary capture groups.
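A sample file like the one described above could be generated along these lines (my own sketch; exrex does the expansion, and the syntax path and pattern are placeholders):

```python
import exrex

EXPR = r"alpha|beta|gamma"  # stand-in for the real 841-identifier expression
identifiers = sorted(set(exrex.generate(EXPR)))

# Syntax-test declaration, a blank line, then the whole list repeated 300
# times: with 841 identifiers that is 2 + 841 * 300 = 252,302 lines.
lines = ['# SYNTAX TEST "Packages/Example/Example.sublime-syntax"', ""]
lines += identifiers * 300
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```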
If that is true, just use whatever stemming seems natural?
I can confirm all your conclusions. This is exactly what I've experienced so far. Maybe one addition:

Example: Packages/Perl/Perl.sublime-syntax, line 1076 (at d1494f4)
Replacing the lookahead by a direct match of `print <angular quoted text>;`: … ms; with the positive lookahead: 373 ms.
That's interesting, because both expressions should produce the same FSM — at least, assuming that lookahead is implemented with AFAs and then turned into an NFA via the powerset construction, which is admittedly mere speculation. Maybe there's some sort of special lookahead implementation.
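For reference, the powerset (subset) construction mentioned there, as a minimal Python sketch of my own; the NFA is encoded as `{state: {symbol: set_of_states}}` with `""` for epsilon moves:

```python
from collections import deque

def eps_closure(nfa, states):
    # Expand a set of NFA states across epsilon ("") transitions.
    stack, closure = list(states), set(states)
    while stack:
        for t in nfa.get(stack.pop(), {}).get("", ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def powerset_construct(nfa, start, alphabet):
    # Each DFA state is a frozenset of NFA states.
    start_set = eps_closure(nfa, {start})
    dfa, todo = {}, deque([start_set])
    while todo:
        cur = todo.popleft()
        if cur in dfa:
            continue
        dfa[cur] = {}
        for sym in alphabet:
            reachable = set()
            for s in cur:
                reachable |= nfa.get(s, {}).get(sym, set())
            dfa[cur][sym] = eps_closure(nfa, reachable)
            todo.append(dfa[cur][sym])
    return start_set, dfa
```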
Somehow the implementation must differ, as negative lookaheads work if nested in another one, but positive lookaheads don't.
Yeah, I just noticed that when trying to see if I could speed up the JavaScript syntax by rewriting …
I dug into this and it's even weirder. See referenced issue.
A serious bug that more or less no one has run into since… people don't like reading regexes like that? :-D |
Having plumbed the depths of the bug in the course of writing #2918, my estimation of its severity has lessened, mostly because it only applies in fairly specific circumstances. If it were really as general as it seemed at first, I imagine that someone would have noticed it before now.

Do you have any insight into how capture groups and lookaheads are implemented in the internal engine? Those are the only supported features that don't have a standard, no-brainer FSM implementation, and the evidence seems to indicate that they're not simply compiled into a slightly-fancier FSM.
I don't know off the top of my head, but I do believe at least some of those details are "proprietary". 🙂
@jfcherng As the original topic seems to be moot, should this issue be closed? |
My major concern about the PHP syntax has most likely been more or less solved by #2134, hence closing.
Problem
The original discussion is at #1740 (comment).
As a tl;dr, please see this diff. It's really hard to just add new tokens into the old rules because those regexes are "optimized" at the cost of readability (maintainability).
Proposed Solution
1. We store those tokens in a new file, say `parser-token.txt`, which looks like …
2. We write a `.sublime-syntax` template, say `parser-token.tpl.yaml`, for it.
3. Then, we write a Python script, say `compile_templates.py`, to do the following jobs (a rough sketch follows this list):
   - read the tokens from `parser-token.txt`
   - read the template `parser-token.tpl.yaml`
   - generate an optimized regex from the tokens in `parser-token.txt`
   - compile `parser-token.sublime-syntax` with the template and the regex
   - modify `PHP Source.sublime-syntax` to include the compiled syntax file

Now, we maintain `parser-token.txt` rather than the unreadable optimized regex.
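A minimal sketch of what such a script might look like (my own; the file names, the `{{regex}}` placeholder, and the longest-first ordering are illustrative assumptions, not the final design; the real POC is in the branch below):

```python
import re

def load_tokens(path: str) -> list[str]:
    # One token per line; ignore blanks and duplicates.
    with open(path, encoding="utf-8") as f:
        return sorted({line.strip() for line in f if line.strip()})

def build_regex(tokens: list[str]) -> str:
    # Longest-first so "A-B" wins over its prefix "A" (cf. token_sort above).
    ordered = sorted(tokens, key=len, reverse=True)
    return r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b"

def compile_template(tpl_path: str, out_path: str, regex: str) -> None:
    with open(tpl_path, encoding="utf-8") as f:
        template = f.read()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(template.replace("{{regex}}", regex))

if __name__ == "__main__":
    tokens = load_tokens("parser-token.txt")
    compile_template(
        "parser-token.tpl.yaml",
        "parser-token.sublime-syntax",
        build_regex(tokens),
    )
```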
Proof of Concept
https://github.com/jfcherng/Packages/tree/poc-regex-optimize/PHP/Tokens
Related Scripts