This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

25x speed improvement in tokenizing jquery.min: maxCachedIndex should not decrease due to a cache miss #40

Merged
2 commits merged into atom:v4.x on Sep 21, 2015

Conversation

alexdima
Contributor

@kevinsawicki @zcbenz I would be most thankful if you could take a look at this PR, as both VSCode and Atom would benefit from it :).

TL;DR
maxCachedIndex is a safeguard against uninitialized access to results (it ensures the regex at a given index has been tried at least once). It should therefore only grow in the Search method; it is correctly reset to -1 in the Clear method.

Long version
I noticed the following incorrect behavior through logging while running the first-mate benchmark. Consider the following example:

  • There is a scanner with 3 regular expressions that is asked to scan repeatedly over a string of 31k characters.
  • The first time it is asked to scan (at location 0):
    • regex0: cache miss, matches at location 3
    • regex1: cache miss, matches at location 30245
    • regex2: cache miss, matches at location 0
  • Scanning stops; suppose regex2 consumes 4 characters.
  • The scanner now scans from location 4 in the same string:
    • regex0: cache miss (because result->LocationAt(0) >= byteOffset)
    • maxCachedIndex gets incorrectly set to 0
    • regex1: cache miss, so the regex walks the same string again only to match again at location 30245 (see the sketch after this list)
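
To make the bug concrete, here is a minimal sketch of the caching invariant as I read it from the description above; the class and member names are reconstructions for illustration, not the actual node-oniguruma source:

```cpp
// Sketch only: reconstructed from the PR description, not the real
// node-oniguruma code. cachedResults[i] holds the last match result
// for regex i; maxCachedIndex records the highest regex index whose
// result slot has been written at least once since the last Clear().
#include <algorithm>
#include <vector>

struct OnigResult; // opaque per-regex match result

class SearcherSketch {
  std::vector<OnigResult*> cachedResults;
  int maxCachedIndex;

 public:
  explicit SearcherSketch(size_t regexCount)
      : cachedResults(regexCount, nullptr), maxCachedIndex(-1) {}

  void Clear() {
    maxCachedIndex = -1; // nothing is initialized anymore
  }

  bool HasCachedResult(int index) const {
    // Reading cachedResults[index] is safe only if that slot has been
    // written since the last Clear().
    return index <= maxCachedIndex;
  }

  void StoreResult(int index, OnigResult* result) {
    cachedResults[index] = result;
    // Buggy version: maxCachedIndex = index;
    //   a miss on regex0 after regex1..n were cached shrinks the guard
    //   and discards their still-valid results.
    // Fixed version: the guard only grows between Clear() calls.
    maxCachedIndex = std::max(maxCachedIndex, index);
  }
};
```

With the guard only growing, the miss on regex0 at location 4 no longer invalidates the cached regex1 match at 30245, which is exactly the rescan that made the minified files so slow.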

Turns out this makes a huge difference in practice:

Before my proposed change:

```
c:\Alex\src\first-mate>npm run benchmark

> [email protected] benchmark c:\Alex\src\first-mate
> coffee benchmark/benchmark.coffee


Tokenizing jQuery v2.0.3
Generated tokens for 8830 lines in 718ms (0 tokens/ms)

Tokenizing jQuery v2.0.3 minified
Generated tokens for 7 lines in 26253ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1
Generated tokens for 5786 lines in 281ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1 minified
Generated tokens for 7 lines in 15196ms (0 tokens/ms)
```

After my proposed change:

```
c:\Alex\src\first-mate>npm run benchmark

> [email protected] benchmark c:\Alex\src\first-mate
> coffee benchmark/benchmark.coffee


Tokenizing jQuery v2.0.3
Generated tokens for 8830 lines in 678ms (0 tokens/ms)

Tokenizing jQuery v2.0.3 minified
Generated tokens for 7 lines in 1084ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1
Generated tokens for 5786 lines in 266ms (0 tokens/ms)

Tokenizing Bootstrap CSS v3.1.1 minified
Generated tokens for 7 lines in 1744ms (0 tokens/ms)
```

About running the benchmarks:

  • using node v0.10.40
  • running on Windows
  • I am not familiar enough with v8, but I had to make the following unrelated change just to get the string cache to work at all (IsSame was always false for me on this node version). I ran both benchmarks with this change applied; without it, I gave up after 20 minutes of waiting:

```cpp
bool OnigStringContext::IsSame(Handle<String> other) const {
    if (v8String->Length() != other->Length()) {
        return false;
    }
    // Compare string contents. The old line below compared handle
    // identity, which is false for equal-valued but distinct string
    // objects, so the cache never hit.
    return v8String->StrictEquals(other);
    // return v8String == other;
}
```
  • I also had to modify first-mate's benchmark.coffee to not count the resulting tokens, because the format returned by tokenizeLines has changed.

I would be very thankful if you would cherry-pick this to master and also publish a new v4 on npm with this change, if you agree with my reasoning.

Thank you

@zcbenz
Contributor

zcbenz commented Sep 21, 2015

This change looks good to me, and the improvement is awesome!

> I am not familiar enough with v8, but I had to make the following unrelated change just to get the string cache to work at all (IsSame was always false for me on this node version). I ran both benchmarks with this change applied; without it, I gave up after 20 minutes of waiting:

The current implementation of OnigStringContext::IsSame is wrong; can you replace it with the one you pasted? We don't need to compare the string lengths, though; the StrictEquals method already includes many optimizations.
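
In other words, the suggestion is to drop the length check entirely; a sketch of the simplified version (this is my reading of the comment above, not necessarily the commit that landed):

```cpp
bool OnigStringContext::IsSame(Handle<String> other) const {
  // Per the comment above, StrictEquals already performs the cheap
  // checks (such as comparing lengths) internally.
  return v8String->StrictEquals(other);
}
```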

> I would be very thankful if you would cherry-pick this to master and also publish a new v4 on npm with this change, if you agree with my reasoning.

It sounds good to me.

@alexdima
Contributor Author

@zcbenz Thank you for the feedback.

In node v0.12, Persistent no longer inherits from Handle, so, using the examples at https://strongloop.com/strongblog/node-js-v0-12-c-apis-breaking/ as a guide, here's what I came up with:

```cpp
bool OnigStringContext::IsSame(Handle<String> other) const {
#if (0 == NODE_MAJOR_VERSION && 11 <= NODE_MINOR_VERSION) || (1 <= NODE_MAJOR_VERSION)
  // Node >= 0.11 / io.js: Persistent is no longer a Handle, so a Local
  // must be materialized from it before comparing.
  return other->StrictEquals(v8::Local<String>::New(Isolate::GetCurrent(), v8String));
#else
  return other->StrictEquals(v8String);
#endif
}
```

Is this a good way to do it? (i.e. no memory leaks due to Local::New?)
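
For context on the leak question (my understanding of the V8 handle model, not something stated in this thread): a Local created by v8::Local::New lives in the innermost active HandleScope and is reclaimed when that scope exits, so calls made during a native method invocation do not leak. A minimal sketch with a hypothetical caller:

```cpp
#include <v8.h>

// Hypothetical illustration, not code from this addon: a HandleScope
// owns every Local created while it is active, including the Local
// that Local::New creates inside IsSame, and releases them all when
// the scope is destroyed at the end of the call.
void SomeAddonMethod(const v8::FunctionCallbackInfo<v8::Value>& args) {
  v8::HandleScope scope(args.GetIsolate());
  // ... context->IsSame(args[0].As<v8::String>()) would be safe here:
  // the Local that Local::New allocates dies with `scope`.
}
```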

Thank you

@zcbenz
Contributor

zcbenz commented Sep 21, 2015

It looks perfect to me, thanks!

zcbenz added a commit that referenced this pull request Sep 21, 2015
25x speed improvement in tokenizing jquery.min: maxCachedIndex should not decrease due to a cache miss
@zcbenz zcbenz merged commit 7da0686 into atom:v4.x Sep 21, 2015
@kevinsawicki
Contributor

Thanks so much for this 🐎
