Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] optimize CSS class queries #2135

Closed
3 tasks done
flavorjones opened this issue Dec 17, 2020 · 2 comments
Closed
3 tasks done

[feature] optimize CSS class queries #2135

flavorjones opened this issue Dec 17, 2020 · 2 comments

Comments

@flavorjones
Copy link
Member

flavorjones commented Dec 17, 2020

What problem are you trying to solve?

Currently (v1.10.10), Nokogiri turns the CSS query .red into the XPath query:

"//*[contains(concat(' ', normalize-space(@class), ' '), ' red ')]"

which is doing a lot of string manipulation under the hood:

  • normalize-space creates a new string buffer and assembles it one byte at a time, cleaning up whitespace along the way (xpath.c:xmlXPathNormalizeFunction), then strdups that string
  • concat is pretty expensive, allocating new strings and repeatedly calling strlen and strdup

It occurs to me that we could probably optimize this by registering a C function that would search through the class attribute looking for a class name match without all of the associated string manipulation.

Let's tentatively call that function css-class, and this would be the generated XPath:

"//*[css-class(@class, 'red')]"

I think what I'd like to do is:

  • benchmark the current XPath query in a C implementation
  • write the css-class XPath function in C and benchmark it
  • if this seems like a win, then let's update Nokogiri to use it, and benchmark before-and-after in Ruby
@flavorjones
Copy link
Member Author

OK, I wrote a C function and wired it into a benchmark for XPath, and the TL;DR is that it's about 2x as fast. Here's the result:

searching '../test/files/tlm.html' with '//*[contains(concat(' ', normalize-space(@class), ' '), ' kw3 ')]' 10000 times
NODESET with 25 results
8131 ms
searching '../test/files/tlm.html' with '//*[nokogiri-css-class(@class, 'kw3')]' 10000 times
NODESET with 25 results
4246 ms

I'm going to clean it up and ship a PR.

flavorjones added a commit that referenced this issue Dec 18, 2020
available as `nokogiri-builtin:css-class`

Part of #2135
@flavorjones
Copy link
Member Author

PR is at #2137, but so I can wrap this issue up, the "builtin" method is ~2x faster on CRuby, but almost ~2x slower on JRuby for reasons that completely escape me. Here are the benchmarks:

ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
Warming up --------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        71.000  i/100ms
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                       135.000  i/100ms
Calculating -------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        681.312  (± 5.9%) i/s -      6.816k in  10.041631s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                          1.343k (± 5.9%) i/s -     13.500k in  10.090504s

Comparison:
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]"):     1343.0 i/s
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"):      681.3 i/s - 1.97x  (± 0.00) slower

jruby 9.2.9.0 (2.5.7) 2019-10-30 458ad3e OpenJDK 64-Bit Server VM 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 on 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 [linux-x86_64]
Warming up --------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        74.000  i/100ms
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                        41.000  i/100ms
Calculating -------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        814.536  (± 9.6%) i/s -      8.066k in  10.022432s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                        443.781  (± 6.8%) i/s -      4.428k in  10.029857s

Comparison:
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"):      814.5 i/s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]"):      443.8 i/s - 1.84x  (± 0.00) slower

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant