[feature] optimize CSS class queries #2135

flavorjones · 2020-12-17T20:13:51Z

What problem are you trying to solve?

Currently (v1.10.10), Nokogiri turns the CSS query .red into the XPath query:

"//*[contains(concat(' ', normalize-space(@class), ' '), ' red ')]"

which is doing a lot of string manipulation under the hood:

- `normalize-space` creates a new string buffer and assembles it one byte at a time, cleaning up whitespace along the way (`xpath.c:xmlXPathNormalizeFunction`), then `strdup`s that string
- `concat` is pretty expensive, allocating new strings and repeatedly calling `strlen` and `strdup`
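
To make the cost concrete, here is a rough pure-Ruby emulation of what that predicate computes for each element. This is only an illustration of the semantics, not how libxml2 implements it; the method names `normalize_space` and `xpath_class_match?` are made up for this sketch.

```ruby
# Illustrative emulation of the XPath predicate
#   contains(concat(' ', normalize-space(@class), ' '), ' red ')
# The helper names are hypothetical; libxml2 does this work in C.

# XPath normalize-space(): trim leading/trailing whitespace and collapse
# internal runs of XPath whitespace (space, tab, CR, LF) to one space.
def normalize_space(value)
  value.split(/[ \t\r\n]+/).reject(&:empty?).join(" ")
end

def xpath_class_match?(class_attr, name)
  # A missing attribute behaves like an empty string in XPath.
  normalized = normalize_space(class_attr.to_s) # temporary string 1
  padded = " " + normalized + " "               # temporary string 2 (concat)
  padded.include?(" " + name + " ")             # contains
end
```

Every element visited pays for at least two temporary strings before the substring search even starts, which is the overhead described above.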

It occurs to me that we could probably optimize this by registering a C function that would search through the class attribute looking for a class name match without all of the associated string manipulation.
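
As a sketch of the matching logic such a function might implement, here it is in Ruby for readability (the real thing would be C). It scans the attribute value in place and only checks the characters on either side of each candidate match, so nothing is normalized or concatenated. The name `css_class_match?` and the separator set are assumptions for illustration; the separators mirror XPath's whitespace definition so results stay consistent with the existing query.

```ruby
# Hypothetical helper sketching the token scan; not Nokogiri's actual code.
# Separators assumed to match XPath whitespace: space, tab, CR, LF.
CLASS_SEPARATORS = " \t\r\n"

def css_class_match?(class_attr, name)
  return false if class_attr.nil? || name.empty?

  offset = 0
  # Look at each occurrence of the name in turn, without copying the value.
  while (found = class_attr.index(name, offset))
    after = found + name.length
    # A real match must start at the beginning or right after a separator...
    before_ok = found.zero? || CLASS_SEPARATORS.include?(class_attr[found - 1])
    # ...and end at the end of the value or right before a separator.
    after_ok = after == class_attr.length || CLASS_SEPARATORS.include?(class_attr[after])
    return true if before_ok && after_ok

    offset = found + 1
  end
  false
end
```

For example, `css_class_match?("big red", "red")` is true, while `css_class_match?("redder", "red")` is false because the occurrence is followed by a non-separator character.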

Let's tentatively call that function css-class, and this would be the generated XPath:

"//*[css-class(@class, 'red')]"

I think what I'd like to do is:

1. benchmark the current XPath query in a C implementation
2. write the css-class XPath function in C and benchmark it
3. if this seems like a win, then let's update Nokogiri to use it, and benchmark before-and-after in Ruby


flavorjones · 2020-12-18T03:20:45Z

OK, I wrote a C function and wired it into a benchmark for XPath, and the TL;DR is that it's about 2x as fast (8131 ms vs. 4246 ms for 10000 searches, roughly 1.9x). Here's the result:

searching '../test/files/tlm.html' with '//*[contains(concat(' ', normalize-space(@class), ' '), ' kw3 ')]' 10000 times
NODESET with 25 results
8131 ms
searching '../test/files/tlm.html' with '//*[nokogiri-css-class(@class, 'kw3')]' 10000 times
NODESET with 25 results
4246 ms

I'm going to clean it up and ship a PR.


flavorjones · 2020-12-18T19:36:27Z

PR is at #2137. To wrap this issue up: the "builtin" method is ~2x faster on CRuby (1.97x), but almost 2x slower on JRuby (1.84x), for reasons that completely escape me. Here are the benchmarks:

ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
Warming up --------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        71.000  i/100ms
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                       135.000  i/100ms
Calculating -------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        681.312  (± 5.9%) i/s -      6.816k in  10.041631s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                          1.343k (± 5.9%) i/s -     13.500k in  10.090504s

Comparison:
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]"):     1343.0 i/s
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"):      681.3 i/s - 1.97x  (± 0.00) slower

jruby 9.2.9.0 (2.5.7) 2019-10-30 458ad3e OpenJDK 64-Bit Server VM 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 on 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 [linux-x86_64]
Warming up --------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        74.000  i/100ms
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                        41.000  i/100ms
Calculating -------------------------------------
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]")
                        814.536  (± 9.6%) i/s -      8.066k in  10.022432s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]")
                        443.781  (± 6.8%) i/s -      4.428k in  10.029857s

Comparison:
xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' xxxx ')]"):      814.5 i/s
xpath("//*[nokogiri-builtin:css-class(@class, 'xxxx')]"):      443.8 i/s - 1.84x  (± 0.00) slower

flavorjones added the meta/feature-request and topic/performance labels Dec 17, 2020

flavorjones mentioned this issue Dec 17, 2020

[bug] Nokogiri::XML::Searchable#at_css is 3-4 times slower on 2 MB HTML files comparing to 640 KB files #2133

Closed

flavorjones added a commit that referenced this issue Dec 18, 2020

feat: provide an XPath function for fast CSS class lookup

7566d4b

available as `nokogiri-builtin:css-class` Part of #2135

flavorjones mentioned this issue Dec 18, 2020

Explore why the builtin XPath function for CSS class selector is so slow on JRuby #2138

Open

flavorjones closed this as completed in 5d0b7fe Dec 18, 2020
