Skip to content
This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

Add :max_parse_errors argument to .parse, .get and .fragment with 0 as default value #65

Merged
merged 4 commits into from
Jan 25, 2018

Conversation

rafbm
Copy link
Contributor

@rafbm rafbm commented Jan 20, 2018

Nokogumbo version 1.4.10 introduced tracking of Gumbo parse errors into an @errors array on the document. Since upgrading to this version, we started seeing our servers use crazy amounts of memory, swap, and almost instantly crash on some occasions.

Background: we run an email client. Needless to say, we get our fair share of malformed HTML. The specific case that led me to investigate this memory issue was a newsletter that looked legit but ended with the string "</body></html></table></div>" repeated 20,000+ times, with no newline. This means 60,000+ parse errors pushed into @errors which, probably due to each error containing a string of the full HTML line where it happened, caused insane memory usage.

The only solution for us was to lock Nokogumbo to 1.4.9 in our Gemfile, because this part of code being a C extension, we cannot monkey-patch some Ruby to remove the @errors functionality.

If you ask me, I’d say parse error tracking should have been an opt-in feature in the first place. I believe a tiny fraction of Nokogumbo users need it, and for everyone else it’s just useless overhead added to the parse method. Yet, version 1.4.10 has been out for over a year and starting to require an explicit argument to get @errors tracking would break backward compatibility. Thus, I went with an optional :max_parse_errors argument, allowing us to pass 0 to disable the feature. I’ve put a default limit of 100 because I think avoiding crashes like explained above for all Nokogumbo users by default is a significant improvement. Something I haven’t done is accepting max_parse_errors: nil as a way to restore the previous limitless behavior. I wasn’t sure whether this was relevant.

Despite the backward incompatiblity it would introduce, it’s worth considering that @jeremy’s pull request on html-proofer (which apparently was the main motivation for releasing the feature in Nokogumbo) is still open, and the Nokogumbo documentation doesn’t mention the error tracking functionality anywhere. So if you guys agree that breaking backward compatibility in this case would be negligible, I think making the feature effectively opt-in would be the best scenario. It would allow regular usage of Nokogumbo with 0 performance impact for users not needing the feature. The only change required to my code would be replacing 100 with 0.

Happy to hear your thoughts.

Default value is 100. This avoids potentially extreme memory usage when parsing a document with a huge number of HTML errors.
@stevecheckoway
Copy link
Collaborator

This seems reasonable to me. Having a default of 0 also seems reasonable.

This would also solve the issue of certain mal-formed HTML with embedded zeros causing a crash due to a bug in gumbo, at least as long as no errors were requested.

@rubys
Copy link
Owner

rubys commented Jan 20, 2018

This seems reasonable to me. Having a default of 0 also seems reasonable.

ditto

This makes Gumbo parse error tracking opt-in for users wanting it.
@rafbm
Copy link
Contributor Author

rafbm commented Jan 20, 2018

Thanks for the quick feedback guys.

This would also solve the issue of certain mal-formed HTML with embedded zeros causing a crash due to a bug in gumbo, at least as long as no errors were requested.

Very good point!

I’ve changed the default to 0.

This check will be falsy most of the time now that 0 is the default.
@rafbm rafbm changed the title Add :max_parse_errors argument to .parse, .get and .fragment with 100 as default value Add :max_parse_errors argument to .parse, .get and .fragment with 0 as default value Jan 20, 2018
@rafbm
Copy link
Contributor Author

rafbm commented Jan 24, 2018

If nobody opposes, a new gem release would be great. It can solve memory issues any Nokogumbo user may face today.

@rubys
Copy link
Owner

rubys commented Jan 24, 2018

Just a small nit: I note that you reordered the check for max_errors first.

It seems to me that you could compute the length of the output array once and use it both in the call to rb_ary_new2 and in the for loop below.

@rafbm
Copy link
Contributor Author

rafbm commented Jan 24, 2018

I realized Gumbo itself actually supports a max_errors option:

https://github.com/google/gumbo-parser/blob/aa91b27b02c0c80c482e24348a457ed7c3c088e0/src/gumbo.h#L588-L595

So I reverted a bunch of the code I did to instead initiate GumboOptions with the max_errors value we want. This ought to be the way to go because it will stop populating not only our Ruby array, but Gumbo’s internal vector too.

Disclaimer: This pull request is my first ever real-life C code. I copied the code used to initiate the GumboOptions from google/gumbo-parser#393 (comment). Compiler then complained that I had to “take the address with &” to reference options, which I did. I can’t tell if I left something ugly like a memory leak behind…

@stevecheckoway
Copy link
Collaborator

At a quick glance, this seems fine.

@rubys rubys merged commit 938dd3f into rubys:master Jan 25, 2018
@rubys
Copy link
Owner

rubys commented Jan 25, 2018

Oh, dear. I'm seeing the following when I run the tests:

# Running:

....F.............E.....

Finished in 0.015147s, 1584.4722 runs/s, 2112.6296 assertions/s.

  1) Failure:
TestNokogumbo#test_parse_fragment_errors [test-nokogumbo.rb:151]:
Expected nil (NilClass) to respond to #empty?.


  2) Error:
TestNokogumbo#test_fragment_default_max_parse_errors:
NoMethodError: undefined method `length' for nil:NilClass
    test-nokogumbo.rb:163:in `test_fragment_default_max_parse_errors'

24 runs, 32 assertions, 1 failures, 1 errors, 0 skips

Let's see what Travis has to say.

@rubys rubys mentioned this pull request Feb 10, 2018
@rafbm rafbm deleted the max-parse-errors branch March 16, 2018 19:57
@craigbarnes
Copy link
Contributor

craigbarnes commented Apr 19, 2018

I copied the code used to initiate the GumboOptions from google/gumbo-parser#393. Compiler then complained that I had to “take the address with &” to reference options, which I did. I can’t tell if I left something ugly like a memory leak behind…

There's no need for the memcpy. You can just use:

GumboOptions options = kGumboDefaultOptions;
options.max_errors = 0;
// ...
GumboOutput *output = gumbo_parse_with_options(&options, input, input_len);

@stevecheckoway
Copy link
Collaborator

@craigbarnes Good point! I removed the memcpy.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants