Add :max_parse_errors argument to .parse, .get and .fragment with 0 as default value #65
Conversation
Default value is 100. This avoids potentially extreme memory usage when parsing a document with a huge number of HTML errors.
This seems reasonable to me. Having a default of 0 also seems reasonable. This would also solve the issue of certain malformed HTML with embedded zeros causing a crash due to a bug in gumbo, at least as long as no errors were requested.
ditto
This makes Gumbo parse error tracking opt-in for users wanting it.
Thanks for the quick feedback guys.
Very good point! I’ve changed the default to 0.
This check will be falsy most of the time now that 0 is the default.
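For readers following along, here is a minimal standalone C sketch of that kind of guard (the function and names are illustrative, not the actual extension code): with `max_errors` set to 0 the check is falsy, so error collection is skipped entirely and a pathological document cannot blow up memory.

```c
#include <stddef.h>

/* Illustrative stand-in for an error-collection guard: when
 * max_errors == 0 the check is falsy and nothing is collected;
 * otherwise collection stops at the cap. */
static size_t collect_errors(size_t total_errors, int max_errors) {
    size_t collected = 0;
    if (max_errors) {                       /* falsy when 0: skip entirely */
        for (size_t i = 0; i < total_errors; i++) {
            if (collected >= (size_t)max_errors)
                break;                      /* cap reached: stop recording */
            collected++;                    /* stand-in for storing an error */
        }
    }
    return collected;
}
```

With a default of 0, the body of the guard simply never runs for ordinary callers.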
If nobody opposes, a new gem release would be great. It can solve memory issues any Nokogumbo user may face today.
Just a small nit: I note that you reordered the check for max_errors first. It seems to me that you could compute the length of the output array once and use it in both places.
I realized Gumbo itself actually supports a `max_errors` option, so I reverted a bunch of the code I did and instead set that option directly.

Disclaimer: this pull request is my first ever real-life C code. I copied the code used to initiate the `GumboOptions` struct.
At a quick glance, this seems fine.
Oh, dear. I'm seeing the following when I run the tests:
Let's see what Travis has to say.
There's no need for the extra code; you can just do:

```c
GumboOptions options = kGumboDefaultOptions;
options.max_errors = 0;
// ...
GumboOutput *output = gumbo_parse_with_options(&options, input, input_len);
```
@craigbarnes Good point! I removed it.
Nokogumbo version 1.4.10 introduced tracking of Gumbo parse errors into an `@errors` array on the document. Since upgrading to this version, we started seeing our servers use crazy amounts of memory and swap, and almost instantly crash on some occasions.

Background: we run an email client. Needless to say, we get our fair share of malformed HTML. The specific case that led me to investigate this memory issue was a newsletter that looked legit but ended with the string `"</body></html></table></div>"` repeated 20,000+ times, with no newline. This means 60,000+ parse errors pushed into `@errors`, which, probably because each error contains a string of the full HTML line where it occurred, caused insane memory usage.

The only solution for us was to lock Nokogumbo to 1.4.9 in our Gemfile: because this part of the code is a C extension, we cannot monkey-patch some Ruby to remove the `@errors` functionality.

If you ask me, I’d say parse error tracking should have been an opt-in feature in the first place. I believe a tiny fraction of Nokogumbo users need it, and for everyone else it’s just useless overhead added to the `parse` method. Yet version 1.4.10 has been out for over a year, and starting to require an explicit argument to get `@errors` tracking would break backward compatibility. Thus, I went with an optional `:max_parse_errors` argument, allowing us to pass `0` to disable the feature. I’ve put a default limit of `100` because I think avoiding crashes like the one explained above, for all Nokogumbo users by default, is a significant improvement. Something I haven’t done is accepting `max_parse_errors: nil` as a way to restore the previous limitless behavior; I wasn’t sure whether this was relevant.

Despite the backward incompatibility it would introduce, it’s worth considering that @jeremy’s pull request on html-proofer (which apparently was the main motivation for releasing the feature in Nokogumbo) is still open, and the Nokogumbo documentation doesn’t mention the error tracking functionality anywhere. So if you guys agree that breaking backward compatibility in this case would be negligible, I think making the feature effectively opt-in would be the best scenario. It would allow regular usage of Nokogumbo with zero performance impact for users not needing the feature. The only change required to my code would be replacing `100` with `0`.

Happy to hear your thoughts.
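If it helps, the default resolution being proposed can be sketched in isolation (the negative sentinel and the function name below are hypothetical, not the actual extension code): an omitted `:max_parse_errors` resolves to 0, making error tracking opt-in, while an explicit value is forwarded as the cap.

```c
/* Hypothetical sketch of the proposed opt-in default: a negative value
 * stands in for "argument omitted"; anything else is the caller's cap,
 * which would be forwarded to Gumbo's max_errors option. */
static int resolve_max_parse_errors(int requested) {
    const int opt_in_default = 0;   /* opt-in: track nothing unless asked */
    return requested < 0 ? opt_in_default : requested;
}
```

Under this scheme, switching the current patch from a default of 100 to a default of 0 is a one-constant change.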