-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transformers memory leak? #129
Comments
Thanks for the report! After some initial investigation, it seems like the leak occurs when certain nodes are unlinked, and it appears to be related to Sanitize's traverse method. When I swap that out with Nokogiri's (flawed) traverse method, the leak disappears. There's nothing obvious in Sanitize's traverse method that seems like it could be leaking memory, but it's possible it could be triggering a leak in Nokogiri. I'll need to dig further. |
This triggers a memory leak in Nokogiri. Fixes #129.
@fabn I think I've fixed this. Can you try out the dev-3.1.2 branch (or install this pre-release gem) and let me know if it fixes the leak for you? |
Thanks for your speedy support. I tried the same code with dev-3.1.2 branch and it's much better. To give you some numbers at the same iteration I reported earlier memory usage went down from 813MB (in 3.1.1) to 167MB with 3.1.2. So your change makes a great improvement. However I still see memory growing during time, so there should be some leak in nokogiri itself. I found for instance this report: sparklemotion/nokogiri#1063 |
Looks like Nokogumbo may be the culprit this time. I can reproduce the leak with this loop, which uses Nokogumbo's HTML5 parser in the same way that require 'nokogumbo'
200_000.times do |i|
doc = Nokogiri::HTML5.parse("<html><body>#{raw_markup}")
frag = doc.fragment
frag << doc.xpath('/html/body/node()')
end However, this loop, which uses Nokogiri's libxml2 parser, doesn't create a leak: require 'nokogumbo'
200_000.times do |i|
doc = Nokogiri::HTML.parse("<html><body>#{raw_markup}")
frag = doc.fragment
frag << doc.xpath('/html/body/node()')
end I'll take a look at Nokogumbo and see if I can spot the problem. |
@fabn My C-fu isn't strong enough to figure out where this is happening in Nokogumbo, but I've filed an upstream issue for this: rubys/nokogumbo#20 I'll go ahead and close this bug now since the original transformer-related leak is resolved. We can use the Nokogumbo bug to discuss the second leak. Thanks again for the reports and the helpful test cases! |
Ok, thanks for your help. I'll follow rubys/nokogumbo#20 Just one more question, will you release 3.1.2 or wait for that issue being solved? |
Just released 3.1.2. Enjoy! |
I'm using your gem to fix a very badly formatted markup. I'm running the cleaning process in a sidekiq worker and after a while process memory grows with no limit. I (think I have) tracked down the leak to
Sanitize.fragment
call. I made a simple script which should reproduce the issue.If i run this code on my local machine at iteration 3260 used memory is 813MB and it keeps growing, while other dumped params don't change across iterations. If I use one of the predefined configurations it doesn't happen 1.
Here is my sanitize config
Here are my current bundle (relevant gems)
Full code including the html sample used is available here: https://gist.github.com/fabn/304691f9bd81e1b814cf
(1): Actually memory usage grows also with
RELAXED
config but very very slowly (i.e. after 4000 iterations used memory is ~70MB while value reported at first iteration was 25MB)The text was updated successfully, but these errors were encountered: