Requests not being garbage collected #23

xanrag · 2021-08-17T23:13:38Z

I've got a problem where my Request objects are not getting garbage collected but pile up until memory runs out. I've checked this with trackref and objgraph and it looks to me like something in the ScrapyPlaywrightDownloadHandler is keeping a reference to all the Requests? Attached is the objgraph output for the first Request after a few have been handled.

elacuesta · 2021-08-18T18:39:55Z

ScrapyPlaywrightDownloadHandler is very similar to the default HTTPDownloadHandler class regarding the life cycle of Scrapy requests. Requests are kept in memory at least until they are processed by their callbacks (they're available in the response.request attribute).
AFAICT, the only thing that could differ is this additional reference, which I'm almost certain would be purged when the corresponding Page object is closed or reused for another request (because of this reset). In that case, this would be equivalent to keeping pages open for too long, which should be avoided anyway to prevent memory issues.

xanrag · 2021-08-18T18:55:57Z

Hmm, thanks. The number of HtmlResponse objects in memory is as expected, so I can't quite pin down what is happening. I do not have the playwright_include_page meta key set, so page cleanup should take care of itself. I guess I'll have to start over from scratch with an example spider and see if the problem persists. It also looks like just one page is being created and deleted at a time: the page count in the log never goes above 1.

prefs()
Live References

TestonSeSpider                      1   oldest: 678s ago
HtmlResponse                       37   oldest: 116s ago
Request                           275   oldest: 673s ago
Selector                           37   oldest: 116s ago

prefs()
Live References

TestonSeItem                        6   oldest: 0s ago
TestonSeSpider                      1   oldest: 697s ago
HtmlResponse                        6   oldest: 13s ago
Request                           282   oldest: 692s ago
Selector                           32   oldest: 13s ago

prefs()
Live References

TestonSeSpider                      1   oldest: 1004s ago
HtmlResponse                       13   oldest: 40s ago
Request                           407   oldest: 999s ago
Selector                           13   oldest: 40s ago
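For readers unfamiliar with the output above: `prefs()` in the Scrapy telnet console is an alias for `print_live_refs` from `scrapy.utils.trackref`, which counts live instances of tracked classes through weak references. The following is a minimal stdlib-only sketch of that idea, not Scrapy's actual code; `Tracked`, `Request`, and `live_counts` are hypothetical names used only for illustration.

```python
import gc
import time
import weakref
from collections import defaultdict

# class -> {instance: creation timestamp}. Weak keys mean the tracker itself
# never keeps an instance alive.
live_refs = defaultdict(weakref.WeakKeyDictionary)


class Tracked:
    """Hypothetical base class that records every live instance."""

    def __new__(cls, *args, **kwargs):
        obj = object.__new__(cls)
        live_refs[cls][obj] = time.time()
        return obj


class Request(Tracked):
    """Hypothetical stand-in, not scrapy.Request."""

    def __init__(self, url):
        self.url = url


def live_counts():
    """Return the number of live instances per tracked class."""
    return {cls.__name__: len(refs) for cls, refs in live_refs.items()}


# Three requests are held by a list (the "leak"); one is created and dropped.
held = [Request(f"https://example.com/{i}") for i in range(3)]
Request("https://example.com/tmp")
gc.collect()
counts_while_held = live_counts()

# Dropping the last strong references lets the count fall back to zero.
held.clear()
gc.collect()
counts_after_release = live_counts()
```

A count that keeps growing while the oldest entry keeps aging, as with `Request` above, suggests that something other than the tracker still holds strong references.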

elacuesta · 2021-08-19T17:05:10Z

Could you try the code from this commit (c4c0bd6) and see if the reference count drops?

xanrag · 2021-08-19T19:19:16Z

> Could you try the code from this commit (c4c0bd6) and see if the reference count drops?

That cleared it right up! Thank you! I was going mad trying to debug with the garbage collect interface and trackref but didn't get anywhere, at least I learned a bit more about the inner workings of Python.

Live References

TestonSeItem                       24   oldest: 1s ago
TestonSeSpider                      1   oldest: 11s ago
HtmlResponse                        2   oldest: 3s ago
Request                             4   oldest: 5s ago
Selector                            2   oldest: 3s ago

Live References

TestonSeItem                       72   oldest: 15s ago
TestonSeSpider                      1   oldest: 26s ago
HtmlResponse                        8   oldest: 19s ago
Request                            10   oldest: 21s ago
Selector                            8   oldest: 19s ago

Live References

TestonSeSpider                      1   oldest: 37s ago
HtmlResponse                        4   oldest: 8s ago
Request                             5   oldest: 13s ago
Selector                            3   oldest: 8s ago

Live References

TestonSeItem                       24   oldest: 1s ago
TestonSeSpider                      1   oldest: 285s ago
HtmlResponse                        4   oldest: 10s ago
Request                             6   oldest: 14s ago
Selector                            4   oldest: 10s ago

althayr · 2021-08-19T22:07:23Z

@elacuesta Thanks for the fix. I've had the same issue for multiple months now; I made a workaround by killing the whole Scrapy process after 1k pages to avoid OOM. Reading the code, it's not obvious to me why passing a Scrapy request object as an argument to _make_request_handler would make the garbage collector ignore it. Do you have any idea why this happens?

elacuesta · 2021-08-20T15:19:07Z

Glad to know it worked! I'll be merging this update into the main branch then.
Regarding the why, AFAIK reference count is one of the criteria for determining when to GC objects, so the additional reference to the request might be preventing the reference count from reaching zero. My understanding of this part of the language is limited though, so don't take my word for it. Here's an article from the Python Developer’s Guide that explains the process in greater detail.
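
To illustrate the reference-count explanation, here is a minimal sketch of one plausible mechanism, not confirmed in this thread: a handler closure that captures the request stays registered on a long-lived object, so the request's reference count cannot reach zero. `Page`, `Request`, `clear_handlers`, and `make_request_handler` are hypothetical stand-ins, not the scrapy-playwright or Playwright APIs.

```python
import gc
import weakref


class Request:
    """Hypothetical stand-in for a Scrapy request."""

    def __init__(self, url):
        self.url = url


class Page:
    """Hypothetical long-lived object that stores event handlers."""

    def __init__(self):
        self._handlers = []

    def on(self, handler):
        self._handlers.append(handler)

    def clear_handlers(self):
        self._handlers.clear()


def make_request_handler(request):
    # The inner function closes over `request`, so every registered handler
    # holds a strong reference to it.
    def handler(event):
        return f"{event} for {request.url}"

    return handler


page = Page()
request = Request("https://example.com")
probe = weakref.ref(request)  # lets us observe the object without keeping it alive

page.on(make_request_handler(request))
del request  # drop our own reference; the closure still holds one
gc.collect()
alive_while_registered = probe() is not None

page.clear_handlers()  # the handler goes away, and with it the last reference
gc.collect()
alive_after_removal = probe() is not None
```

Under these assumptions the request survives for as long as the page keeps the handler, even though no other code refers to it, which would match a `Request` count that only grows while a page is reused.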

elacuesta · 2021-08-20T21:43:13Z

Released as v0.0.5, thanks for the report!

elacuesta closed this as completed Aug 20, 2021
