Requests not being garbage collected #23

Closed
xanrag opened this issue Aug 17, 2021 · 7 comments

Comments

@xanrag

xanrag commented Aug 17, 2021

I've got a problem where my Request objects are not getting garbage collected but pile up until memory runs out. I've checked this with trackref and objgraph and it looks to me like something in the ScrapyPlaywrightDownloadHandler is keeping a reference to all the Requests? Attached is the objgraph output for the first Request after a few have been handled.

backrefs
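
The kind of check described above (finding out what still holds a reference to an object) can be approximated with the standard library alone. This is only a minimal sketch, not the objgraph or trackref workflow used in the report: `Tracked`, `Holder`, and both helper functions are hypothetical names introduced for illustration.

```python
import gc


class Tracked:
    """Hypothetical stand-in for an object that is suspected to leak."""


class Holder:
    """Hypothetical long-lived object that keeps items in a dict."""

    def __init__(self):
        self.pending = {}


def referrer_type_names(obj):
    # gc.get_referrers lists the GC-tracked objects that refer directly to obj.
    return sorted({type(ref).__name__ for ref in gc.get_referrers(obj)})


def is_held_by(obj, container):
    # True when `container` itself is one of the direct referrers of obj.
    return any(ref is container for ref in gc.get_referrers(obj))
```

Storing a `Tracked` instance in `Holder().pending` and then calling `is_held_by(item, holder.pending)` shows whether that dict is one of the things keeping the item alive. Repeating the lookup on each referrer walks the chain one level at a time.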

@elacuesta
Member

elacuesta commented Aug 18, 2021

ScrapyPlaywrightDownloadHandler is very similar to the default HTTPDownloadHandler class regarding the life cycle of Scrapy requests. Requests are kept in memory at least until they are processed by their callbacks (they're available in the response.request attribute).
AFAICT, the only thing that could differ is this additional reference, which I'm almost certain would be purged when the corresponding Page object is closed or reused for another request (because of this reset). In that case, this would be equivalent to keeping pages open for too long, which should be avoided anyway to prevent memory issues.

@xanrag
Author

xanrag commented Aug 18, 2021

Hmm, thanks. The number of HtmlResponse objects in memory is as expected, so I can't quite pin down what is happening. I do not have the playwright_include_page meta key set, so that should handle itself. I guess I'll have to start over from scratch with an example spider and see if the problem persists. It also looks like just one page is being created and deleted; the page count in the log never goes above 1.

prefs()
Live References

TestonSeSpider                      1   oldest: 678s ago
HtmlResponse                       37   oldest: 116s ago
Request                           275   oldest: 673s ago
Selector                           37   oldest: 116s ago

prefs()
Live References

TestonSeItem                        6   oldest: 0s ago
TestonSeSpider                      1   oldest: 697s ago
HtmlResponse                        6   oldest: 13s ago
Request                           282   oldest: 692s ago
Selector                           32   oldest: 13s ago

prefs()
Live References

TestonSeSpider                      1   oldest: 1004s ago
HtmlResponse                       13   oldest: 40s ago
Request                           407   oldest: 999s ago
Selector                           13   oldest: 40s ago
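
For readers unfamiliar with this kind of report, the sketch below shows one way a "live references" counter can work: each instance is recorded in a weak-keyed registry at creation time, so the registry itself never keeps an object alive. It is an illustration under stated assumptions, not Scrapy's implementation; `live_refs`, `TrackedObject`, `DemoRequest`, `live_count`, and `oldest_age` are hypothetical names.

```python
import time
import weakref
from collections import defaultdict

# Hypothetical registry: class -> {instance: creation timestamp}, held weakly.
live_refs = defaultdict(weakref.WeakKeyDictionary)


class TrackedObject:
    """Hypothetical base class that records each new instance."""

    def __new__(cls, *args, **kwargs):
        obj = super().__new__(cls)
        # The weak key means this entry disappears once obj is collected.
        live_refs[cls][obj] = time.time()
        return obj


class DemoRequest(TrackedObject):
    def __init__(self, url):
        self.url = url


def live_count(cls):
    # Number of instances of cls that are still alive.
    return len(live_refs[cls])


def oldest_age(cls):
    # Seconds since the oldest live instance was created, or None if none exist.
    stamps = list(live_refs[cls].values())
    return time.time() - min(stamps) if stamps else None
```

A count that keeps growing while the oldest age tracks the total run time, as in the Request rows above, suggests that something is still holding strong references to those objects.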

@elacuesta
Member

Could you try the code from this commit (c4c0bd6) and see if the reference count drops?

@xanrag
Author

xanrag commented Aug 19, 2021

> Could you try the code from this commit (c4c0bd6) and see if the reference count drops?

That cleared it right up! Thank you! I was going mad trying to debug with the garbage collector interface and trackref but didn't get anywhere. At least I learned a bit more about the inner workings of Python.

Live References

TestonSeItem                       24   oldest: 1s ago
TestonSeSpider                      1   oldest: 11s ago
HtmlResponse                        2   oldest: 3s ago
Request                             4   oldest: 5s ago
Selector                            2   oldest: 3s ago

Live References

TestonSeItem                       72   oldest: 15s ago
TestonSeSpider                      1   oldest: 26s ago
HtmlResponse                        8   oldest: 19s ago
Request                            10   oldest: 21s ago
Selector                            8   oldest: 19s ago

Live References

TestonSeSpider                      1   oldest: 37s ago
HtmlResponse                        4   oldest: 8s ago
Request                             5   oldest: 13s ago
Selector                            3   oldest: 8s ago

Live References

TestonSeItem                       24   oldest: 1s ago
TestonSeSpider                      1   oldest: 285s ago
HtmlResponse                        4   oldest: 10s ago
Request                             6   oldest: 14s ago
Selector                            4   oldest: 10s ago

@althayr

althayr commented Aug 19, 2021

@elacuesta Thanks for the fix. I've had the same issue for multiple months now; I made a workaround by killing the whole Scrapy process after 1k pages to avoid OOM. Reading the code, it's not obvious to me why passing a Scrapy request object as an argument to _make_request_handler would make the garbage collector ignore it. Do you have any idea why this happens?

@elacuesta
Member

Glad to know it worked! I'll be merging this update into the main branch then.
Regarding the why, AFAIK the reference count is one of the criteria for determining when to GC objects, so the additional reference to the request might be preventing the reference count from reaching zero. My understanding of this part of the language is limited though, so don't take my word for it. Here's an article from the Python Developer’s Guide that explains the process in greater detail.
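
To illustrate the general mechanism, here is a minimal sketch of how a handler built from a request can keep that request alive for as long as the handler is registered on a longer-lived object. It is not the scrapy-playwright code: `FakeRequest`, `LongLivedContext`, and both factory functions are hypothetical, and the assumption is simply that the handler is a closure stored somewhere that outlives the request.

```python
import gc
import weakref


class FakeRequest:
    """Hypothetical stand-in for a Scrapy Request (not the real class)."""

    def __init__(self, url):
        self.url = url


class LongLivedContext:
    """Hypothetical long-lived object that stores registered handlers."""

    def __init__(self):
        self.handlers = []

    def route(self, handler):
        self.handlers.append(handler)


def make_handler_capturing_request(request):
    # The closure references `request`, so it lives as long as the handler does.
    def handler():
        return request.url
    return handler


def make_handler_capturing_fields(url):
    # Only the string is captured; the request object itself is not referenced.
    def handler():
        return url
    return handler


def request_survives(capture_request):
    context = LongLivedContext()
    request = FakeRequest("https://example.org")
    probe = weakref.ref(request)  # lets us observe collection without a strong ref
    if capture_request:
        context.route(make_handler_capturing_request(request))
    else:
        context.route(make_handler_capturing_fields(request.url))
    del request
    gc.collect()
    return probe() is not None
```

In this sketch, `request_survives(True)` reports that the request is still alive after `del request`, because the registered closure still refers to it. `request_survives(False)` reports that it was freed, since only the URL string was handed to the handler.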

@elacuesta
Member

Released as v0.0.5, thanks for the report!
