
Warn on failed requests #74

Merged
merged 12 commits into master from handle-failed-requests on Apr 15, 2022
Conversation

@elacuesta (Member) commented Mar 26, 2022

Closes #15

In #15, a TimeoutError causes the following:

  • This line fails (here, to be precise), so the except block closes the page that handled the request, to avoid leaving unused pages around consuming memory
  • Then this line fails with the "Target page, context or browser has been closed" message, because the page was indeed closed

The first exception can be handled with a request errback, or with a spider middleware that implements process_spider_exception, so recovery measures can be taken. The only way I can think of to avoid the second one is to not close the page on failure; I don't think that's a good idea (closing the page avoids having unclosed pages floating around consuming memory), but I'm open to being proven wrong. As things stand, the second message is just confusing for users, so let's catch the exception and log a message instead.

This patch is mostly to warn in the logs instead of failing loudly, as the error can be handled with Scrapy's existing API (errbacks, spider middleware's process_spider_exception).

A sample spider:

import scrapy

class ErrbackSpider(scrapy.Spider):
    name = "error"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 1,  # short time, will cause a TimeoutError
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={"playwright": True},
            errback=self.errback,
        )

    async def errback(self, failure):
        print("Request:", failure.request)
        print("Exception class:", type(failure.value))
        print("Exception message:", failure.value)

The resulting log output:

2022-03-29 18:34:33 [scrapy.core.engine] INFO: Spider opened
2022-03-29 18:34:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-29 18:34:35 [scrapy-playwright] INFO: Launching browser
2022-03-29 18:34:35 [scrapy-playwright] INFO: Browser chromium launched
2022-03-29 18:34:39 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://example.org> (<class 'playwright._impl._api_types.TimeoutError'>)
2022-03-29 18:34:39 [scrapy-playwright] WARNING: <Request url='https://example.org/' method='GET'>: failed processing Playwright request (Target page, context or browser has been closed)
Request: <GET https://example.org>
Exception class: <class 'playwright._impl._api_types.TimeoutError'>
Exception message: Timeout 1ms exceeded.
=========================== logs ===========================
navigating to "https://example.org/", waiting until "load"
============================================================
2022-03-29 18:34:39 [scrapy.core.engine] INFO: Closing spider (finished)
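The spider above uses an errback; the spider-middleware alternative mentioned earlier could be sketched as below. The middleware name is an assumption, and the exception is matched by class name only to avoid a hard playwright import in this sketch (a real project would import TimeoutError from playwright.async_api and use isinstance):

```python
class HandlePlaywrightTimeoutMiddleware:
    """Hypothetical spider middleware that handles Playwright timeouts."""

    def process_spider_exception(self, response, exception, spider):
        # Matching on the class name keeps this sketch free of a
        # playwright dependency; prefer isinstance in real code.
        if type(exception).__name__ == "TimeoutError":
            spider.logger.warning(
                "Playwright timeout for %s: %s", response.url, exception
            )
            return []  # returning an iterable tells Scrapy the exception was handled
        return None  # let other middlewares (or the errback) see it
```

It would be enabled through the SPIDER_MIDDLEWARES setting, e.g. {"myproject.middlewares.HandlePlaywrightTimeoutMiddleware": 550} (the module path and priority are placeholders).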

codecov bot commented Mar 26, 2022

Codecov Report

Merging #74 (8615eeb) into master (8837603) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 8615eeb differs from pull request most recent head 97d347b. Consider uploading reports for the commit 97d347b to get more accurate results

@@            Coverage Diff            @@
##            master       #74   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          271       280    +9     
=========================================
+ Hits           271       280    +9     
Impacted Files                 Coverage   Δ
scrapy_playwright/handler.py   100.00% <100.00%>  (ø)


@elacuesta elacuesta marked this pull request as ready for review March 27, 2022 06:49
@elacuesta elacuesta changed the title Handle failed requests Warn on failed requests Apr 15, 2022
@elacuesta elacuesta merged commit f75d48d into master Apr 15, 2022
@elacuesta elacuesta deleted the handle-failed-requests branch April 15, 2022 19:56
Linked issue (closed by this pull request): Many errors with broad crawl