
Warn on failed requests #74

Merged
merged 12 commits into master from handle-failed-requests on Apr 15, 2022
Conversation

@elacuesta (Member) commented Mar 26, 2022

Closes #15

In #15, a TimeoutError causes the following:

  • This line fails (here, to be precise), so the except block closes the page that handled the request, to avoid leaving unused pages around consuming memory
  • Then this line fails with the "Target page, context or browser has been closed" message, because the page was indeed closed

The first exception can be handled with a request errback, or with a spider middleware that implements process_spider_exception, so recovery measures can be taken. The only way I can think of to avoid the second one is to not close the page on failure; I don't think that's a good idea (closing the page avoids having unclosed pages floating around consuming memory), but I'm open to being proven wrong. As things stand, the second message is just confusing for users, so let's catch the exception and log a message instead.

This patch is mostly to warn in the logs instead of failing loudly, as the error can be handled with Scrapy's existing API (errbacks, spider middleware's process_spider_exception).

A sample spider:

import scrapy

class ErrbackSpider(scrapy.Spider):
    name = "error"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 1,  # short time, will cause a TimeoutError
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={"playwright": True},
            errback=self.errback,
        )

    async def errback(self, failure):
        print("Request:", failure.request)
        print("Exception class:", type(failure.value))
        print("Exception message:", failure.value)

The resulting log output:

2022-03-29 18:34:33 [scrapy.core.engine] INFO: Spider opened
2022-03-29 18:34:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-29 18:34:35 [scrapy-playwright] INFO: Launching browser
2022-03-29 18:34:35 [scrapy-playwright] INFO: Browser chromium launched
2022-03-29 18:34:39 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://example.org> (<class 'playwright._impl._api_types.TimeoutError'>)
2022-03-29 18:34:39 [scrapy-playwright] WARNING: <Request url='https://example.org/' method='GET'>: failed processing Playwright request (Target page, context or browser has been closed)
Request: <GET https://example.org>
Exception class: <class 'playwright._impl._api_types.TimeoutError'>
Exception message: Timeout 1ms exceeded.
=========================== logs ===========================
navigating to "https://example.org/", waiting until "load"
============================================================
2022-03-29 18:34:39 [scrapy.core.engine] INFO: Closing spider (finished)
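The spider above uses an errback; the spider-middleware alternative mentioned earlier could be sketched as below. The middleware name is an assumption, and the exception is matched by class name only to avoid a hard playwright import in this sketch (a real project would import TimeoutError from playwright.async_api and use isinstance):

```python
class HandlePlaywrightTimeoutMiddleware:
    """Hypothetical spider middleware that handles Playwright timeouts."""

    def process_spider_exception(self, response, exception, spider):
        # Matching on the class name keeps this sketch free of a
        # playwright dependency; prefer isinstance in real code.
        if type(exception).__name__ == "TimeoutError":
            spider.logger.warning(
                "Playwright timeout for %s: %s", response.url, exception
            )
            return []  # returning an iterable tells Scrapy the exception was handled
        return None  # let other middlewares (or the errback) see it
```

It would be enabled through the SPIDER_MIDDLEWARES setting, e.g. {"myproject.middlewares.HandlePlaywrightTimeoutMiddleware": 550} (the module path and priority are placeholders).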

codecov bot commented Mar 26, 2022

Codecov Report

Merging #74 (8615eeb) into master (8837603) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 8615eeb differs from pull request most recent head 97d347b. Consider uploading reports for the commit 97d347b to get more accurate results

@@            Coverage Diff            @@
##            master       #74   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          271       280    +9     
=========================================
+ Hits           271       280    +9     
Impacted Files                 Coverage   Δ
scrapy_playwright/handler.py   100.00% <100.00%>  (ø)


@elacuesta elacuesta marked this pull request as ready for review March 27, 2022 06:49
@elacuesta elacuesta changed the title Handle failed requests Warn on failed requests Apr 15, 2022
@elacuesta elacuesta merged commit f75d48d into master Apr 15, 2022
@elacuesta elacuesta deleted the handle-failed-requests branch April 15, 2022 19:56
Linked issue (closed by this pull request): Many errors with broad crawl