Set playwright_page request meta key early #91

Merged: 1 commit, May 9, 2022
README.md: 6 changes (4 additions, 2 deletions)

@@ -239,8 +239,10 @@ class AwesomeSpiderWithPage(scrapy.Spider):
 * In order to avoid memory issues, it is recommended to manually close the page
   by awaiting the `Page.close` coroutine.
 * Be careful about leaving pages unclosed, as they count towards the limit set by
-  `PLAYWRIGHT_MAX_PAGES_PER_CONTEXT`. It's recommended to set a Request errback to
-  make sure pages are closed even if a request fails.
+  `PLAYWRIGHT_MAX_PAGES_PER_CONTEXT`. When passing `playwright_include_page=True`,
+  it's recommended to set a Request errback to make sure pages are closed even
+  if a request fails (if `playwright_include_page=False` or unset, pages are
+  automatically closed upon encountering an exception).
 * Any network operations resulting from awaiting a coroutine on a `Page` object
   (`goto`, `go_back`, etc) will be executed directly by Playwright, bypassing the
   Scrapy request workflow (Scheduler, Middlewares, etc).
scrapy_playwright/handler.py: 10 changes (5 additions, 5 deletions)

@@ -211,17 +211,17 @@ async def _download_request(self, request: Request, spider: Spider) -> Response:
         return result

     async def _download_request_with_page(self, request: Request, page: Page) -> Response:
+        # set this early to make it available in errbacks even if something fails
+        if request.meta.get("playwright_include_page"):
+            request.meta["playwright_page"] = page
+
         start_time = time()
         response = await page.goto(request.url)

         await self._apply_page_methods(page, request)

         body_str = await page.content()
         request.meta["download_latency"] = time() - start_time

-        if request.meta.get("playwright_include_page"):
-            request.meta["playwright_page"] = page
-        else:
+        if not request.meta.get("playwright_include_page"):
             await page.close()
             self.stats.inc_value("playwright/page_count/closed")
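The effect of moving the assignment before `page.goto` can be shown with a stdlib-only simulation. All names below (`FakePage`, `download_with_page`) are hypothetical stand-ins, not scrapy-playwright APIs; the point is only the ordering: because the meta dict holds the page reference before the failing navigation, errback-style cleanup code can still reach it:

```python
import asyncio


class FakePage:
    """Stand-in for a Playwright Page; only tracks whether it was closed."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


async def download_with_page(meta, page):
    # set early, as in this PR, so cleanup code can reach the page
    if meta.get("playwright_include_page"):
        meta["playwright_page"] = page
    # simulate page.goto() raising on a failed navigation
    raise RuntimeError("navigation failed")


async def main():
    meta = {"playwright_include_page": True}
    page = FakePage()
    try:
        await download_with_page(meta, page)
    except RuntimeError:
        # errback-style cleanup: the page is reachable via request meta
        maybe_page = meta.get("playwright_page")
        if maybe_page is not None:
            await maybe_page.close()
    return page.closed


print(asyncio.run(main()))  # → True
```

With the pre-PR ordering (assignment only after a successful download), `meta.get("playwright_page")` in the `except` branch would return `None` and the page would leak against `PLAYWRIGHT_MAX_PAGES_PER_CONTEXT`.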