Allow custom PageMethod callbacks #318

Merged: 7 commits, Nov 6, 2024

jdemaeyer (Contributor) commented Sep 12, 2024

Hi @elacuesta, still loving this library! :)

I often end up handling the Playwright page in my request callback because I need page actions that involve loops or conditionals, which the playwright_page_methods list can't currently express. For example, this "click the 'load more' button while it's visible" logic mixes parsing with response preparation:

import scrapy
from playwright.async_api import expect


class PageActionSpider(scrapy.Spider):
    name = "pageaction"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        load_button = page.locator(".loadMore")
        loading_overlay = page.locator(".loadingOverlay")
        while (await load_button.is_visible()):
            await load_button.click()
            await expect(loading_overlay).to_be_hidden()
        sel = scrapy.Selector(text=await page.content())
        await page.close()
        print(sel.css(".interestingData").getall())

This PR allows PageMethod.method to be a callable instead of a string. The callable is called with the page as its first argument, so all page-related async actions can be handled by the download handler again. I no longer have to close the page myself or build a custom Selector instead of using response.css:

import scrapy
from playwright.async_api import expect
from scrapy_playwright.page import PageMethod


class PageActionSpider(scrapy.Spider):
    name = "pageaction"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod(self.extend_feed),
                ],
            },
        )

    async def extend_feed(self, page):
        load_button = page.locator(".loadMore")
        loading_overlay = page.locator(".loadingOverlay")
        while (await load_button.is_visible()):
            await load_button.click()
            await expect(loading_overlay).to_be_hidden()

    def parse(self, response):
        print(response.css(".interestingData").getall())
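
The dispatch described above (string method names versus callables that receive the page first) could look roughly like the sketch below. This is an illustration under stated assumptions, not the handler code from this PR: PageMethod here is a simplified stand-in for scrapy_playwright.page.PageMethod, FakePage, apply_page_methods, and click_twice are hypothetical names, and forwarding extra positional arguments after the page is an assumption.

```python
import asyncio
from functools import partial


class PageMethod:
    """Simplified stand-in for scrapy_playwright.page.PageMethod (assumption)."""

    def __init__(self, method, *args, **kwargs):
        self.method = method  # str (name of a page method) or a callable
        self.args = args
        self.kwargs = kwargs
        self.result = None


class FakePage:
    """Hypothetical stand-in for a Playwright page so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    async def click(self, selector):
        self.calls.append(("click", selector))

    async def title(self):
        return "Example Domain"


async def apply_page_methods(page, page_methods):
    """Resolve each PageMethod to an awaitable call and run them in order."""
    for pm in page_methods:
        if callable(pm.method):
            # Callable: bind the page as the first positional argument.
            method = partial(pm.method, page)
        else:
            # String: look up the method of that name on the page object.
            method = getattr(page, pm.method)
        pm.result = await method(*pm.args, **pm.kwargs)


async def click_twice(page, selector):
    """Example custom callback containing a loop, as in the PR description."""
    for _ in range(2):
        await page.click(selector)
    return len(page.calls)


async def main():
    page = FakePage()
    methods = [PageMethod("title"), PageMethod(click_twice, ".loadMore")]
    await apply_page_methods(page, methods)
    return page, methods


page, methods = asyncio.run(main())
print(methods[0].result)
print(methods[1].result)
```

Binding the page with functools.partial keeps both branches uniform: either way the loop ends up with something it can call with the stored args and kwargs and then await.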

elacuesta (Member) commented

Amazing, thank you for the contribution @jdemaeyer 😄

I've added a simple test, and I'll mention it in the docs shortly.

elacuesta merged commit 5500a6e into scrapy-plugins:main on Nov 6, 2024
12 checks passed
elacuesta (Member) commented

Thank you @jdemaeyer!

jdemaeyer (Contributor, Author) commented

No, thank you!
