The goal is to allow using pyppeteer (a python port of puppeteer) from a scrapy spider. This allows to scrape sites that require JS to function properly and to make the scraper more similar to humans.
Current status is experimental, most likely the library will remain in experimental state, with a proper solution included in Scrapy later (which will be very different). Documentation assumes Scrapy knowledge.
Python 3.6+ is required for PEP 525 (Asynchronous Generators).
The library requires async def parse support in scrapy, until this is merged, please install scrapy from a branch:
pip install git+https://github.com/lopuhin/scrapy.git@async-def-parse
Also this requires a pyppeteer fork for some fixes that are not released yet:
pip install git+https://github.com/lopuhin/pyppeteer.git
Finally, install scrapy-pyppeteer itself:
pip install git+https://github.com/lopuhin/scrapy-pyppeteer.git
At the moment, browser management is implemented as a downloader middleware,
which you need to activate (update DOWNLOADER_MIDDLEWARES
in settings):
DOWNLOADER_MIDDLEWARES = { 'scrapy_pyppeteer.ScrapyPyppeteerDownloaderMiddleware': 1000, }
After that you can use scrapy_pyppeteer.BrowserRequest
, and you'll get
scrapy_pyppeteer.BrowserResponse
in your parse
method.
BrowserResponse
has an empty body, and has an extra attribute
browser_window
which is a pyppeteer.page.Page
instance (a browser tab).
If you used BrowserResponse.empty()
, you'll get an empty tab,
and if you specified a URL, then you'll get a tab where page.goto(url)
has been awaited.
To do anything with response.browser_window
, you need to define your
parse callback as async def
, and use await
syntax.
All actions performed via await
are executed directly, without going
to the scheduler (although in the same global event loop). You can also
yield
items and new requests, which will work normally.
Short example of the parse method (see more self-contained examples in "examples" folder of this repo):
async def parse(self, response: BrowserResponse): page = response.browser_tab yield {'url': response.url} for link in await page.querySelectorAll('a'): url = await page.evaluate('link => link.href', link) yield BrowserRequest(url) await page.close()
PYPPETEER_LAUNCH
: a dict with pyppeteer launch options, seepyppeteer.launch
docstring.
- You need to explicitly close the browser tab once you don't need it (e.g. at the end of the parse method).
- Items yielded from a single parse method are kept in memory while the parse method is running, as well as all local variables (the former is less obvious). Yielding a large number of big items from one parse method can increase the memory usage of your spider. Consider splitting work into several other parse methods.
If you wanted to put a pdb.set_trace()
into the spider parse method
and check results of some manipulations with the page which need to be awaited,
this won't work, because pdb
is blocking the event loop. One way which
works is shown below
(although this does not use the spider or this library at all):
import asyncio import pyppeteer loop = asyncio.new_event_loop() run = loop.run_until_complete # use "run(x)" instead of "await x" here browser = run(pyppeteer.launch(headless=False)) # headless=True to see what is executed page = run(browser.newPage()) run(page.goto('http://books.toscrape.com')) print(len(run(page.xpath('//a[@href]')))) # print number
-- this allows to interact with a page from a REPL and observe effects in the browser window.
- Set response status and headers
- A more ergonomic way to close the tab by default?
- More tests
- A way to schedule interactions reusing the same window (to continue working in the same time but go through the scheduler), making sure one tab is used only by one parse method.
- Nice extraction API (like parsel)
- A way to limit max number of tabs open (a bit tricky)