loorisr/patchright-scrape-api

Patchright Scrape API

A simple scraping API based on patchright. It exposes a REST scrape endpoint that returns the content of a page.

It runs in Docker.

It is inspired by the TypeScript version from Firecrawl and is 100% compatible with it. To switch, replace build: apps/playwright-service-ts with image: loorisr/patchright-scrape-api in your docker-compose file.

Features:

  • uses https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python instead of playwright
  • better domain blocking handling
  • better media blocking handling
  • scrape multiple pages in parallel
  • scrape endpoint compatible with Firecrawl API
  • return cleaned html and markdown
  • temporary or persistent context
  • can connect to remote browser via CDP
  • lightweight: 1.21 GB, or 334 MB without Chrome (for remote CDP connection)

Available on Docker Hub: docker pull loorisr/patchright-scrape-api:latest

Env vars

  • DOMAIN_BLOCKED_DOMAINS: list of domains to block. For example ["url1.com", "url2.com"].

  • DOMAIN_BLOCKLIST_URL: URL of a domain blocklist. For example: https://raw.githubusercontent.com/hagezi/dns-blocklists/main/domains/light.txt

    Prefer a small list, because a large one slows down page loading. This light list already has 154,000 entries!

    The best option is to block the domains at the DNS level, for example with the lightweight blocky.

  • DOMAIN_BLOCKLIST_PATH: local path to a domain blocklist. For example blocklist.txt

  • RESOURCES_EXCLUDED: list of resource types to block. Use [] to disable. Default: ['image', 'stylesheet', 'media', 'font', 'other']. See https://playwright.dev/python/docs/api/class-request#request-resource-type

  • PROXY_SERVER: address of the proxy server

  • PROXY_USERNAME: username of the proxy server

  • PROXY_PASSWORD: password of the proxy server

  • PORT: port the app listens on. Default: 3000

  • PERSISTENT_CONTEXT: enables a persistent browser context. If true, a volume must be mounted at /context. Default: False

  • REMOTE_CDP: Address of a remote browser with CDP (Chrome DevTools Protocol). Allows you to connect to a provider or use https://github.com/JacobLinCool/playwright-docker for example.
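
The list-valued variables above can be parsed and applied roughly as follows. This is a minimal sketch and not the project's actual code: the helper names (parse_list, load_settings, should_block) are hypothetical, and treating a blocked domain as also covering its subdomains is an assumption. ast.literal_eval is used so that both the double-quoted example for DOMAIN_BLOCKED_DOMAINS and the single-quoted default for RESOURCES_EXCLUDED parse.

```python
import ast
import os
from urllib.parse import urlsplit

# Default taken from the RESOURCES_EXCLUDED description above.
DEFAULT_EXCLUDED = ["image", "stylesheet", "media", "font", "other"]


def parse_list(raw, default):
    """Parse a list-valued variable such as '["url1.com", "url2.com"]'."""
    if raw is None or raw.strip() == "":
        return list(default)
    value = ast.literal_eval(raw)  # accepts single- and double-quoted items
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"expected a list, got {raw!r}")
    return [str(item) for item in value]


def load_settings(env=os.environ):
    """Collect some of the documented variables into a dict (hypothetical helper)."""
    return {
        "blocked_domains": parse_list(env.get("DOMAIN_BLOCKED_DOMAINS"), []),
        "excluded_resources": parse_list(env.get("RESOURCES_EXCLUDED"), DEFAULT_EXCLUDED),
        "port": int(env.get("PORT", "3000")),
        "persistent_context": env.get("PERSISTENT_CONTEXT", "False").strip().lower() == "true",
        "remote_cdp": env.get("REMOTE_CDP") or None,
    }


def should_block(url, resource_type, settings):
    """Return True if a request should be aborted under the given settings."""
    if resource_type in settings["excluded_resources"]:
        return True
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: blocking "url1.com" also blocks "ads.url1.com".
    return any(host == d or host.endswith("." + d) for d in settings["blocked_domains"])
```

In a Playwright-style route handler, a function like should_block would be called with the request URL and its resource type, aborting the request when it returns True.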

Endpoints

  • /scrape
  • /v1/scrape
    • url: URL to scrape, for example http://www.domain.tld
    • waitFor: time in ms to wait after the page is loaded. Default: 0
    • timeout: time in ms before timeout. Default: 15000
    • headers: specific headers to add. Default: None
    • formats: list of formats to include in the output (markdown, html, rawHtml). Default: ["markdown"]
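
A call to the endpoint with the parameters above might look like the sketch below. It assumes the service listens on http://localhost:3000 (the default PORT) and accepts a POST request with a JSON body, as Firecrawl-compatible services usually do; the README does not state the HTTP method or the response shape, so the helper names (build_scrape_request, scrape) are hypothetical and the response is returned as raw decoded JSON.

```python
import json
import urllib.request


def build_scrape_request(base_url, url, wait_for=0, timeout=15000,
                         headers=None, formats=("markdown",)):
    """Build a POST request for /v1/scrape with the documented parameters."""
    payload = {
        "url": url,
        "waitFor": wait_for,       # ms to wait after the page is loaded
        "timeout": timeout,        # ms before the scrape times out
        "formats": list(formats),  # any of: markdown, html, rawHtml
    }
    if headers:
        payload["headers"] = headers
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the endpoint expects POST with a JSON body
    )


def scrape(base_url, url, **options):
    """Send the request and return the decoded JSON response."""
    request = build_scrape_request(base_url, url, **options)
    # Give the HTTP call a few seconds more than the page timeout (given in ms).
    http_timeout = options.get("timeout", 15000) / 1000 + 5
    with urllib.request.urlopen(request, timeout=http_timeout) as response:
        return json.load(response)
```

For example, scrape("http://localhost:3000", "http://www.domain.tld", formats=("markdown", "html")) would request both formats from a locally running container.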
