Lightweight Python library for scraping websites with LLMs. You can test it on the Parsera website.
Because it's simple and lightweight, with an interface as simple as:
scraper = Parsera()
result = scraper.run(url=url, elements=elements)
- Installation
- Documentation
- Basic usage
- Running with Jupyter Notebook
- Running with CLI
- Running in Docker
pip install parsera
playwright install
Check out the documentation to learn more about other features, like running custom models and playwright scripts.
First, set up the PARSERA_API_KEY environment variable (if you want to run a custom LLM, see Custom Models).
You can do this from Python with:
import os
os.environ["PARSERA_API_KEY"] = "YOUR_PARSERA_API_KEY_HERE"
Next, you can run a basic version:
from parsera import Parsera
url = "https://news.ycombinator.com/"
elements = {
"Title": "News title",
"Points": "Number of points",
"Comments": "Number of comments",
}
scraper = Parsera()
result = scraper.run(url=url, elements=elements)
The result variable will contain JSON with a list of records:
[
{
"Title":"Hacking the largest airline and hotel rewards platform (2023)",
"Points":"104",
"Comments":"24"
},
...
]
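Assuming run hands these records back as a plain Python list of dictionaries, as the JSON above suggests, you can process them with ordinary Python; a small illustrative snippet:
import json

# each record is a dict whose keys match the `elements` mapping
for record in result:
    print(f"{record['Title']} ({record['Points']} points, {record['Comments']} comments)")

# optionally persist the records to a file
with open("results.json", "w") as f:
    json.dump(result, f, indent=2)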
There is also an async arun method available:
result = await scraper.arun(url=url, elements=elements)
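In a plain script, with no event loop already running, you can drive arun with asyncio.run; a minimal sketch reusing the url and elements from the basic usage example:
import asyncio

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
}

async def main():
    scraper = Parsera()
    # arun is the async counterpart of run
    return await scraper.arun(url=url, elements=elements)

result = asyncio.run(main())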
If you are running Parsera inside a Jupyter Notebook, which already runs its own event loop, either place this code at the beginning of your notebook:
import nest_asyncio
nest_asyncio.apply()
Or, instead of calling the run method, use the async arun method.
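Put together, a notebook cell using the nest_asyncio route could look like this (a minimal sketch reusing the basic usage example):
import nest_asyncio

nest_asyncio.apply()  # let run() nest into the notebook's already-running event loop

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {"Title": "News title"}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)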
Before you run Parsera as a command line tool, don't forget to put your OPENAI_API_KEY in env variables or an .env file.
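For example, an .env file in your working directory could look like this (the key value is a placeholder):
OPENAI_API_KEY=YOUR_OPENAI_API_KEY_HERE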
You can configure the elements to parse using a JSON string or a FILE.
Optionally, you can provide a FILE to write the output to, and the number of SCROLLS you want to perform on the page:
python -m parsera.main URL {--scheme '{"title":"h1"}' | --file FILENAME} [--scrolls SCROLLS] [--output FILENAME]
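For example, an illustrative invocation that extracts news titles from Hacker News, scrolls the page twice, and writes the result to result.json (the scheme and file name here are placeholders, not fixed values):
python -m parsera.main "https://news.ycombinator.com/" --scheme '{"Title": "News title"}' --scrolls 2 --output result.json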
In case of issues with your local environment, you can run Parsera with Docker; see the documentation.