docs

18a39c7 · Feb 17, 2025

Name	Last commit message	Last commit date
assets	Update api-banner.png	Feb 17, 2025
source	Merge branch 'main' into 829-languagecountry-selection	Jan 21, 2025
Makefile	add: readthedocs structure	Jan 31, 2024
README.md	run pre commit	Jan 12, 2025
chinese.md	docs: Updated the graph_config in the documentation.	Sep 12, 2024
japanese.md	docs: Updated the graph_config in the documentation.	Sep 12, 2024
korean.md	docs: Updated the graph_config in the documentation.	Sep 12, 2024
make.bat	add: readthedocs structure	Jan 31, 2024
requirements-dev.txt	add read the docs	Jan 17, 2025
requirements.txt	add furo	Jan 17, 2025
russian.md	run pre commit	Jan 12, 2025
turkish.md	chore: made some libs optional	Jan 6, 2025

README.md

---
title: ScrapGraphAI Roadmap
markmap:
  colorFreezeLevel: 2
  maxWidth: 500
---

ScrapGraphAI Roadmap

Short-Term Goals

Improve the documentation (ReadTheDocs)
- Issue #102
Create tutorials for the library

Medium-Term Goals

Node for handling API requests
Make scraping more deterministic
- Create DOM tree of the website
- HTML tag text embeddings with tags metadata
- Study tree forks from root node
- How do we use the tags parameters?
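The DOM tree idea above can be sketched with the Python standard library alone. This is a minimal illustration and not the library's implementation: `DomNode`, `DomTreeBuilder`, `build_dom_tree`, and `root_to_leaf_paths` are hypothetical names, and a production version would more likely rely on a full HTML parser. Each node keeps its tag attributes as metadata, and the root-to-leaf paths give one possible way to study how the tree forks from the root node.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass
class DomNode:
    """Hypothetical tree node: tag name, attributes, direct text, children."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


class DomTreeBuilder(HTMLParser):
    """Builds a DomNode tree while the parser walks the HTML."""

    def __init__(self):
        super().__init__()
        self.root = DomNode("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, dict(attrs))
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            node = self._stack[-1]
            node.text = f"{node.text} {stripped}".strip()


def build_dom_tree(html):
    """Parse an HTML string and return the synthetic 'document' root node."""
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def root_to_leaf_paths(node, prefix=()):
    """List every tag path from the root down to a leaf node."""
    path = prefix + (node.tag,)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


if __name__ == "__main__":
    page = "<html><body><div id='main'><p>Hello</p><a href='/x'>link</a></div></body></html>"
    for path in root_to_leaf_paths(build_dom_tree(page)):
        print(" > ".join(path))
```

The text and attributes stored on each node are the inputs an embedding step would need; which embedding model to use is left open here, as it is in the roadmap.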
Create scraping folder with report
- Folder contains .scrape files, DOM tree files, report
- Report could be an HTML page with scraping speed, costs, LLM info, scraped content and DOM tree visualization
- We can use pyecharts with R-markdown
Scrape multiple pages of the same website
- Create a new node that instantiates multiple graphs at the same time
- Make graphs run in parallel
- Scrape only relevant URLs from user prompt
- Use the multi-dimensional DOM tree of the website for retrieval
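One way to run several graphs at the same time is a thread pool, since scraping is mostly I/O-bound. The sketch below assumes a hypothetical `run_graph` callable that takes a URL and returns the scraped result; it is a stand-in, not an existing function of the library. Failures are captured per URL so that one broken page does not abort the whole batch.

```python
from concurrent.futures import ThreadPoolExecutor


def run_graphs_in_parallel(run_graph, urls, max_workers=4):
    """Run one (hypothetical) scraping graph per URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while scraping it. Input order is preserved.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(run_graph, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep the error next to its URL
                results[url] = exc
    return results


if __name__ == "__main__":
    # Stand-in for a real graph: it only reports the URL length.
    def fake_graph(url):
        return {"url": url, "length": len(url)}

    pages = ["https://example.com/a", "https://example.com/b"]
    for url, result in run_graphs_in_parallel(fake_graph, pages).items():
        print(url, result)
```

Threads fit here because each graph spends most of its time waiting on network or LLM calls; a process pool or `asyncio` would be alternatives if the graphs turn out to be CPU-bound or already asynchronous.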
Crawler graph
- Scrape all the URLs with the same domain in all the pages
- Build many DOM trees and link them together
- Save the multi-dimensional tree in a file
Compare two DOM trees to assess the similarity
- Save the DOM tree of the scraped website in a file, as a cache to compare against the website's future structure
- Create similarity metrics with multiple DOM trees (the overall tree, or only the structure of relevant tags?)
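As a rough illustration of the caching and comparison idea, the sketch below stores a tree as JSON and scores two trees with a multiset Jaccard similarity over their tag paths. The nested `{"tag": ..., "children": [...]}` layout, the function names, and the choice of metric are all assumptions made for this example; the roadmap leaves the actual metric open.

```python
import json
from collections import Counter


def tag_paths(tree, prefix=""):
    """Count every root-to-node tag path in a nested {"tag", "children"} dict."""
    path = f"{prefix}/{tree['tag']}"
    counts = Counter({path: 1})
    for child in tree.get("children", []):
        counts.update(tag_paths(child, path))
    return counts


def tree_similarity(tree_a, tree_b):
    """Multiset Jaccard similarity of tag paths, between 0.0 and 1.0."""
    a, b = tag_paths(tree_a), tag_paths(tree_b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def save_tree(tree, path):
    """Cache a tree on disk as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(tree, fh)


def load_tree(path):
    """Load a previously cached tree."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    cached = {"tag": "html", "children": [
        {"tag": "body", "children": [
            {"tag": "div", "children": [{"tag": "p"}, {"tag": "p"}]}]}]}
    current = {"tag": "html", "children": [
        {"tag": "body", "children": [
            {"tag": "div", "children": [{"tag": "p"}, {"tag": "span"}]}]}]}
    print(f"similarity: {tree_similarity(cached, current):.2f}")
```

A score near 1.0 would suggest the page structure is unchanged and cached scraping logic can be reused, while a low score would signal that the site was redesigned. Restricting `tag_paths` to relevant tags only is one way to explore the second option raised above.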
Nodes for handling authentication
- Use Selenium or Playwright to handle authentication
- Pass the cookies to the other nodes
Nodes that attach to an open browser
- Use Selenium or Playwright to attach to an open browser
- Navigate inside the browser and scrape the content
Nodes for taking screenshots and understanding the page layout
- Use Selenium or Playwright to take screenshots
- Use an LLM to assess whether it is a block-like page, a paragraph-like page, etc.
- Issue #88

Long-Term Goals

Automatic generation of scraping pipelines from a given prompt
Create API for the library
