Scraper

This is a small scraper for producing an offline copy of a website. Rather than inspecting each page's HTML to work out what to crawl next, it reads the provided sitemaps and crawls the individual pages listed in them.
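
As a rough illustration of the sitemap-driven approach, the sketch below pulls page URLs out of a sitemap document using only the standard library. It is not taken from go.py; the function name `extract_page_urls` and the example URLs are hypothetical, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_page_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element found in a sitemap document.

    Hypothetical helper for illustration; the real scraper may differ.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]


# Example sitemap content (made-up URLs).
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/some-page</loc></url>
</urlset>"""

page_urls = extract_page_urls(example)
```

Because the list of pages comes straight from the sitemap, no link discovery in the HTML is needed; any page missing from the sitemap would simply not be crawled.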

By default, output is saved in the /output directory, in a layout that mirrors the structure of the sitemaps that were parsed.

To preserve the URL structure of the crawled site, each page is saved as /some-page/index.html, with its assets stored in /some-page/assets.
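
The mapping from a page URL to its location on disk could look roughly like the sketch below. This is an assumption about the layout described above, not the code in saveFullHtmlPage.py; the helper name `output_paths` and the relative `output` root are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_paths(page_url: str, output_root: str = "output") -> tuple[Path, Path]:
    """Map a page URL to (index.html path, assets directory).

    Hypothetical helper: mirrors the URL path under output_root so that
    /some-page becomes output_root/some-page/index.html.
    """
    # Split the URL path into its non-empty segments.
    segments = [s for s in urlparse(page_url).path.split("/") if s]
    page_dir = Path(output_root, *segments)
    return page_dir / "index.html", page_dir / "assets"


html_path, assets_dir = output_paths("https://example.com/some-page/")
```

With this layout, a request for /some-page/ resolves to the directory, and the web server's default index file setting takes care of the rest.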

If you want to serve the output, e.g. as a failover site, make sure your web server serves index.html by default when a directory is requested. For Apache, this would be:

DirectoryIndex index.html

It's been tested with a Webflow website.

To get started

cp .env.example .env

Change the list of sitemaps to crawl by setting SITEMAP_URLS in .env.

make up
make in
python go.py

Crawled pages should be available in the /output directory.

You should be able to view the site by going to:

http://localhost
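
For reference, reading the SITEMAP_URLS setting from the steps above might be sketched as follows. The README does not state the value's format, so this assumes a comma-separated list; check .env.example for the actual format. The helper name `read_sitemap_urls` and the example URLs are hypothetical.

```python
import os


def read_sitemap_urls(env=None) -> list[str]:
    """Return the sitemap URLs configured in SITEMAP_URLS.

    Assumption: the value is a comma-separated list of URLs.
    Pass a dict as `env` to override os.environ (useful for testing).
    """
    env = os.environ if env is None else env
    raw = env.get("SITEMAP_URLS", "")
    # Drop surrounding whitespace and skip empty entries.
    return [url.strip() for url in raw.split(",") if url.strip()]


sitemaps = read_sitemap_urls(
    {"SITEMAP_URLS": "https://example.com/sitemap.xml, https://example.org/sitemap.xml"}
)
```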

