A versatile tool to crawl dozens of URLs from a given source, like a sitemap or a URL list.
Useful for:
- Warming site caches
- Checking response times
- Identifying dead or broken pages
Build from source:
# Linux (Debian/Ubuntu) & macOS
$ go build -o crab cmd/crab/main.go
# Windows
$ go build -o crab.exe cmd/crab/main.go
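After building, you can do a quick sanity check (assuming a Unix-like shell and the binary name used above):
# Print the available commands and flags
$ ./crab --help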
Alternatively, pull the Docker image:
$ docker pull atomicptr/crab
# Examples
$ docker run --rm atomicptr/crab --help
$ docker run --rm atomicptr/crab crawl:sitemap https://domain.com/sitemap.xml
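To keep files written by the output flags documented further down when running via Docker, mount a local directory into the container (a sketch, assuming --output-file behaves the same inside the container):
# Results land in ./output on the host
$ docker run --rm -v "$(pwd)/output:/output" atomicptr/crab crawl:sitemap https://domain.com/sitemap.xml --output-file /output/output.txt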
Crab is not available in nixpkgs, but I have my own Nix repository which you can use:
let
  atomicptr = import (fetchTarball "https://github.com/atomicptr/nix/archive/refs/heads/master.tar.gz") {};
in
{
  environment.systemPackages = with pkgs; [
    atomicptr.crab
  ];
}
Via Homebrew:
$ brew install atomicptr/tools/crab
Via Scoop:
$ scoop bucket add atomicptr https://github.com/atomicptr/scoop-bucket
$ scoop install crab
Crawl one or more individual URLs:
$ crab crawl https://domain.com https://domain.com/test
{"status": 200, "url": "https://domain.com", ...}
...
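Each result is printed as a JSON object, so the output can be piped into standard tooling. For example, to list only the URLs that did not answer with 200 (a sketch, assuming jq is installed and the status/url fields shown above):
$ crab crawl https://domain.com https://domain.com/test | jq -r 'select(.status != 200) | .url'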
Crawl through a sitemap:
$ crab crawl:sitemap https://domain.com/sitemap.xml
Crawl the sitemap's URLs against a different base URL (e.g. a staging copy of the site):
$ crab crawl:sitemap https://domain.com/sitemap.xml --prefix-url=https://staging.domain.com
Add some cookies/headers:
$ crab crawl:sitemap https://domain.com/sitemap.xml --cookie auth_token=12345 --header X-Bypass-Cache=1
You can filter the output by its status code:
# This will only return responses with a 200 OK
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=200
# This will only return responses that are not OK
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status='!200'
# This will only return responses between 500-599 (range)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=500-599
# This will only return responses with status 200 or 404 (multiple values act as an OR: a response matching any of them is returned)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=200,404
# This will only return responses with a status code greater than 500 (quoted so the shell does not treat > as a redirect)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status='>500'
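The status filters can be combined with the output options described below, for example to collect every broken page into a file (a sketch that simply combines the flags documented in this README):
# Write all URLs that did not return 200 OK to a file
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status='!200' --output-file ./broken-pages.txt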
You can save the URL list to a file:
# This will save the output to a file called output.txt
$ crab crawl:sitemap https://domain.com/sitemap.xml --output-file ./output/output.txt
You can save the output to a JSON file:
# This will save the output to a file called output.json
$ crab crawl:sitemap https://domain.com/sitemap.xml --output-json ./output/output.json
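The JSON file is handy for post-processing in your own tooling. Below is a minimal Go sketch that reads it and prints every non-200 result; it assumes the file contains a stream of objects shaped like the stdout example above (status and url fields), so adjust the struct and path if your output differs:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "os"
)

// result mirrors the fields shown in the example output above;
// any additional fields in the file are simply ignored by the decoder.
type result struct {
    Status int    `json:"status"`
    URL    string `json:"url"`
}

func main() {
    // Path taken from the --output-json example above.
    f, err := os.Open("./output/output.json")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Decode the stream of JSON objects and report everything that is not 200 OK.
    dec := json.NewDecoder(f)
    for {
        var r result
        if err := dec.Decode(&r); err == io.EOF {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        if r.Status != 200 {
            fmt.Println(r.Status, r.URL)
        }
    }
}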