
Waper

Waper is a CLI tool to scrape HTML websites. Here is a simple usage:

waper --seed-links "https://example.com/" --whitelist "https://example.com/.*" --whitelist "https://www.iana.org/domains/example" 

This will scrape "https://example.com/", follow the links that match the whitelist patterns, and save the HTML of each page in an SQLite database named waper_out.sqlite.

Installation

cargo install waper

CLI Usage

A CLI tool to scrape HTML websites

Usage: waper [OPTIONS]
       waper <COMMAND>

Commands:
  scrape      This is also the default command, so it's optional to include in args
  completion  Print shell completion script
  help        Print this message or the help of the given subcommand(s)

Options:
  -w, --whitelist <WHITELIST>
          Whitelist regexes: only these URLs will be scanned, other than seeds
  -b, --blacklist <BLACKLIST>
          Blacklist regexes: these URLs will never be scanned. By default nothing is blacklisted [default: a^]
  -s, --seed-links <SEED_LINKS>
          Links to start with
  -o, --output-file <OUTPUT_FILE>
          Sqlite output file [default: waper_out.sqlite]
  -m, --max-parallel-requests <MAX_PARALLEL_REQUESTS>
          Maximum number of parallel requests [default: 5]
  -i, --include-db-links
          Will also include unprocessed links from `links` table in db if present. Helpful when you want to continue the scraping from a previously unfinished session
  -v, --verbose
          Enable verbose (debug) output
  -h, --help
          Print help
  -V, --version
          Print version
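
These options can be combined. As a rough sketch (the URLs, patterns, and file names below are placeholders, not taken from the project), the following resumes a previous session stored in example.sqlite while limiting parallelism:

waper --seed-links "https://example.com/" \
      --whitelist "https://example.com/.*" \
      --blacklist "https://example.com/private/.*" \
      --output-file example.sqlite \
      --max-parallel-requests 2 \
      --include-db-links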

Querying data

Data is stored in an SQLite database with the schema defined in ./sqls/INIT.sql. There are three tables:

  1. results: Stores the content of all requests for which a response was received
  2. errors: Stores the error message for all cases where the request could not be completed
  3. links: Stores the URLs of both visited and unvisited links

Results can be queried using any SQLite client. Example using the sqlite3 CLI:

$ sqlite3 waper_out.sqlite 'select url, time, length(html) from results'
https://example.com/|2023-05-07 06:47:33|1256
https://www.iana.org/domains/example|2023-05-07 06:47:39|80
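
The html column holds the full page source, so a single page can be dumped back out to a file (example.html is just an arbitrary output name):

$ sqlite3 waper_out.sqlite "select html from results where url = 'https://example.com/'" > example.html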

For prettier output of the same query you can adjust the sqlite3 settings:

$ sqlite3 waper_out.sqlite '.headers on' '.mode column' 'select url, time, length(html) from results'
url                                   time                 length(html)
------------------------------------  -------------------  ------------
https://example.com/                  2023-05-07 06:47:33  1256
https://www.iana.org/domains/example  2023-05-07 06:47:39  80
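
Failed requests are recorded in the errors table. A quick check that avoids assuming its exact column layout (see ./sqls/INIT.sql for the schema):

$ sqlite3 waper_out.sqlite 'select count(*) from errors'
$ sqlite3 waper_out.sqlite 'select * from errors limit 5'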

To quickly search through all the URLs you can use fzf:

sqlite3 waper_out.sqlite 'select url from links' | fzf
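
Because links contains both visited and unvisited URLs while results only contains fetched pages, an anti-join shows what is still pending (this assumes the links table also names its URL column url, as results does; check ./sqls/INIT.sql), which pairs well with --include-db-links when resuming:

sqlite3 waper_out.sqlite 'select url from links where url not in (select url from results)'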

Planned improvements

  • Allow users to specify priorities for URLs, so some URLs can be scraped before others
  • Support complex rate-limits
  • Allow continuation of previously stopped scraping
    • Should continue working on IP roaming (auto-detect and continue)
  • Explicitly handle redirects
  • Allow users to modify parts of the request (like the user-agent)
  • Improve storage efficiency by compressing/de-duping the html
  • Provide more visibility into how many URLs are queued, the rate at which they are being processed, etc.
  • Support JS execution using ... (V8 or WebKit, not many options)

Feedback

If you find any bugs or have any feature suggestions, please file an issue on GitHub.
