Releases: hynky1999/CmonCrawl
1.1.8
1.1.7
What's Changed
- max_requests_per_second cannot be zero otherwise it will break here :… by @spacewaterbear in #100
- Readme consult status page by @hynky1999 in #102
New Contributors
- @spacewaterbear made their first contribution in #100
Full Changelog: 1.1.6...1.1.7
1.1.6
1.1.5
1.1.4
1.1.3
1.1.2
What's Changed
- Ruff Linting and easier development cycle with Makefile by @hynky1999 in #92
- Docs update by @hynky1999 in #93
- Update README.md by @hynky1999 in #94
- 🔥 Removal of cc indexes arg by @hynky1999 in #87
Full Changelog: 1.1.0...1.1.2
1.1.0
Code
- Default throttling for downloaders set to max 300 requests per second.
- Downloader now takes a client for downloading. Two clients currently exist:
  - s3 -> Directly queries the Common Crawl buckets
  - api -> Queries the CommonCrawl API Gateway
- The retry system has been updated to leverage tenacity. Additionally, it now uses random exponential backoff instead of linear random backoff.
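To illustrate the difference between the two backoff strategies mentioned above, here is a minimal standard-library sketch. It is not CmonCrawl's actual retry code: the function names, the `step`, `multiplier`, and `max_wait` parameters, and the `retry` helper are all hypothetical. tenacity offers a comparable strategy as `wait_random_exponential`, though these notes do not say exactly how the library configures it.

```python
import random
import time


def linear_random_backoff(attempt, step=1.0):
    """Hypothetical: random wait whose upper bound grows linearly with the attempt."""
    return random.uniform(0, step * attempt)


def random_exponential_backoff(attempt, multiplier=1.0, max_wait=60.0):
    """Hypothetical: random wait whose upper bound doubles per attempt, capped at max_wait."""
    upper = min(max_wait, multiplier * 2 ** attempt)
    return random.uniform(0, upper)


def retry(func, max_attempts=5, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached, sleeping between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            # Re-raise the last error once the attempt budget is used up.
            if attempt == max_attempts:
                raise
            sleep(random_exponential_backoff(attempt))
```

Randomizing the wait spreads retries from many concurrent downloaders over time, and the exponential bound backs off faster than a linear one when the remote service stays overloaded.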
CLI
- New global parameter --aws_profile for setting the AWS profile to use.
- New parameter --download_method, which can be set for:
  - extract...records --download_method
  - download...html --download_method
In both cases the argument can be set to either s3 or api, which defines how Common Crawl will be accessed when downloading WARC files.
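As a rough illustration of how a flag restricted to s3 or api behaves, the sketch below builds a stand-alone argparse parser. It is not CmonCrawl's real CLI: the program name `example-cli`, the `build_parser` function, and the `None` defaults are assumptions made for the example, and the real subcommands are omitted.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the two flags described in these notes."""
    parser = argparse.ArgumentParser(prog="example-cli")  # name is illustrative only
    parser.add_argument(
        "--aws_profile",
        default=None,  # assumed default; the real one is not stated in the notes
        help="AWS profile to use",
    )
    parser.add_argument(
        "--download_method",
        choices=["s3", "api"],  # the two values named in the release notes
        default=None,  # assumed default; the real one is not stated in the notes
        help="how WARC files are fetched",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--aws_profile", "dev", "--download_method", "s3"])
    print(args.aws_profile, args.download_method)
```

With `choices` set, argparse rejects any value other than s3 or api and exits with a usage error before any download starts.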
1.0.5
What's Changed
- Athena by @hynky1999 in #86
- remove tests fix downloader by @hynky1999 in #88
- Remove release tests by @hynky1999 in #89
Full Changelog: 1.0.4...1.0.5