Collectress is a Python tool for downloading web data feeds periodically and consistently. The feeds to download are specified in a YAML feed file. Downloaded content is stored in a separate directory per feed, organized into subdirectories named by the current date.
- Downloads content from multiple feeds specified in a YAML file
- Creates a directory for each feed
- Stores content in a date-structured directory layout (YYYY/MM/DD)
- Handles errors gracefully, continuing even if a single operation fails
- Accepts command-line arguments for input, output, and cache
- Optimises downloads through an eTag cache
- Logs a comprehensive JSON-formatted activity summary per script run
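The per-feed, date-structured layout above can be illustrated with a short sketch. This is not the actual Collectress implementation; the function name and the feed name `example_feed` are hypothetical, assuming only the YYYY/MM/DD convention described in the feature list:

```python
from datetime import date
from pathlib import Path

def feed_output_dir(workdir, feed_name, day):
    """Build a per-feed, date-structured output path (YYYY/MM/DD)."""
    return Path(workdir) / feed_name / f"{day.year:04d}" / f"{day.month:02d}" / f"{day.day:02d}"

# Content downloaded for "example_feed" on 2024-05-01 would land in
# data_feeds/example_feed/2024/05/01
print(feed_output_dir("data_feeds", "example_feed", date(2024, 5, 1)))
```

Zero-padding the month and day keeps directories sorted chronologically by a plain lexicographic listing.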
Collectress can be run from the command line as follows (a log.json will be created upon execution):

```
python collectress.py -f data_feeds.yml -w data_feeds/ -e etag_cache.json
```
Parameters:

```
-h, --help            show this help message and exit
-e ECACHE, --ecache ECACHE
                      eTag cache for optimizing downloads
-f FEED, --feed FEED  YAML file containing the feeds
-w WORKDIR, --workdir WORKDIR
                      The root of the output directory
```
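The eTag cache avoids re-downloading feeds whose content has not changed, using standard HTTP conditional requests: the client sends the last known eTag in an `If-None-Match` header, and the server replies `304 Not Modified` if the content is unchanged. The sketch below shows the mechanism only; the function names and cache shape are assumptions, not the actual Collectress code:

```python
def request_headers(url, etag_cache):
    # If we have seen this URL before, send its eTag so the server can
    # answer "304 Not Modified" instead of resending the full content.
    etag = etag_cache.get(url)
    return {"If-None-Match": etag} if etag else {}

def should_write(url, status, response_etag, etag_cache):
    # Return True when the response body should be written to disk.
    if status == 304:
        return False  # cached copy is still current; nothing to store
    if response_etag:
        etag_cache[url] = response_etag  # remember the eTag for next run
    return True

cache = {}
# First run: no eTag known, full download, eTag recorded.
assert request_headers("https://example.com/feed", cache) == {}
assert should_write("https://example.com/feed", 200, '"abc123"', cache)
# Second run: eTag sent, server answers 304, download is skipped.
assert request_headers("https://example.com/feed", cache) == {"If-None-Match": '"abc123"'}
assert not should_write("https://example.com/feed", 304, None, cache)
```

Persisting the cache dictionary between runs (e.g. as JSON, matching the `etag_cache.json` argument above) is what makes the optimisation effective across periodic executions.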
Collectress can also be used through its Docker image:

```
docker run --rm \
  -e TZ=$(readlink /etc/localtime | sed -e 's,/usr/share/zoneinfo/,,') \
  -v ${PWD}/data_feeds.yml:/collectress/data_feeds.yml \
  -v ${PWD}/log.json:/collectress/log.json \
  -v ${PWD}/etag_cache.json:/collectress/etag_cache.json \
  -v ${PWD}/data_output:/data \
  ghcr.io/stratosphereips/collectress:main \
  python collectress.py -f data_feeds.yml -e etag_cache.json -w /data
```
This tool was developed at the Stratosphere Laboratory at the Czech Technical University in Prague.