Collectress (/kəˈlɛktɹɪs/) is a Python tool designed for downloading web data feeds periodically and consistently.

stratosphereips/collectress

Collectress is a Python tool designed for downloading web data feeds periodically and consistently. The feeds to download are specified in a YAML feed file. Each feed gets its own directory, and the downloaded data is stored inside it in subdirectories named after the current date.
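The layout described above can be sketched in a few lines of Python. This is only an illustration, not the code Collectress itself uses: the helper name `dated_feed_dir`, the feed name `example_feed`, and the exact nesting order (feed directory first, then year, month, day) are assumptions made for the example.

```python
from datetime import date
from pathlib import Path


def dated_feed_dir(workdir, feed_name, day=None):
    """Build <workdir>/<feed_name>/YYYY/MM/DD (hypothetical helper).

    The nesting order is an assumption based on the description above.
    """
    day = day or date.today()
    return Path(workdir) / feed_name / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"


# A fixed date keeps the example reproducible; omit it to use today's date.
target = dated_feed_dir("data_feeds", "example_feed", date(2024, 3, 7))
print(target.as_posix())  # data_feeds/example_feed/2024/03/07
```

Calling `target.mkdir(parents=True, exist_ok=True)` would create the directory tree before a download is written into it.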

Features

  • Downloads content from multiple feeds specified in a YAML file
  • Creates a directory for each feed
  • Content stored in a date-structured directory format (YYYY/MM/DD)
  • Handles errors gracefully, allowing the tool to continue even if a single operation fails
  • Command-line arguments for the feed file, the output directory, and the ETag cache
  • Download optimisation through an ETag cache
  • Writes a comprehensive JSON-formatted activity summary for each script run
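
To illustrate how an ETag cache can avoid repeated downloads, here is a minimal sketch of the usual HTTP conditional-request pattern. It is not taken from the Collectress source: the function names, the example URL, and the cache layout (a plain mapping from feed URL to the last seen ETag, which could be serialized to a JSON file) are assumptions.

```python
def conditional_headers(url, cache):
    """Return request headers that ask the server to skip unchanged content."""
    etag = cache.get(url)
    return {"If-None-Match": etag} if etag else {}


def record_response(url, status, response_headers, cache):
    """Update the cache and report whether the response body should be saved.

    A 304 (Not Modified) status means the cached ETag still matches, so
    there is nothing new to store.
    """
    if status == 304:
        return False
    etag = response_headers.get("ETag")
    if etag:
        cache[url] = etag
    return True


# Simulated exchange; no network access is needed for the illustration.
cache = {}
url = "https://example.com/feed.txt"  # hypothetical feed URL

first_headers = conditional_headers(url, cache)           # nothing cached yet
saved_first = record_response(url, 200, {"ETag": '"abc"'}, cache)

second_headers = conditional_headers(url, cache)          # now sends the ETag
saved_second = record_response(url, 304, {}, cache)       # unchanged, skip
```

In a real run, the headers would be passed to whatever HTTP client performs the download, and the cache would be written back to disk at the end of the run.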

Usage

Collectress can be run from the command line as follows (a log.json file is created on each run):

python collectress.py -f data_feeds.yml -w data_feeds/ -e etag_cache.json

Parameters:

  -h, --help            show this help message and exit
  -e ECACHE, --ecache ECACHE
                        eTag cache for optimizing downloads
  -f FEED, --feed FEED  YAML file containing the feeds
  -w WORKDIR, --workdir WORKDIR
                        The root of the output directory
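
For readers who want to see how options like these map onto Python's `argparse`, the following sketch reproduces the flags and help strings listed above. It is an approximation for illustration only: whether the real tool marks any option as required or gives it a default value is not stated here, so none are set.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="collectress.py")
    parser.add_argument("-e", "--ecache",
                        help="eTag cache for optimizing downloads")
    parser.add_argument("-f", "--feed",
                        help="YAML file containing the feeds")
    parser.add_argument("-w", "--workdir",
                        help="The root of the output directory")
    return parser


# Parse the same arguments as the usage example above.
args = build_parser().parse_args(
    ["-f", "data_feeds.yml", "-w", "data_feeds/", "-e", "etag_cache.json"]
)
print(args.feed, args.workdir, args.ecache)
```

The `-h, --help` option does not need to be declared; `argparse` adds it automatically.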

Docker Usage

Collectress can be used through its Docker image:

docker run --rm \
           -e TZ=$(readlink /etc/localtime | sed -e 's,/usr/share/zoneinfo/,,' ) \
           -v ${PWD}/data_feeds.yml:/collectress/data_feeds.yml \
           -v ${PWD}/log.json:/collectress/log.json \
           -v ${PWD}/etag_cache.json:/collectress/etag_cache.json \
           -v ${PWD}/data_output:/data ghcr.io/stratosphereips/collectress:main \
           python collectress.py -f data_feeds.yml -e etag_cache.json -w /data

About

This tool was developed at the Stratosphere Laboratory at the Czech Technical University in Prague.
