About

Collectress is a Python tool for downloading web data feeds periodically and consistently. The feeds to download are specified in a YAML feed file. Downloaded data is stored in a separate directory for each feed, inside subdirectories named after the current date.
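
The storage layout described above can be sketched in a few lines of Python. This is an illustration only, not the code Collectress uses: the helper name `feed_output_dir` and the feed name `example_feed` are hypothetical, and the nesting order (feed directory first, then YYYY/MM/DD) is an assumption based on the description.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def feed_output_dir(workdir: str, feed_name: str, day: Optional[date] = None) -> Path:
    """Build <workdir>/<feed_name>/YYYY/MM/DD (hypothetical helper, assumed layout)."""
    day = day or date.today()  # default to the current date
    return Path(workdir) / feed_name / day.strftime("%Y/%m/%d")


# Example with a fixed date so the result is reproducible
target = feed_output_dir("data_feeds", "example_feed", date(2024, 1, 5))
target.mkdir(parents=True, exist_ok=True)  # creating an existing directory is not an error
print(target.as_posix())
```

Using `exist_ok=True` means repeated runs on the same day reuse the same directory instead of failing.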

Features

Downloads content from multiple feeds specified in a YAML file
Creates a directory for each feed
Content stored in a date-structured directory format (YYYY/MM/DD)
Handles errors gracefully, allowing the tool to continue even if a single operation fails
Command-line arguments for input, output, and cache
Download optimisation through an eTag cache
Logs a comprehensive JSON-formatted activity summary per script run
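
As a rough illustration of how three of these features could fit together (eTag-based skipping, continuing after a failed feed, and a JSON summary), here is a minimal sketch. It is not the Collectress implementation: `run_feeds`, `http_fetch`, the summary keys, and the shape of the cache (a URL-to-eTag mapping) are all assumptions made for the example.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher signature: (url, cached_etag) -> (status, body, new_etag)
Fetcher = Callable[[str, Optional[str]], Tuple[int, bytes, Optional[str]]]


def http_fetch(url: str, etag: Optional[str]) -> Tuple[int, bytes, Optional[str]]:
    """Conditional GET: send If-None-Match when an eTag is cached."""
    headers = {"If-None-Match": etag} if etag else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports "Not Modified" as an HTTPError
            return 304, b"", etag
        raise


def run_feeds(feeds: Dict[str, str], etag_cache: Dict[str, str], fetch: Fetcher) -> dict:
    """Process every feed; one failing feed does not stop the others."""
    summary = {"downloaded": [], "not_modified": [], "failed": {}}
    for name, url in feeds.items():
        try:
            status, body, new_etag = fetch(url, etag_cache.get(url))
            if status == 304:
                summary["not_modified"].append(name)
                continue
            # A real tool would write `body` to the feed's dated directory here.
            if new_etag:
                etag_cache[url] = new_etag
            summary["downloaded"].append(name)
        except Exception as exc:  # record the error and move on to the next feed
            summary["failed"][name] = str(exc)
    return summary


def demo_fetch(url: str, etag: Optional[str]) -> Tuple[int, bytes, Optional[str]]:
    """Offline stand-in for http_fetch so the example runs without a network."""
    if "broken" in url:
        raise RuntimeError("connection refused")
    return (304, b"", etag) if etag else (200, b"payload", "etag-1")


cache = {"http://example.invalid/same": "etag-1"}
feeds = {
    "new": "http://example.invalid/new",
    "same": "http://example.invalid/same",
    "broken": "http://example.invalid/broken",
}
print(json.dumps(run_feeds(feeds, cache, demo_fetch), indent=2))
```

Passing the fetcher in as a parameter keeps the loop testable offline; swapping `demo_fetch` for `http_fetch` would perform real conditional requests.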

Usage

Collectress can be run from the command line as follows (a log.json file is created upon execution):

python collectress.py -f data_feeds.yml -w data_feeds/ -e etag_cache.json

Parameters:

  -h, --help            show this help message and exit
  -e ECACHE, --ecache ECACHE
                        eTag cache for optimizing downloads
  -f FEED, --feed FEED  YAML file containing the feeds
  -w WORKDIR, --workdir WORKDIR
                        The root of the output directory
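
For readers who want to see how such an interface is typically declared, the help text above can be reproduced with `argparse`. This is a sketch reconstructed from the help output, not the parser in collectress.py: the function name `build_parser` is hypothetical, and whether any argument is required or has a default value is not stated in the help text, so none is assumed here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three options listed in the help output (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="collectress.py")
    parser.add_argument("-e", "--ecache", help="eTag cache for optimizing downloads")
    parser.add_argument("-f", "--feed", help="YAML file containing the feeds")
    parser.add_argument("-w", "--workdir", help="The root of the output directory")
    return parser


# Parse the same arguments as the usage example above
args = build_parser().parse_args(
    ["-f", "data_feeds.yml", "-w", "data_feeds/", "-e", "etag_cache.json"]
)
print(args.feed, args.workdir, args.ecache)
```

The uppercase placeholders in the help output (ECACHE, FEED, WORKDIR) are what `argparse` derives automatically from the long option names.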

Usage Docker

Collectress can be used through its Docker image:

docker run --rm \
           -e TZ="$(readlink /etc/localtime | sed -e 's,/usr/share/zoneinfo/,,')" \
           -v "${PWD}/data_feeds.yml:/collectress/data_feeds.yml" \
           -v "${PWD}/log.json:/collectress/log.json" \
           -v "${PWD}/etag_cache.json:/collectress/etag_cache.json" \
           -v "${PWD}/data_output:/data" \
           ghcr.io/stratosphereips/collectress:main \
           python collectress.py -f data_feeds.yml -e etag_cache.json -w /data

About

This tool was developed at the Stratosphere Laboratory at the Czech Technical University in Prague.
