This is a simple package for archiving web pages (HTML) to S3. It acts as a cache, returning the S3 version of a page if it exists. If not, it fetches the URL through Requests and archives it in S3.
Our use case is to provide a reusable history of pages included in a web scrape: an archived version of a particular URL at a moment in time. Since the web is always changing, different research questions can be asked at a later date without losing the original content. Please only use it in this manner if you have obtained permission for the pages you are requesting.
```bash
pip install s3webcache
```
```python
from s3webcache import S3WebCache

s3wc = S3WebCache(
    bucket_name=<BUCKET>,
    aws_access_key_id=<AWS_ACCESS_KEY_ID>,
    aws_secret_key=<AWS_SECRET_ACCESS_KEY>,
    aws_default_region=<AWS_DEFAULT_REGION>)

request = s3wc.get("https://en.wikipedia.org/wiki/Whole_Earth_Catalog")

if request.success:
    html = request.message
```
If the required AWS credentials are not given, it will fall back to using environment variables.
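For example, with the standard AWS variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) already exported in the shell, the cache can presumably be created from the bucket name alone (a minimal sketch; the bucket name is a placeholder):

```python
from s3webcache import S3WebCache

# Credentials are picked up from the environment rather than passed explicitly.
# "my-archive-bucket" is an illustrative placeholder, not a real bucket.
s3wc = S3WebCache(bucket_name="my-archive-bucket")
```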
The `.get(url)` operation returns a namedtuple `Request: (success: bool, message: str)`. For successful operations, `.message` contains the URL data. For unsuccessful operations, `.message` contains error information.
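A short sketch of handling both outcomes (the URL is illustrative):

```python
result = s3wc.get("https://example.com/some-page")

if result.success:
    html = result.message  # the page HTML, served from S3 or freshly archived
else:
    print(f"Fetch failed: {result.message}")  # error details
```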
`S3WebCache()` takes the following arguments, with these defaults (an example configuration follows the list):

- `bucket_name: str`
- `path_prefix: str = None`. Subdirectories to store URLs under. `path_prefix='ht'` will start archiving at path `s3://BUCKETNAME/ht/`.
- `aws_access_key_id: str = None`
- `aws_secret_key: str = None`
- `aws_default_region: str = None`
- `trim_website: bool = False`. Trim out the hostname. Defaults to storing the hostname (dots replaced with underscores): `https://github.com/wharton/S3WebCache` would be stored as `s3://BUCKETNAME/github.com/wharton/S3WebCache`. Set this to true and it will be stored as `s3://BUCKETNAME/wharton/S3WebCache`.
- `allow_forwarding: bool = True`. Will follow HTTP 3xx redirects.
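For example, a cache that stores pages under a prefix, drops the hostname from stored keys, and follows redirects might be configured like this (a sketch; the bucket name and prefix are placeholders, and credentials are assumed to come from the environment):

```python
from s3webcache import S3WebCache

s3wc = S3WebCache(
    bucket_name="my-archive-bucket",   # placeholder bucket
    path_prefix="scrape-2020",         # pages land under s3://my-archive-bucket/scrape-2020/
    trim_website=True,                 # drop the hostname from the stored key
    allow_forwarding=True,             # follow HTTP 3xx redirects when fetching
)
```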
Planned improvements:

- Add 'update S3 if file is older than...' behavior
- Add transparent compression by default (gzip, lz4, etc)
- Add rate limiting
License: MIT
Tests run through Travis CI.