This is a simple package for archiving web pages (HTML) to S3. It acts as a cache, serving the S3 version of the page if it exists. If not, it writes a version to S3.

wharton/S3WebCache

S3 Web Cache

This is a simple package for archiving web pages (HTML) to S3. It acts as a cache, returning the S3 version of the page if it exists. If not, it fetches the URL through Requests and archives it in S3.

Our use case: provide a reusable history of the pages included in a web scrape, that is, an archived version of a particular URL at a moment in time. Because the web is always changing, this lets different research questions be asked at a later date without losing the original content. Please only use it in this manner if you have obtained permission for the pages you are requesting.
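
The cache behavior described above (serve the archived copy if it exists, otherwise fetch and archive) can be sketched without any AWS dependency. This is only an illustration of the pattern, not the package's implementation: the dict stands in for the S3 bucket, and cached_get and fake_fetch are hypothetical names introduced here.

```python
from typing import Callable, Dict, NamedTuple


class Request(NamedTuple):
    # Mirrors the (success, message) shape documented for .get(url).
    success: bool
    message: str


def cached_get(url: str, store: Dict[str, str],
               fetch: Callable[[str], str]) -> Request:
    """Return the archived copy of url if present, else fetch and archive it."""
    if url in store:
        # Cache hit: serve the archived version without touching the network.
        return Request(True, store[url])
    try:
        html = fetch(url)
    except Exception as exc:
        # Nothing is archived when the fetch fails.
        return Request(False, str(exc))
    store[url] = html
    return Request(True, html)


# Hypothetical stand-in for an HTTP fetch; records every call it receives.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>snapshot</html>"


archive: Dict[str, str] = {}  # stands in for the S3 bucket
first = cached_get("https://example.com/page", archive, fake_fetch)
second = cached_get("https://example.com/page", archive, fake_fetch)
```

The second call is answered from the archive, so the fetch function runs only once and both calls return the same snapshot.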

Quickstart

Install

pip install s3webcache

Usage

from s3webcache import S3WebCache

# Replace the placeholder strings with your own values.
s3wc = S3WebCache(
    bucket_name="<BUCKET>",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_key="<AWS_SECRET_ACCESS_KEY>",
    aws_default_region="<AWS_DEFAULT_REGION>")

# Returns the archived copy from S3 if present; otherwise fetches and archives the page.
request = s3wc.get("https://en.wikipedia.org/wiki/Whole_Earth_Catalog")

if request.success:
    html = request.message

If the required AWS credentials are not given, it will fall back to using environment variables.
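
As a rough sketch of that fallback, the helper below prefers an explicitly passed value and otherwise looks one up in the environment. The README does not say which variable names the package reads; the standard AWS names used here (such as AWS_DEFAULT_REGION) are an assumption, and resolve_setting is a hypothetical helper, not part of the package.

```python
import os
from typing import Mapping, Optional


def resolve_setting(explicit: Optional[str], env_name: str,
                    environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the explicitly passed value; otherwise look up env_name."""
    if explicit is not None:
        return explicit
    return environ.get(env_name)


# A fake environment keeps the example independent of the real shell.
# The variable names are assumed, not confirmed by the README.
fake_env = {"AWS_DEFAULT_REGION": "us-east-1"}

region = resolve_setting(None, "AWS_DEFAULT_REGION", fake_env)
override = resolve_setting("eu-west-1", "AWS_DEFAULT_REGION", fake_env)
missing = resolve_setting(None, "AWS_ACCESS_KEY_ID", fake_env)
```

Here region comes from the environment, override keeps the explicit argument, and missing is None because neither source supplies a value.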

The .get(url) operation returns a namedtuple Request: (success: bool, message: str).

For successful operations, .message contains the data retrieved for the URL. For unsuccessful operations, .message contains error information.
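
Because .message carries either page content or error text, callers should check .success before using it. A minimal sketch of that pattern follows; the Request namedtuple is a local mirror of the documented shape, and describe is a hypothetical helper introduced for illustration.

```python
from collections import namedtuple

# Local mirror of the documented (success, message) shape, for illustration only.
Request = namedtuple("Request", ["success", "message"])


def describe(result):
    """Branch on .success before interpreting .message."""
    if result.success:
        return f"fetched {len(result.message)} characters"
    return f"failed: {result.message}"


ok = describe(Request(True, "<html></html>"))
bad = describe(Request(False, "HTTP 404"))

# A namedtuple also unpacks positionally, like a plain tuple.
success, message = Request(True, "<html></html>")
```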

Options

S3WebCache() takes the following arguments with these defaults:

  • bucket_name: str
  • path_prefix: str = None
    Subdirectories to store URLs. path_prefix='ht' will start archiving at path s3://BUCKETNAME/ht/
  • aws_access_key_id: str = None
  • aws_secret_key: str = None
  • aws_default_region: str = None
  • trim_website: bool = False
    Trim out the hostname. Defaults to storing the hostname as dot replaced underscores. https://github.com/wharton/S3WebCache would be s3://BUCKETNAME/github.com/wharton/S3WebCache.
    Set this to True and it will be stored as s3://BUCKETNAME/wharton/S3WebCache.
  • allow_forwarding: bool = True
    Follow HTTP 3xx redirects.
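
To make the effect of path_prefix and trim_website concrete, here is a small sketch that reproduces the example paths given in the list above. It is not the package's own code: example_key is a hypothetical helper, and it keeps the hostname verbatim as the example paths show, so any hostname transformation the package applies is not modeled.

```python
from typing import Optional
from urllib.parse import urlsplit


def example_key(url: str, path_prefix: Optional[str] = None,
                trim_website: bool = False) -> str:
    """Build an object key from a URL, following the documented examples."""
    parts = urlsplit(url)
    segments = []
    if path_prefix:
        # Optional leading "subdirectory" inside the bucket.
        segments.append(path_prefix.strip("/"))
    if not trim_website:
        # Hostname kept verbatim here (an assumption based on the example paths).
        segments.append(parts.netloc)
    segments.append(parts.path.lstrip("/"))
    # Drop empty segments so a URL with no path does not produce a trailing slash.
    return "/".join(s for s in segments if s)


url = "https://github.com/wharton/S3WebCache"
default_key = example_key(url)
trimmed_key = example_key(url, trim_website=True)
prefixed_key = example_key(url, path_prefix="ht")
```

With these inputs, default_key is github.com/wharton/S3WebCache, trimmed_key is wharton/S3WebCache, and prefixed_key is ht/github.com/wharton/S3WebCache, each relative to s3://BUCKETNAME/.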

TODO

  • Add 'update s3 if file is older than...' behavior
  • Add transparent compression by default (gzip, lz4, etc)
  • Add rate limiting

Reference

AWS S3 API documentation

License

MIT

Tests

Tests run through Travis CI.
