
Overview

The S3 Tree Store (S3TS) software is a simple system for managing trees of related files, stored efficiently in Amazon S3. The trees are stored in a structure designed to support differential, incremental uploads and downloads, with compression and data deduplication. From a user's perspective, however, it simply stores named trees of files.

The S3TS implementation is written in Python, and is available both as a library and as a command line tool.

S3TS Command Line Interface

The CLI is packaged as a single zip archive of Python code. There is no need to unpack it; it can be passed as an argument directly to the Python interpreter.

The CLI has a variety of subcommands:

:::text
$ python s3ts.zip --help
usage: main.py [-h]
               {init,list,remove,rename,info,upload,download,flush,flush-cache,install,verify-install,presign,download-http,install-http,prime-cache,upload-many,compare-packages,validate-local-cache}
               ...

positional arguments:
  {init,list,remove,rename,info,upload,download,flush,flush-cache,install,verify-install,presign,download-http,install-http,prime-cache,upload-many,compare-packages,validate-local-cache}
                        commands
    init                Initialise a new store
    list                List trees available in the store
    remove              Remove a tree from the store
    rename              Rename an existing tree in the store
    info                Show information about a tree
    upload              Upload a tree from the local filesystem
    download            Download a tree to the local cache
    flush               Flush chunks from the store that are no longer
                        referenced
    flush-cache         Flush chunks from the local cache that are not
                        referenced by the specified packages
    install             Download/Install a tree into the filesystem
    verify-install      Confirm a tree has been correctly installed
    presign             Generate a package definition containing presigned
                        urls
    download-http       Download a tree to the local cache using a presigned
                        package file
    install-http        Install a tree from local cache using a presigned
                        package file
    prime-cache         Prime the local cache with the contents of a local
                        directory
    upload-many         Upload multiple trees from the local filesystem
    compare-packages    Compare two packages
    validate-local-cache
                        Validates the local cache

optional arguments:
  -h, --help            show this help message and exit
$ 

Most of the subcommands rely on environment variables to configure access to resources. The necessary environment variables are:

Variable                  Description
---------------------     --------------------------------------------------
S3TS_LOCALCACHE           The local directory used to cache downloaded files
AWS_ACCESS_KEY_ID         The AWS access key
AWS_SECRET_ACCESS_KEY     The AWS secret for the key
S3TS_BUCKET               The name of the S3 bucket used for storage
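
For example, a wrapper script might set these before invoking the CLI. A minimal sketch, where the cache path, credentials and bucket name are placeholders to be substituted with your own values:

:::python
import os
import subprocess

# Placeholder values -- substitute your own cache path, credentials and bucket.
env = dict(os.environ,
           S3TS_LOCALCACHE="/var/cache/s3ts",
           AWS_ACCESS_KEY_ID="AKIA...",
           AWS_SECRET_ACCESS_KEY="...",
           S3TS_BUCKET="my-s3ts-bucket")

subprocess.check_call(["python", "s3ts.zip", "list"], env=env)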

list

The list subcommand shows the names of the packages currently stored:

$ python s3ts.zip list
v1.0
v1.1
$ 

info

The info subcommand shows detailed information about a single package:

$ python s3ts.zip info v1.1
Package: v1.1
Created At: 2015-04-13T11:42:23.862001
Total Size: 332 bytes
Files:
    assets/car-01.db (3 chunks, 230 bytes)
    code/file1.py (1 chunks, 46 bytes)
    code/file3.py (1 chunks, 56 bytes)
$ 

upload

The upload subcommand makes a named copy of a given local directory on S3. It only uploads files that are not already present. In the example below, we upload the contents of the /tmp/test/src-1 directory to S3 and name it junk-1.0:

$ python s3ts.zip upload junk-1.0 /tmp/test/src-1
0+322 bytes transferred
$ 

download

The download subcommand copies from S3 into the local cache all data required to install the specified package. Only data not already present will be downloaded.

$ python s3ts.zip download junk-1.0
322+0 bytes transferred
$ 

install

The install subcommand installs the specified package into the specified local directory. Data not present in the local cache will be downloaded from S3.

$ python s3ts.zip install junk-1.0
322+0 bytes transferred
$ 

prime-cache

The S3TS system manages a local cache of files to avoid unnecessary downloads. The prime-cache subcommand can be used to manually populate this local cache from a directory that is visible locally:

$ python s3ts.zip prime-cache /tmp/test/src-2
0+332 bytes transferred
$ 

verify-install

The verify-install command confirms that the contents of a directory tree (presumably populated with install) still match the package definition. The command succeeds if all files in the package are present. If additional non-package files are also present, they are listed when the --verbose option is provided.

$ python s3ts.zip verify-install /tmp/test/src-3
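
The check can be understood in terms of the package definition format shown under Internal Structure below, where every file entry carries a whole-file SHA1. A minimal sketch of the idea, independent of the actual s3ts implementation:

:::python
import hashlib
import json
import os

def verify_tree(pkg_def_path, install_dir):
    """Check each file entry in a package definition (the trees/NAME json
    shown under Internal Structure) against an installed directory."""
    with open(pkg_def_path) as f:
        pkg = json.load(f)
    ok = True
    for entry in pkg["files"]:
        path = os.path.join(install_dir, entry["path"])
        if not os.path.isfile(path):
            print("missing: %s" % entry["path"])
            ok = False
            continue
        with open(path, "rb") as f:
            sha1 = hashlib.sha1(f.read()).hexdigest()
        if sha1 != entry["sha1"]:
            print("modified: %s" % entry["path"])
            ok = False
    return ok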

flush

Packages uploaded to the store are split into chunks, which are stored indexed by the hash of their contents. These chunks are persistent, so if packages are deleted or updated, chunks may remain in the store even though they are no longer required. The flush command removes all chunks that are not referenced by any package. The --dry-run and --verbose options may be provided to discover how much data would be removed.

python s3ts.zip flush --verbose
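
Conceptually, flush is a mark-and-sweep over the store layout described under Internal Structure: collect every chunk referenced by a trees/ definition, then delete the rest. A hedged sketch using boto3 directly; it assumes plain packages whose definitions carry a files list, and is not the actual implementation:

:::python
import json
import boto3

def flush(bucket_name, dry_run=True):
    bucket = boto3.resource("s3").Bucket(bucket_name)

    # Mark: every (encoding, sha1) referenced by any package definition.
    referenced = set()
    for obj in bucket.objects.filter(Prefix="trees/"):
        pkg = json.loads(obj.get()["Body"].read())
        for f in pkg.get("files", []):
            for chunk in f["chunks"]:
                referenced.add((chunk["encoding"], chunk["sha1"]))

    # Sweep: delete chunk objects whose key maps to an unreferenced hash.
    # Keys look like chunks/<encoding>/XX/XXXX... per the documented layout.
    removed = 0
    for obj in bucket.objects.filter(Prefix="chunks/"):
        _, encoding, prefix, rest = obj.key.split("/")
        if (encoding, prefix + rest) not in referenced:
            removed += obj.size
            if not dry_run:
                obj.delete()
    print("%d bytes %s" % (removed, "would be removed" if dry_run else "removed"))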

flush-cache

Machines that download packages do so by fetching each chunk to a local cache. The cache is reused, making subsequent downloads much faster. However, over time the contents of this cache will grow to contain chunks that are no longer required. The flush-cache command removes unnecessary chunks from the local cache. The local cache doesn't contain package details, so it is necessary to provide the list of package names which must be preserved:

python s3ts.zip flush-cache junk-1.0 junk-1.2

The command accepts the --verbose and --dry-run options, making it possible to see how much data will be removed.

compare-packages

This command fetches the metadata for two packages, and summarizes the differences between them. This is useful for estimating the total size of a change. All sizes shown are in bytes. For example:

$ python s3ts.zip compare-packages DDS-v2.1.6a DDS-v2.1.6b
Fetching DDS-v2.1.6a...
Fetching DDS-v2.1.6b...
---
Updated config/base_constants.js (size 3,318)
Updated thirdparty/chrome-win32/debug.log (size 122)
Updated data/car_May16.xml (size 65,261)
Updated html/sr360/static/js/lib/jquery.ml-keyboard.js (size 17,150)

DDS-v2.1.6a size = 58,113,495,802
DDS-v2.1.6b size = 58,113,495,730
update size = 85,851
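
The comparison can be computed purely from the two package definitions, since each file entry carries a path, a whole-file SHA1, and chunk sizes. A sketch of the idea, not the actual CLI code:

:::python
def compare(pkg_a, pkg_b):
    """Summarize differences between two package definitions, given as
    dicts with the json structure shown under Internal Structure."""
    a = {f["path"]: f for f in pkg_a["files"]}
    b = {f["path"]: f for f in pkg_b["files"]}
    update_size = 0
    for path, fb in sorted(b.items()):
        fa = a.get(path)
        if fa is None or fa["sha1"] != fb["sha1"]:
            size = sum(c["size"] for c in fb["chunks"])
            verb = "Added" if fa is None else "Updated"
            print("%s %s (size %s)" % (verb, path, format(size, ",")))
            update_size += size
    for path in sorted(a):
        if path not in b:
            print("Removed %s" % path)
    print("update size = %s" % format(update_size, ","))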

new-metapackage

The new-metapackage subcommand creates a local template for a metapackage (it doesn't actually push a metapackage to the store). Running the command

$ python s3ts.zip new-metapackage release-x.json

will write a package template to the file release-x.json. It is expected that the user will edit this to correctly specify a new package, and then run upload-metapackage to upload the metapackage to S3. The json structure is:

{
  "name": "METANAME-VERSION", 
  "description": "", 
  "creationTime": "2017-02-21T14:53:00.332978", 
  "components": [
    {
      "subPackage": {
        "installPath": "SUBDIR1", 
        "packageName": "PACKAGE1-VERSION"
      }
    }
  ]
}

The user is expected to fill in:

  • METANAME-VERSION : the full name of the metapackage
  • SUBDIR1 : the subdirectory into which the first subpackage is to be installed
  • PACKAGE1-VERSION : the full name of the first subpackage

etc. Subpackages can be added or removed as required. Additionally, a special type of subpackage can be specified to support localization. One can write:

      "localizedPackage": {
        "installPath": "SUBDIR1",
        "localizedPackageName": "regional-info-{KIOSK_REGION}",
        "defaultPackageName": "regional-info-default"
      } 

which specifies that the package to be installed is determined by the region of the kiosk (e.g. if KIOSK_REGION is NSW, then regional-info-NSW would be installed). If the localized package is not available then the default package will be installed.
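
The selection logic amounts to a template substitution with a fallback. A sketch of that behaviour, where the `available` parameter is a hypothetical stand-in for a lookup of the package names present in the store:

:::python
import os

def resolve_component(component, available, region=None):
    """Return the concrete package name for one metapackage component.
    `available` is a set of package names known to exist in the store
    (a hypothetical stand-in for an actual store query)."""
    if "subPackage" in component:
        return component["subPackage"]["packageName"]
    lp = component["localizedPackage"]
    region = region or os.environ.get("KIOSK_REGION", "")
    name = lp["localizedPackageName"].replace("{KIOSK_REGION}", region)
    return name if name in available else lp["defaultPackageName"]

# e.g. with KIOSK_REGION=NSW, a store containing regional-info-NSW yields
# "regional-info-NSW"; otherwise "regional-info-default" is chosen.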

upload-metapackage

The upload-metapackage command reads a metapackage description from an existing json file, and uploads it to the store. For example:

$ python s3ts.zip upload-metapackage release-x.json

download-metapackage

The download-metapackage command fetches a named metapackage from the store, and writes it to a local json file. For example:

$ python s3ts.zip download-metapackage release-x release-x.json

upload-many

The upload subcommand (described above) creates a single named tree on S3. The upload-many command is used to create a collection of named trees. These trees will generally contain mostly common files, together with certain files that need to vary.

Running this command:

$ python s3ts.zip upload-many dds-1.1.7 variants common

will upload one tree for each subdirectory in variants. Each tree will contain the specific files from variants and all of the files from common.

To be more specific, running this command:

$ python s3ts.zip upload-many dds-1.1.7 c:\src-1.1.7-kiosks c:\src-1.1.7

with a directory tree:

c:\src-1.1.7\data\...
c:\src-1.1.7\code\...
c:\src-1.1.7-kiosks\S03-2067-P-001\keys\...
c:\src-1.1.7-kiosks\S02-2011-P-001\keys\...

will result in two trees being created on S3: dds-1.1.7:S03-2067-P-001 and dds-1.1.7:S02-2011-P-001. Each of these will contain the same data and code directories, but its own specific keys directory.
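
The naming scheme can be sketched as follows; this function only computes the plan (tree name to source directories) under the assumptions above, it does not upload anything:

:::python
import os

def upload_many_plan(base_name, variants_dir, common_dir):
    """One tree per subdirectory of variants_dir, each combining that
    variant's files with everything under common_dir."""
    plan = {}
    for variant in sorted(os.listdir(variants_dir)):
        vpath = os.path.join(variants_dir, variant)
        if os.path.isdir(vpath):
            plan["%s:%s" % (base_name, variant)] = [vpath, common_dir]
    return plan

# upload_many_plan("dds-1.1.7", r"c:\src-1.1.7-kiosks", r"c:\src-1.1.7")
# => {"dds-1.1.7:S03-2067-P-001": [...], "dds-1.1.7:S02-2011-P-001": [...]}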

Internal Structure

Within S3, files are broken into fixed-size chunks and stored indexed under the hash of their contents. Files are compressed when beneficial. An index file stores the definition of the files making up a package. This has the following json representation:

{
    "name": "junk-1.0",
    "creationTime": "2015-04-24T11:14:41.120851",
    "files": [
        {
            "chunks": [
                {
                    "encoding": "zlib",
                    "sha1": "a3084ba0d54df1557b8868ee86f7a7a3248fd26a",
                    "size": 100
                },
                {
                    "encoding": "zlib",
                    "sha1": "1aabeb499f883d83ba6be493048c9aaf1d7bd138",
                    "size": 100
                },
                {
                    "encoding": "raw",
                    "sha1": "3d9ee170a58d56596f1eba74a671093ecb665480",
                    "size": 30
                }
            ],
            "path": "assets/car-01.db",
            "sha1": "55af846da876a32091d08bd30b8bcb8e678899f6"
        },
        {
            "chunks": [
                {
                    "encoding": "raw",
                    "sha1": "54b52705acd64215a41eacd58bf1703f9d105db1",
                    "size": 45
                }
            ],
            "path": "code/file1.py",
            "sha1": "54b52705acd64215a41eacd58bf1703f9d105db1"
        },
        {
            "chunks": [
                {
                    "encoding": "raw",
                    "sha1": "5ab45f0437d7ae582e94cb7def459ea6a377a6c3",
                    "size": 47
                }
            ],
            "path": "code/file2.py",
            "sha1": "5ab45f0437d7ae582e94cb7def459ea6a377a6c3"
        }
    ]
}

Within the S3 bucket, the data is stored as follows:

  • compressed chunks are stored as chunks/zlib/XX/XXXXXXXXXXXXXXXXXX, where X… are the hex digits of the SHA1 hash of the (uncompressed) chunk
  • raw chunks are stored as chunks/raw/XX/XXXXXXXXXXXXXXXXXX, where X… are the hex digits of the SHA1 hash of the chunk
  • The package definition files are stored as json blobs (with the structure shown above), as trees/NAME
  • Configuration settings for the store are saved as config. This is a json blob specifying whether compression is enabled for this store, and the chunk size.
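
To make this layout concrete, here is a sketch of how a file maps onto chunk keys. The chunk size below is a placeholder; the real value is recorded in the store's config blob:

:::python
import hashlib
import zlib

CHUNK_SIZE = 1024 * 1024   # placeholder; the actual size comes from config

def chunk_keys(path):
    """Split a file into fixed-size chunks and compute the S3 key for
    each, following the layout described above."""
    keys = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # The key uses the SHA1 of the *uncompressed* chunk either way.
            sha1 = hashlib.sha1(chunk).hexdigest()
            encoding = "zlib" if len(zlib.compress(chunk)) < len(chunk) else "raw"
            keys.append("chunks/%s/%s/%s" % (encoding, sha1[:2], sha1[2:]))
    return keys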