The S3 Tree Store (S3TS) software is a simple system for managing trees of related files stored efficiently in Amazon S3. Trees are stored in a structure designed to support differential, incremental uploads and downloads, with compression and data deduplication. From a user's perspective, however, it just stores named trees of files.
The S3TS implementation is written in Python, and is available both as a library and as a command line tool.
The CLI is packaged as a single zip archive of Python code. There is no need to unpack it; it can be passed as an argument directly to the Python interpreter.
The CLI has a variety of subcommands:
:::text
$ python s3ts.zip --help
usage: main.py [-h]
               {init,list,remove,rename,info,upload,download,flush,flush-cache,install,verify-install,presign,download-http,install-http,prime-cache,upload-many,validate-local-cache}
               ...

positional arguments:
  {init,list,remove,rename,info,upload,download,flush,flush-cache,install,verify-install,presign,download-http,install-http,prime-cache,upload-many,validate-local-cache}
                        commands
    init                Initialise a new store
    list                List trees available in the store
    remove              Remove a tree from the store
    rename              Rename an existing tree in the store
    info                Show information about a tree
    upload              Upload a tree from the local filesystem
    download            Download a tree to the local cache
    flush               Flush chunks from the store that are no longer
                        referenced
    flush-cache         Flush chunks from the local cache that are not
                        referenced by the specified packages
    install             Download/Install a tree into the filesystem
    verify-install      Confirm a tree has been correctly installed
    presign             Generate a package definition containing presigned
                        urls
    download-http       Download a tree to the local cache using a presigned
                        package file
    install-http        Install a tree from local cache using a presigned
                        package file
    prime-cache         Prime the local cache with the contents of a local
                        directory
    upload-many         Upload multiple trees from the local filesystem
    compare-packages    Compare two packages
    validate-local-cache
                        Validates the local cache

optional arguments:
  -h, --help            show this help message and exit
$
Most of the subcommands rely on environment variables to configure access to resources. The necessary environment variables are:
Variable | Description
---|---
S3TS_LOCALCACHE | The local directory used to cache downloaded files
AWS_ACCESS_KEY_ID | The AWS access key
AWS_SECRET_ACCESS_KEY | The AWS secret for the key
S3TS_BUCKET | The name of the S3 bucket used for storage
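Since every subcommand reads this configuration from the environment, a quick pre-flight check can catch a missing variable before the CLI is invoked. This is a minimal sketch, not part of S3TS itself:

:::python
import os
import sys

# The configuration variables documented in the table above.
REQUIRED_VARS = [
    "S3TS_LOCALCACHE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3TS_BUCKET",
]

missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
if missing:
    sys.exit("missing environment variables: " + ", ".join(missing))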
The `list` subcommand shows the names of the packages currently stored:
$ python s3ts.zip list
v1.0
v1.1
$
The `info` subcommand shows detailed information about a single package:
$ python s3ts.zip info v1.1
Package: v1.1
Created At: 2015-04-13T11:42:23.862001
Total Size: 332 bytes
Files:
assets/car-01.db (3 chunks, 230 bytes)
code/file1.py (1 chunks, 46 bytes)
code/file3.py (1 chunks, 56 bytes)
$
The `upload` subcommand makes a named copy of a given local directory on S3. It only uploads files that are not already present. In the example below, we upload the contents of the `/tmp/test/src-1` directory to S3 and call it `junk-1.0`:
$ python s3ts.zip upload junk-1.0 /tmp/test/src-1
0+322 bytes transferred
$
The `download` subcommand copies from S3 into the local cache all data required to install the specified package. Only data not already present will be downloaded.
$ python s3ts.zip download junk-1.0
322+0 bytes transferred
$
The `install` subcommand installs the specified package into the specified local directory. Data not present in the local cache will be downloaded from S3.
$ python s3ts.zip install junk-1.0
322+0 bytes transferred
$
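Conceptually, installation reassembles each file by concatenating its chunks in order; the result can then be checked against the file's SHA1 recorded in the package definition (the JSON structure shown later on this page). A minimal sketch of that step, where `read_chunk` is a hypothetical helper returning a chunk's uncompressed bytes from the local cache:

:::python
import hashlib

def reassemble(chunks, read_chunk, out_path, expected_sha1):
    # Write the chunks in order, hashing as we go so the result
    # can be verified against the package definition.
    digest = hashlib.sha1()
    with open(out_path, "wb") as f:
        for chunk in chunks:
            data = read_chunk(chunk["sha1"])  # hypothetical cache lookup
            digest.update(data)
            f.write(data)
    if digest.hexdigest() != expected_sha1:
        raise ValueError("sha1 mismatch for " + out_path)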
The S3TS system manages a local cache of files to avoid unnecessary downloads. The `prime-cache` subcommand can be used to manually populate this local cache from a directory that is visible locally:
$ python s3ts.zip prime-cache /tmp/test/src-2
0+332 bytes transferred
$
The `verify-install` command confirms that the contents of the directory tree (presumably populated with `install`) still match the package definition. The command will succeed if all files in the package are present. If additional non-package files are also present, they are listed when the `--verbose` option is provided.
$ python s3ts.zip verify-install /tmp/test/src-3
Packages uploaded to the store are split into chunks, and stored according to the hash of their contents. These chunks are persistent, and if packages are deleted or updated, chunks may still be present in the store even though they are no longer required. The `flush` command removes all chunks that are not referenced by any package. The `--dry-run` and `--verbose` options may be provided to discover how much data will be removed.
python s3ts.zip flush --verbose
Machines that download packages do so by fetching each chunk to a local cache. The cache is reused, making subsequent downloads much faster. However, over time the contents of this cache will grow to contain chunks that are no longer required. The `flush-cache` command removes unnecessary chunks from the local cache. The local cache doesn't contain package details, so it is necessary to provide the list of package names which must be preserved:
python s3ts.zip flush-cache junk-1.0 junk-1.2
The command accepts `--verbose` and `--dry-run` options, making it possible to see how much data will be removed.
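The set of chunks to preserve can be computed directly from the package definitions, since each definition lists the chunks of every file (see the JSON structure later on this page). A sketch of that mark step; any cached chunk whose hash is not in the returned set is a candidate for removal:

:::python
def chunks_to_keep(package_defs):
    # package_defs: parsed package definitions (dicts) in the
    # JSON shape shown later on this page.
    keep = set()
    for pkg in package_defs:
        for f in pkg["files"]:
            for chunk in f["chunks"]:
                keep.add(chunk["sha1"])
    return keep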
The `compare-packages` command fetches the metadata for two packages, and summarizes the differences between them. This is useful for estimating the total size of a change. All sizes shown are in bytes. For example:
$ python s3ts.zip compare-packages DDS-v2.1.6a DDS-v2.1.6b
Fetching DDS-v2.1.6a...
Fetching DDS-v2.1.6b...
---
Updated config/base_constants.js (size 3,318)
Updated thirdparty/chrome-win32/debug.log (size 122)
Updated data/car_May16.xml (size 65,261)
Updated html/sr360/static/js/lib/jquery.ml-keyboard.js (size 17,150)
DDS-v2.1.6a size = 58,113,495,802
DDS-v2.1.6b size = 58,113,495,730
update size = 85,851
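Given the package definition structure shown later on this page, the reported update size can be understood as the total size of chunks referenced by the new package but not by the old one. A sketch of that calculation over two already-fetched package definitions:

:::python
def update_size(old_pkg, new_pkg):
    # Sizes are taken from the package definitions, i.e. they
    # are uncompressed chunk sizes in bytes.
    old_chunks = {c["sha1"] for f in old_pkg["files"] for c in f["chunks"]}
    return sum(
        c["size"]
        for f in new_pkg["files"]
        for c in f["chunks"]
        if c["sha1"] not in old_chunks
    )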
The `new-metapackage` subcommand creates a local template for a metapackage (it doesn't actually push a metapackage to the store). Running the command
$ python s3ts.zip new-metapackage release-x.json
will write a package template to the file `release-x.json`. It is expected that the user will edit this to correctly specify a new package, and then run `upload-metapackage` to upload the metapackage to S3. The json structure is:
{
    "name": "METANAME-VERSION",
    "description": "",
    "creationTime": "2017-02-21T14:53:00.332978",
    "components": [
        {
            "subPackage": {
                "installPath": "SUBDIR1",
                "packageName": "PACKAGE1-VERSION"
            }
        }
    ]
}
The user is expected to fill in:

- `METANAME-VERSION`: the full name of the metapackage
- `SUBDIR1`: the subdirectory into which the first subpackage is to be installed
- `PACKAGE1-VERSION`: the full name of the first subpackage

and so on. Subpackages can be added or removed as required. Additionally, a special type of subpackage can be specified to support localization. One can write:
"localizedPackage": {
"installPath": "SUBDIR1",
"localizedPackageName": "regional-info-{KIOSK_REGION}",
"defaultPackageName": "regional-info-default"
}
which specifies that the package to be installed is determined by the region of the kiosk (e.g. if `KIOSK_REGION` is `NSW`, then we would install `regional-info-NSW`). If the localized package is not available then the default package will be installed.
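The substitution itself is straightforward. The sketch below shows one plausible resolution step; the function and its arguments are illustrative, not the actual S3TS API:

:::python
def resolve_package_name(component, region, available_packages):
    # Substitute the kiosk's region into the name template, then
    # fall back to the default package if no localized variant exists.
    localized = component["localizedPackageName"].replace(
        "{KIOSK_REGION}", region
    )
    if localized in available_packages:
        return localized
    return component["defaultPackageName"]

# e.g. with region "NSW" this yields "regional-info-NSW" when that
# package is in the store, otherwise "regional-info-default".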
The `upload-metapackage` command reads a metapackage description from an existing json file, and uploads it to the store. For example:
$ python s3ts.zip upload-metapackage release-x.json
The `download-metapackage` command fetches a named metapackage from the store, and writes it to a local json file. For example:
$ python s3ts.zip download-metapackage release-x release-x.json
The `upload` subcommand (described above) creates a single named tree on S3. The `upload-many` command is used to create a collection of named trees. These trees will generally contain mostly common files, but with certain files that need to vary.
Running this command:
$ python s3ts.zip upload-many dds-1.1.7 variants common
will upload one tree for each subdirectory in `variants`. Each tree will contain the specific files from `variants` and all of the files from `common`.
To be more specific, running this command:
$ python s3ts.zip upload-many dds-1.1.7 c:\src-1.1.7-kiosks c:\src-1.1.7
with a directory tree:
c:\src-1.1.7\data\...
c:\src-1.1.7\code\...
c:\src-1.1.7-kiosks\S03-2067-P-001\keys\...
c:\src-1.1.7-kiosks\S02-2011-P-001\keys\...
will result in two trees being created on S3: `dds-1.1.7:S03-2067-P-001` and `dds-1.1.7:S02-2011-P-001`. Each of these will contain the same `data` and `code` directories, but a specific `keys` directory.
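The tree names follow mechanically from the variants directory. This sketch (illustrative, not the S3TS API) enumerates the names that `upload-many` would create:

:::python
import os

def variant_tree_names(base_name, variants_dir):
    # One tree per subdirectory of the variants directory,
    # named BASE:VARIANT as in the example above.
    return [
        "%s:%s" % (base_name, entry)
        for entry in sorted(os.listdir(variants_dir))
        if os.path.isdir(os.path.join(variants_dir, entry))
    ]

# e.g. variant_tree_names("dds-1.1.7", r"c:\src-1.1.7-kiosks")
#   -> ["dds-1.1.7:S02-2011-P-001", "dds-1.1.7:S03-2067-P-001"]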
Within S3, files are broken into fixed-size chunks, and stored indexed under the hash of their contents. The files are compressed when beneficial. An index file stores the definition of the files making up a package. This has the following json representation:
{
    "name": "junk-1.0",
    "creationTime": "2015-04-24T11:14:41.120851",
    "files": [
        {
            "chunks": [
                {
                    "encoding": "zlib",
                    "sha1": "a3084ba0d54df1557b8868ee86f7a7a3248fd26a",
                    "size": 100
                },
                {
                    "encoding": "zlib",
                    "sha1": "1aabeb499f883d83ba6be493048c9aaf1d7bd138",
                    "size": 100
                },
                {
                    "encoding": "raw",
                    "sha1": "3d9ee170a58d56596f1eba74a671093ecb665480",
                    "size": 30
                }
            ],
            "path": "assets/car-01.db",
            "sha1": "55af846da876a32091d08bd30b8bcb8e678899f6"
        },
        {
            "chunks": [
                {
                    "encoding": "raw",
                    "sha1": "54b52705acd64215a41eacd58bf1703f9d105db1",
                    "size": 45
                }
            ],
            "path": "code/file1.py",
            "sha1": "54b52705acd64215a41eacd58bf1703f9d105db1"
        },
        {
            "chunks": [
                {
                    "encoding": "raw",
                    "sha1": "5ab45f0437d7ae582e94cb7def459ea6a377a6c3",
                    "size": 47
                }
            ],
            "path": "code/file2.py",
            "sha1": "5ab45f0437d7ae582e94cb7def459ea6a377a6c3"
        }
    ]
}
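Chunk descriptors of the shape shown above can be produced by reading fixed-size chunks, hashing each one, and compressing with zlib when that actually shrinks the data. A minimal sketch; the chunk size here is illustrative, the real value comes from the store configuration:

:::python
import hashlib
import zlib

def chunk_file(path, chunk_size=100):
    # Split a file into fixed-size chunks and build descriptors.
    # Hashes and sizes refer to the uncompressed chunk contents.
    descriptors = []
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            compressed = zlib.compress(data)
            encoding = "zlib" if len(compressed) < len(data) else "raw"
            descriptors.append({
                "encoding": encoding,
                "sha1": hashlib.sha1(data).hexdigest(),
                "size": len(data),
            })
    return descriptors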
Within the S3 bucket, the data is stored as follows (the chunk key layout is sketched in code after this list):

- Compressed chunks are stored as `chunks/zlib/XX/XXXXXXXXXXXXXXXXXX`, where `X..` are the digits of the SHA1 hash of the (uncompressed) chunk.
- Raw chunks are stored as `chunks/raw/XX/XXXXXXXXXXXXXXXXXX`, where `X..` are the digits of the SHA1 hash of the chunk.
- The package definition files are stored as json blobs (with the structure shown above), as `trees/NAME`.
- Configuration settings for the store are saved as `config`. This is a json blob specifying whether compression is enabled for this store, and the chunk size.
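Putting the layout together, a chunk's S3 key follows from its encoding and hash. A sketch, assuming the first two hex digits of the SHA1 form the directory prefix, as the `XX/` pattern above suggests:

:::python
def chunk_key(encoding, sha1_hex):
    # encoding is "zlib" or "raw"; sha1_hex is the hex SHA1 of the
    # (uncompressed) chunk contents.
    return "chunks/%s/%s/%s" % (encoding, sha1_hex[:2], sha1_hex[2:])

# e.g. chunk_key("zlib", "a3084ba0d54df1557b8868ee86f7a7a3248fd26a")
#   -> "chunks/zlib/a3/084ba0d54df1557b8868ee86f7a7a3248fd26a"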