filecache is a Python module that abstracts away the location where files used or
generated by a program are stored. Files can be on the local file system, in Google Cloud
Storage, on Amazon Web Services S3, or on a webserver. When files to be read are on the
local file system, they are simply accessed in-place. Otherwise, they are downloaded from
the remote source to a local temporary directory. When files to be written are on the
local file system, they are simply written in-place. Otherwise, they are written to a
local temporary directory and then uploaded to the remote location (it is not possible to
upload to a webserver). When a cache is no longer needed, it is deleted from the local
disk.
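As a rough illustration of the distinction drawn above between files accessed in-place and files that pass through a local temporary directory, the following sketch classifies a location by its URL scheme using only the standard library. The function name classify_location and the returned labels are hypothetical and are not part of the filecache API; the module's actual dispatch logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to a storage-location label.
_REMOTE_SCHEMES = {
    'gs': 'Google Cloud Storage',
    's3': 'Amazon Web Services S3',
    'http': 'webserver',
    'https': 'webserver',
}


def classify_location(location):
    """Return (label, needs_download, writable) for a path or URL."""
    scheme = urlparse(location).scheme.lower()
    if scheme in _REMOTE_SCHEMES:
        label = _REMOTE_SCHEMES[scheme]
        # Remote files are read through a local temporary copy. Per the
        # text above, uploading to a webserver is not possible.
        return label, True, label != 'webserver'
    # Assumption of this sketch: anything else is treated as a local path.
    return 'local file system', False, True


for loc in ('/data/binary1.bin',
            'gs://rms-filecache-tests/subdir1/subdir2a/binary1.bin',
            'https://pds-rings.seti.org/holdings/volumes/COISS_2xxx/COISS_2001/voldesc.cat'):
    print(loc, '->', classify_location(loc))
```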
filecache is a product of the PDS Ring-Moon Systems Node.
The filecache module is available via the rms-filecache package on PyPI and can be
installed with:
pip install rms-filecache
The top-level file organization is provided by the FileCache class. A FileCache
instance is used to specify a particular sharing policy and lifetime. For example,
a cache could be private to the current process and group a set of files that all have the
same basic purpose. Once these files have been (downloaded and) read, they are deleted as
a group. Another cache could be shared among all processes on the current machine and
group a set of files that are needed by multiple processes, thus allowing them to be
downloaded from a remote source only one time, saving time and bandwidth.
A FileCache can be instantiated either directly or as a context manager. When
instantiated directly, the programmer is responsible for calling FileCache.delete_cache
to delete the cache when finished (a non-shared cache will be automatically
deleted on program exit). When instantiated as a context manager, a non-shared cache is
deleted on exit from the context. See the class documentation for full details.
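The two lifetime styles described above can be pictured with a toy class built on the standard library. ToyCache is a hypothetical stand-in, not the filecache implementation: it only models a private cache as a uniquely named temporary directory, and it omits the automatic deletion on program exit as well as any shared-cache behavior.

```python
import shutil
import tempfile
from pathlib import Path


class ToyCache:
    """Toy stand-in for a private cache; NOT the filecache implementation."""

    def __init__(self):
        # A uniquely named temporary directory plays the role of the cache.
        self.cache_dir = Path(tempfile.mkdtemp(prefix='toy_cache_'))

    def delete_cache(self):
        # Safe to call more than once.
        shutil.rmtree(self.cache_dir, ignore_errors=True)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.delete_cache()
        return False  # Do not suppress exceptions


# Context-manager use: the directory is removed on exit from the block.
with ToyCache() as cache:
    managed_dir = cache.cache_dir
    (managed_dir / 'example.bin').write_bytes(b'data')
    existed_inside = managed_dir.exists()
exists_after_with = managed_dir.exists()

# Direct use: the caller is responsible for deleting the cache.
cache = ToyCache()
direct_dir = cache.cache_dir
cache.delete_cache()
exists_after_delete = direct_dir.exists()
```

The context-manager form is usually the safer choice because the cleanup also runs when an exception is raised inside the block.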
Usage examples:
from filecache import FileCache
# Create a cache with a unique name that will be deleted on exit
with FileCache(None) as fc:  # Use as context manager
    # Also use open() as a context manager
    with fc.open('gs://rms-filecache-tests/subdir1/subdir2a/binary1.bin', 'rb',
                 anonymous=True) as fp:
        bin1 = fp.read()
    with fc.open('s3://rms-filecache-tests/subdir1/subdir2a/binary1.bin', 'rb',
                 anonymous=True) as fp:
        bin2 = fp.read()
    assert bin1 == bin2
# Cache automatically deleted here
fc = FileCache(None)  # Use without context manager
# Also retrieve file without using open context manager
path1 = fc.retrieve('gs://rms-filecache-tests/subdir1/subdir2a/binary1.bin',
                    anonymous=True)
with open(path1, 'rb') as fp:
    bin1 = fp.read()
path2 = fc.retrieve('s3://rms-filecache-tests/subdir1/subdir2a/binary1.bin',
                    anonymous=True)
with open(path2, 'rb') as fp:
    bin2 = fp.read()
fc.delete_cache()  # Cache manually deleted here
assert bin1 == bin2
# Write a file to a bucket and read it back
with FileCache(None) as fc:
    with fc.open('gs://my-writable-bucket/output.txt', 'w') as fp:
        fp.write('A')
# The cache will be deleted here so the file will have to be downloaded
with FileCache(None) as fc:
    with fc.open('gs://my-writable-bucket/output.txt', 'r') as fp:
        print(fp.read())
The FCPath class is a reimplementation of the Python Path class to support remote
access using an associated FileCache. Like Path, an FCPath instance can contain any
part of a URI, but only an absolute URI can be used when actually accessing the file
specified by the FCPath. In addition, an FCPath can encapsulate various arguments such
as anonymous and time_out so that they do not need to be specified to each access
method. Thus, using this class can simplify the use of a FileCache by allowing the user
to operate on paths using the simpler syntax provided by Path, and to omit
various other parameters at each method call site. If an FCPath instance is created
without an explicitly associated FileCache, then the default FileCache() is used,
which specifies a shared cache named "global" that will persist after the program exits.
Compare this example to the one above:
from filecache import FileCache, FCPath
# Create a cache with a unique name that will be deleted on exit
with FileCache(None) as fc:  # Use as context manager
    # Use GS by specifying the bucket name and one directory level
    p1 = fc.new_path('gs://rms-filecache-tests/subdir1', anonymous=True)
    # Use S3 by specifying the bucket name and two directory levels
    # Alternative creation method
    p2 = FCPath('s3://rms-filecache-tests/subdir1/subdir2a', filecache=fc,
                anonymous=True)
    # Access GS using a directory + filename (since only one directory level
    # was specified by the FCPath)
    # The additional directory and filename are specified as an argument to open()
    # Also use open() as a context manager
    with p1.open('subdir2a/binary1.bin', 'rb') as fp:
        bin1 = fp.read()
    # Access S3 using a filename only (since two directory levels were already
    # specified by the FCPath)
    # The additional filename is specified by using the / operator to create a new
    # FCPath instance; anonymous=True is inherited
    with (p2 / 'binary1.bin').open(mode='rb') as fp:
        bin2 = fp.read()
    assert bin1 == bin2
# Cache automatically deleted here
A benefit of the abstraction is that different environments can access the same files in
different ways without needing to change the program code. For example, consider a program
that needs to access the file COISS_2xxx/COISS_2001/voldesc.cat from the NASA PDS
archives. This file might be stored on the local disk in the user's home directory in a
subdirectory called pds3-holdings. Or if the user does not have a local copy, it is
accessible from a webserver at
https://pds-rings.seti.org/holdings/volumes/COISS_2xxx/COISS_2001/voldesc.cat.
Finally, it could be accessible from Google Cloud Storage from the rms-node-holdings
bucket at
gs://rms-node-holdings/pds3-holdings/volumes/COISS_2xxx/COISS_2001/voldesc.cat. Before
running the program, an environment variable could be set to one of these values:
$ export PDS3_HOLDINGS_SRC="~/pds3-holdings"
$ export PDS3_HOLDINGS_SRC="https://pds-rings.seti.org/holdings"
$ export PDS3_HOLDINGS_SRC="gs://rms-node-holdings/pds3-holdings"
Then the program could be written as:
from filecache import FileCache
import os
with FileCache(None) as fc:
    p = fc.new_path(os.getenv('PDS3_HOLDINGS_SRC'))
    voldesc_path = p / 'volumes/COISS_2xxx/COISS_2001/voldesc.cat'
    contents = voldesc_path.read_text()
# Cache automatically deleted here
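To make the idea concrete, the sketch below shows how the same relative path resolves under each of the three roots listed above, and how a program might guard against the environment variable being unset (os.getenv returns None in that case). The helper resolve_holdings_path is hypothetical and uses plain string joining; in the program above, the / operator of FCPath performs the join instead.

```python
import os


def resolve_holdings_path(relative_path, env_var='PDS3_HOLDINGS_SRC'):
    """Join the root named by an environment variable with a relative path."""
    root = os.environ.get(env_var)
    if root is None:
        # Fail early with a clear message instead of passing None along.
        raise RuntimeError(f'Environment variable {env_var} is not set')
    return root.rstrip('/') + '/' + relative_path.lstrip('/')


# The three roots from the text; only the environment changes, not the code.
resolved = []
for root in ('~/pds3-holdings',
             'https://pds-rings.seti.org/holdings',
             'gs://rms-node-holdings/pds3-holdings'):
    os.environ['PDS3_HOLDINGS_SRC'] = root
    resolved.append(
        resolve_holdings_path('volumes/COISS_2xxx/COISS_2001/voldesc.cat'))

for url in resolved:
    print(url)
```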
If the program is run multiple times in a row, or multiple copies are
run simultaneously, using a shared cache allows all of the processes to share
the same copy, thus requiring only a single download no matter how many times the program
is run. A shared cache is indicated by giving the cache a name (or no argument, which
defaults to "global"); also, FCPath defaults to using the global cache if no
FileCache is specified. This results in the simplest form of the program:
from filecache import FCPath
import os
p = FCPath(os.getenv('PDS3_HOLDINGS_SRC'))
voldesc_path = p / 'volumes/COISS_2xxx/COISS_2001/voldesc.cat'
contents = voldesc_path.read_text()
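The saving that a shared cache provides can be pictured with a download-once helper. This is a simplified sketch using only the standard library, not the filecache implementation: fetch_once and fake_download are hypothetical names, the temporary directory stands in for a persistent named cache, and the locking a real shared cache needs to coordinate simultaneous processes is omitted.

```python
import tempfile
from pathlib import Path


def fetch_once(cache_dir, name, download):
    """Return the local path for `name`, calling `download` only if needed."""
    local_path = Path(cache_dir) / name
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(download())
    return local_path


calls = []  # Records each simulated remote fetch


def fake_download():
    # Stand-in for a slow remote fetch.
    calls.append('download')
    return b'example contents'


with tempfile.TemporaryDirectory() as shared_dir:
    first = fetch_once(shared_dir, 'volumes/voldesc.cat', fake_download)
    second = fetch_once(shared_dir, 'volumes/voldesc.cat', fake_download)
    same_path = first == second
    contents = second.read_bytes()
```

The second call finds the file already present and skips the download, which is the effect a shared cache has across repeated runs.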
Finally, there are four classes that allow direct access to the four possible storage
locations without invoking any caching behavior: FileCacheSourceLocal,
FileCacheSourceHTTP, FileCacheSourceGS, and FileCacheSourceS3:
from filecache import FileCacheSourceGS
src = FileCacheSourceGS('gs://rms-filecache-tests', anonymous=True)
src.retrieve('subdir1/subdir2a/binary1.bin', 'local_file.bin')
Details of each class are available in the module documentation.
Information on contributing to this package can be found in the Contributing Guide.
This code is licensed under the Apache License v2.0.