Query SQLite databases in S3 using s3fs
APSW SQLite VFS. This VFS enables reading databases from S3 using s3fs. This only supports reading operations, any operation that tries to modify the DB file is ignored.
Inspired by sqlite-s3vfs and sqlite-s3-query.
This VFS will only work when the DB file is in any journal mode that is not WAL. However, it will work if you set the journal mode to something else just before uploading the file to S3. You can (and probably should) use WAL mode to generate the DB. Then you can change the journal mode (and the page size if you neeed) before uploading it to S3.
The test suite
includes
tests for that use case. Take into account that the page size can't be changed
when the database is in WAL mode. You need to change it before setting the WAL
mode or by setting the database to rollback journal mode. You need to execute
VACUUM;
after changing the page
size in a SQLite database.
import s3fs
import s3sqlite
import apsw
# Create an S3 filesystem. Check the s3fs docs for more examples:
# https://s3fs.readthedocs.io/en/latest/
s3 = s3fs.S3FileSystem(
key="somekey",
secret="secret",
client_kwargs={"endpoint_url": "http://..."},
)
s3vfs = s3sqlite.S3VFS(name="s3-vfs", fs=s3)
# Define the S3 location
key_prefix = "mybucket/awesome.sqlite3"
# Upload the file to S3
s3vfs.upload_file("awesome.sqlite3", dest=key_prefix)
# Create a database and query it
with apsw.Connection(
key_prefix, vfs=s3vfs.name, flags=apsw.SQLITE_OPEN_READONLY
) as conn:
cursor = conn.execute("...")
print(cursor.fetchall())
python3 -m pip install s3sqlite
The testing script will use the Chinook
database, it will modify (and
VACUUM;
) the file to use all the possible combinations of journal modes and
page sizes
- Download the chinook database:
curl https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite -o chinook.sqlite3
- Make sure you have Docker installed.
The testing script will start a MinIO container to run the tests locally. After the tests finish, the container will be stopped atuomatically.
- Run the tests:
python3 -m pytest test.py
- sqlite-s3vfs: This VFS stores the
SQLite file as separate DB pages. This enables having a single writer without
having to overwrite the whole file.
s3sqlite
's main difference is that this just needs uploading a single file to S3.sqlite-s3vfs
will split the database in pages and upload the pages separately to a bucket prefix. Having just a single file has some advantages, like making use of object versioning in the bucket. I also think that relying on s3fs makes the VFS more flexible than callingboto3
assqlite3-s3vfs
does.s3fs
should also handle retries automatically. - sqlite-s3-query: This VFS is very
similar to
s3sqlit
, but it usesctypes
directly to create the VFS and useshttpx
to make requests to S3.
I decided to create a new VFS that didn't require using ctypes
so that it's
easier to understand and maintain, but I still want to have a single file in S3
(vs. separate DB pages). At the same time, by using
s3f3 I know I can use any S3
storage supported by that library.
The Chinook database used for testing can be obtained from: https://github.com/lerocha/chinook-database/
The testing section in this README contains a command you can run to get the file.
Distributed under the Apache 2.0 license. See LICENSE
for more information.