Overhaul Object Store Backends #18117

Closed

40 commits
31a9bc2
Add test case for GCP S3 Interop Object Stores.
jmchilton May 6, 2024
95913c6
De-duplication around object store permission restoration.
jmchilton May 6, 2024
fa53abd
De-duplication around object store _construct_path.
jmchilton May 6, 2024
08e0d10
Reduce duplication around _pull_into_cache.
jmchilton May 7, 2024
e2c3673
Converge a few cloud object stores toward same names ahead of refactor.
jmchilton May 7, 2024
6e79b10
De-duplicate object store _exists across 4 object stores.
jmchilton May 7, 2024
b098da3
Remove unused object store method.
jmchilton May 7, 2024
c8147f3
Remove duplicated _create logic across 4 object stores.
jmchilton May 7, 2024
65d4018
Remove duplicated _empty across caching object stores.
jmchilton May 7, 2024
6b2c0f1
Refactor names to make copy-paste clear.
jmchilton May 7, 2024
3cf2ec1
Eliminate copy-paste for _size across 4 object stores.
jmchilton May 8, 2024
9996d3f
Remove duplication around _get_data in object stores.
jmchilton May 7, 2024
8038932
Remove _in_cache from s3 (rebase above somewhere?)
jmchilton May 7, 2024
be5983d
Remove duplication around _get_filename.
jmchilton May 7, 2024
8aabf8b
Remove get_file_name from cloud.
jmchilton May 8, 2024
7c8978b
Remove duplication around _update_from_file.
jmchilton May 8, 2024
465ca6f
Refactor object stores so we can remove _delete...
jmchilton May 7, 2024
2fd544e
Remove duplication around _delete in object stores.
jmchilton May 7, 2024
d50ed66
Remove a bunch of extra imports now.
jmchilton May 8, 2024
c8a0834
Remove some duplication around cache targets.
jmchilton May 8, 2024
7229ae3
Remove unused copy-pasta from cloud object store.
jmchilton May 8, 2024
9341bfb
Remove wrong comments.
jmchilton May 8, 2024
deb0494
Remove duplication around cache monitor starting.
jmchilton May 8, 2024
1ac4a0b
Converge _download a bit if needed.
jmchilton May 8, 2024
ab0f3a7
Re-work object store de-duplication for proper typing.
jmchilton May 8, 2024
702fe36
Fix duplication around cache logging.
jmchilton May 8, 2024
c0e09c9
De-duplicate _push_to_os.
jmchilton May 8, 2024
c429047
Fix irods test cases for code refactoring.
jmchilton May 8, 2024
6426233
Fix axel call across s3 & cloud object stores.
jmchilton May 8, 2024
7fd767e
Merge axel download stuff together.
jmchilton May 8, 2024
13e1b32
Implement a boto3 object store.
jmchilton May 7, 2024
8dadbcf
Implement advanced transfer options for boto3.
jmchilton May 8, 2024
8e59873
Rev object store docs.
jmchilton May 9, 2024
c19e28e
Allow older style connection parameters on newer boto3 object store.
jmchilton May 9, 2024
7506f43
Drop logging that contains a password.
jmchilton May 9, 2024
c6a2473
Revise object store changes based on PR review.
jmchilton May 9, 2024
524f2ae
Prevent incomplete files from sticking around in the object store cache.
jmchilton May 9, 2024
8d060b3
Do not allow creation of object stores with non-writable caches.
jmchilton May 9, 2024
52e03da
Various object store extra files handling fixes.
jmchilton May 10, 2024
9d1ba54
Fix extra files handling for cached object stores when reset_cache ca…
jmchilton May 10, 2024
87 changes: 84 additions & 3 deletions lib/galaxy/config/sample/object_store_conf.sample.yml
@@ -135,10 +135,64 @@ backends:
    store_by: uuid
    files_dir: /old-fs/galaxy/files


# There are now four ways to access S3-related services. Two are
# suitable just for AWS services (aws_s3 & cloud), one is better
# suited for non-AWS S3-compatible services (generic_s3),
# and finally boto3 gracefully handles either scenario.
#
# boto3 is built on the newest and most widely used Python client
# outside of Galaxy. It has advanced transfer options and is likely
# the client you should use for new setups. generic_s3 and aws_s3
# have existed in Galaxy for longer and could perhaps be considered
# more battle-tested. Both boto3 and generic_s3 have been tested
# with multiple non-AWS APIs, including minio and GCP. The cloud
# implementation is based on CloudBridge, is still supported,
# and has been recently tested - the downsides are mostly that the
# advanced multi-threaded transfer options of boto3 are not available
# and that it has not been battle-tested like aws_s3.

#
# Sample AWS S3 Object Store configuration (newest boto3 client)
#
type: boto3
auth:
  access_key: ...
  secret_key: ...
bucket:
  name: unique_bucket_name_all_lowercase
connection:  # not strictly needed but more of the API works with this.
  region: us-east-1
transfer:
  multipart_threshold: 10000000
  download_max_concurrency: 5
  upload_max_concurrency: 10
  # Any of these options:
  # multipart_threshold, max_concurrency, multipart_chunksize,
  # num_download_attempts, max_io_queue, io_chunksize, use_threads,
  # and max_bandwidth
  # can be set. By default they apply to both uploads and downloads,
  # but they can be prefixed with upload_ or download_ as shown above
  # to apply to just one scenario (see the sketch after this sample).
  # More information about these parameters can be found at:
  # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig

cache:
  path: database/object_store_cache_s3
  size: 1000
  cache_updated_data: true
extra_dirs:
- type: job_work
  path: database/job_working_directory_s3
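The transfer options in the sample above correspond to boto3's TransferConfig. The following is a minimal illustrative sketch, not part of this PR: it assumes boto3 is installed, and the bucket and key names are placeholders; it shows roughly how the sample YAML values map onto the boto3 client API.

# Illustrative only: maps the sample transfer options onto boto3's TransferConfig.
import boto3
from boto3.s3.transfer import TransferConfig

# multipart_threshold: 10000000 applies to both directions in the sample;
# upload_max_concurrency: 10 / download_max_concurrency: 5 split by direction.
upload_config = TransferConfig(multipart_threshold=10_000_000, max_concurrency=10)
download_config = TransferConfig(multipart_threshold=10_000_000, max_concurrency=5)

client = boto3.client("s3", region_name="us-east-1")
# client.upload_file("local.dat", "some-bucket", "some/key", Config=upload_config)
# client.download_file("some-bucket", "some/key", "local.dat", Config=download_config)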



#
# Sample AWS S3 Object Store configuration
# Sample AWS S3 Object Store configuration (legacy boto implementation)
#

# This implementation will use axel automatically for file transfers if it is on
# Galaxy's path. Otherwise, it will use various Python-based strategies for
# multi-part upload of large files, but all downloads will be single-threaded.
type: aws_s3
auth:
  access_key: ...
@@ -147,6 +201,8 @@ bucket:
  name: unique_bucket_name_all_lowercase
  use_reduced_redundancy: false
  max_chunk_size: 250
connection:  # not strictly needed but more of the API works with this.
  region: us-east-1
cache:
  path: database/object_store_cache_s3
  size: 1000
@@ -182,7 +238,32 @@ extra_dirs:
  path: database/job_working_directory_irods

#
# Sample non-AWS S3 Object Store (e.g. swift) configuration
# Sample non-AWS S3 Object Store (e.g. swift) configuration (boto3)
#

type: boto3
auth:
  access_key: ...
  secret_key: ...
bucket:
  name: unique_bucket_name_all_lowercase
connection:
  endpoint_url: https://swift.example.org:6000/
  # region: some services may make use of a region if specified.
  # The older style host, port, secure, and conn_path options available to generic_s3
  # also work here - Galaxy will just infer an endpoint_url from those
  # (see the sketch after this sample).
cache:
  path: database/object_store_cache_swift
  size: 1000
  cache_updated_data: true
# transfer: # see the transfer options for boto3 above in the AWS configuration.
extra_dirs:
- type: job_work
  path: database/job_working_directory_swift
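As noted in the comments above, Galaxy can infer an endpoint_url from the older generic_s3-style host, port, secure, and conn_path options. The sketch below is a hypothetical illustration of that mapping only; the function name and defaults are placeholders, not Galaxy's actual implementation.

# Hypothetical illustration of inferring an endpoint_url from older generic_s3-style options.
def infer_endpoint_url(host: str, port: int, secure: bool = True, conn_path: str = "/") -> str:
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}:{port}{conn_path}"

# infer_endpoint_url("swift.example.org", 6000, True, "/") == "https://swift.example.org:6000/"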


#
# Sample non-AWS S3 Object Store (e.g. swift) configuration (legacy boto client)
#

type: generic_s3
3 changes: 3 additions & 0 deletions lib/galaxy/dependencies/__init__.py
@@ -234,6 +234,9 @@ def check_python_pam(self):
    def check_azure_storage(self):
        return "azure_blob" in self.object_stores

    def check_boto3(self):
        return "boto3" in self.object_stores

    def check_kamaki(self):
        return "pithos" in self.object_stores

62 changes: 32 additions & 30 deletions lib/galaxy/objectstore/__init__.py
@@ -55,7 +55,10 @@
from .caching import CacheTarget

if TYPE_CHECKING:
    from galaxy.model import DatasetInstance
    from galaxy.model import (
        Dataset,
        DatasetInstance,
    )

NO_SESSION_ERROR_MESSAGE = (
    "Attempted to 'create' object store entity in configuration with no database session present."
@@ -373,16 +376,6 @@ def shutdown(self):
        """Close any connections for this ObjectStore."""
        self.running = False

    def file_ready(
        self, obj, base_dir=None, dir_only=False, extra_dir=None, extra_dir_at_root=False, alt_name=None, obj_dir=False
    ):
        """
        Check if a file corresponding to a dataset is ready to be used.

        Return True if so, False otherwise
        """
        return True

    @classmethod
    def parse_xml(clazz, config_xml):
        """Parse an XML description of a configuration for this object store.
@@ -938,10 +931,6 @@ def _exists(self, obj, **kwargs):
        """Determine if the `obj` exists in any of the backends."""
        return self._call_method("_exists", obj, False, False, **kwargs)

    def file_ready(self, obj, **kwargs):
        """Determine if the file for `obj` is ready to be used by any of the backends."""
        return self._call_method("file_ready", obj, False, False, **kwargs)

    def _create(self, obj, **kwargs):
        """Create a backing file in a random backend."""
        objectstore = random.choice(list(self.backends.values()))
@@ -1400,6 +1389,10 @@ def type_to_object_store_class(store: str, fsmon: bool = False) -> Tuple[Type[Ba
    objectstore_constructor_kwds = {}
    if store == "disk":
        objectstore_class = DiskObjectStore
    elif store == "boto3":
        from .s3_boto3 import S3ObjectStore as Boto3ObjectStore

        objectstore_class = Boto3ObjectStore
    elif store in ["s3", "aws_s3"]:
        from .s3 import S3ObjectStore

@@ -1672,18 +1665,27 @@ def persist_extra_files(
    if not extra_files_path_name:
        extra_files_path_name = primary_data.dataset.extra_files_path_name_from(object_store)
    assert extra_files_path_name
    for root, _dirs, files in safe_walk(src_extra_files_path):
        extra_dir = os.path.join(extra_files_path_name, os.path.relpath(root, src_extra_files_path))
        extra_dir = os.path.normpath(extra_dir)
        for f in files:
            if not in_directory(f, src_extra_files_path):
                # Unclear if this can ever happen if we use safe_walk ... probably not ?
                raise MalformedContents(f"Invalid dataset path: {f}")
            object_store.update_from_file(
                primary_data.dataset,
                extra_dir=extra_dir,
                alt_name=f,
                file_name=os.path.join(root, f),
                create=True,
                preserve_symlinks=True,
            )
    persist_extra_files_for_dataset(object_store, src_extra_files_path, primary_data.dataset, extra_files_path_name)


def persist_extra_files_for_dataset(
    object_store: ObjectStore,
    src_extra_files_path: str,
    dataset: "Dataset",
    extra_files_path_name: str,
):
    for root, _dirs, files in safe_walk(src_extra_files_path):
        extra_dir = os.path.join(extra_files_path_name, os.path.relpath(root, src_extra_files_path))
        extra_dir = os.path.normpath(extra_dir)
        for f in files:
            if not in_directory(f, src_extra_files_path):
                # Unclear if this can ever happen if we use safe_walk ... probably not ?
                raise MalformedContents(f"Invalid dataset path: {f}")
            object_store.update_from_file(
                dataset,
                extra_dir=extra_dir,
                alt_name=f,
                file_name=os.path.join(root, f),
                create=True,
                preserve_symlinks=True,
            )
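A brief usage sketch for the extracted helper (hypothetical caller; the object_store, dataset, and path values are placeholders, not part of this PR): because persist_extra_files_for_dataset accepts a bare Dataset, callers that do not have a DatasetInstance can still persist an on-disk extra-files directory.

# Hypothetical caller sketch; values and naming are placeholders.
from galaxy.objectstore import persist_extra_files_for_dataset

def store_extra_files(object_store, dataset, working_extra_files_dir: str) -> None:
    # Walk the working directory and push each file into the object store
    # alongside the given Dataset.
    persist_extra_files_for_dataset(
        object_store,
        working_extra_files_dir,
        dataset,
        extra_files_path_name=f"dataset_{dataset.uuid}_files",  # placeholder naming convention
    )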