Empower users to bring their own storage and file sources #18127
This is absolutely amazing and magic 🪄
This feature is amazing! Thank you, @jmchilton! I attempted to set up and test it locally. The overall functionality is great, but some aspects must be refined and improved. I will document these in the #18128 issue.
This is utterly amazing! I'm pretty sure users will find this extremely useful.
I tested it a bit and the only things I could find were the following:
- Removing file source instances does not remove it from the UI. (You just fixed it while I was reviewing! Great!)
- If you don't provide a `description` for a file source instance, it will fail validation when opening the list of file sources. This is not a bug in this PR, but rather in the schema definition of `FilesSourcePlugin`, as the `doc` field should be optional.

    --- a/lib/galaxy/schema/remote_files.py
    +++ b/lib/galaxy/schema/remote_files.py
    @@ -55,8 +55,8 @@ class FilesSourcePlugin(Model):
             description="The display label for this plugin.",
             examples=["Library Import Directory"],
         )
    -    doc: str = Field(
    -        ...,
    +    doc: Optional[str] = Field(
    +        None,
             title="Documentation",

- I also missed the ability to set the file sources as `writeable` by the user. I guess the admin can set it up in the template and it will work, but this can be a follow-up enhancement.
In general, we can iron out some small things after we battle-test this, especially around the UX, but it seems to be working as advertised and has pretty good testing coverage 🚀
@davelopez Thanks for the kind and detailed review. I've double-checked that `doc` being null is allowed in the file source plugins themselves, and then I made the schema switch you mentioned. Thanks so much for catching that!
There's a conflict here, could you resolve that?
Add URLs where I have them in the dropdown.
Again thanks to David!
@jmchilton I just noticed this. Can you please use alembic to autogenerate migration hashes next time? Otherwise we get output like this when running the history command:
The files appear right next to each other, which makes editing them together easier... besides, there is no hard proof I falsified that hash - just speculation 😇😆. If this ever happens again I'll be good and respect the process even if I think it is arbitrary.
Overview
This PR introduces plugin frameworks that allow admins to describe templates for object stores and file sources that users can set up themselves - this effectively allows users to define custom sources for getting files into Galaxy, exporting files out of Galaxy, and storing datasets in Galaxy. This work includes extensive documentation on the framework itself and on a wide variety of tested, production-quality, ready-to-use examples for major cloud services, as well as a wide variety of examples that could be used to tailor the framework to local infrastructure.
Background
Pull Requests
This is very heavily based on #14073, #12940, #18117, and #15875.
#14073 is...
#12940 is used to store secrets in a structured way with multiple potential backends.
#18117 overhauled the object stores themselves so we have a solid foundation to build these plugins up from.
#15875 is these ideas but only applied to object stores and lacking the depth of typing, testing, and documentation.
Issues
Implements #13816.
Fixes #8790
Datasets vs Files, Object Stores vs File Sources
(This section of the PR description is ripped from the new data documentation included in this PR).
File sources in Galaxy are a sprawling concept, but essentially they provide users access to simple files (stored hierarchically in folders) that can be navigated and imported into Galaxy. Importing a "file" into Galaxy generally creates a copy of that file in a Galaxy "object store". Once these files are stored in Galaxy, they become "datasets". A Galaxy dataset is much more than a simple file - Galaxy datasets include various generic metadata, a datatype, datatype-specific metadata, and ownership and sharing rules managed by Galaxy.
Galaxy object stores (called "storage locations" in the UI) store datasets, and global (accessible to all users) object stores are configured with the `galaxy.yml` property `object_store_config_file` (or `object_store_config` for a configuration embedded right in `galaxy.yml`) that defaults to `object_store_conf.xml` or `object_store_conf.yml` if either is present in Galaxy's configuration directory. Galaxy file sources provide users access to raw files, and global file sources are configured with the `galaxy.yml` property `file_sources_config_file` (or `file_sources` for embedded configurations) that defaults to `file_sources_conf.yml` if that file is present in Galaxy's configuration directory.
Some of Galaxy's most up-to-date and complete administrator documentation can be found in configuration sample files - this is definitely the case for object stores and file sources. The relevant sample configuration files include file_sources_conf.yml.sample and object_store_conf.sample.yml.
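The default-file lookup described above can be sketched roughly as follows. This is only an illustration and not Galaxy's actual loader: the function name `resolve_config_file` is made up for this example, and checking the XML name before the YAML name is an assumption.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_config_file(
    config_dir: Path,
    explicit: Optional[str],
    default_names: Sequence[str],
) -> Optional[Path]:
    """Return the configuration file to load, or None if nothing applies.

    `explicit` stands in for a value set in galaxy.yml (for example
    object_store_config_file); `default_names` are the fallback file names
    looked up in the configuration directory.
    """
    if explicit:
        # An explicitly configured path always wins over the defaults.
        return Path(explicit)
    for name in default_names:
        candidate = config_dir / name
        if candidate.exists():
            return candidate
    return None


# Fallback names taken from the text above; their relative order is assumed.
OBJECT_STORE_DEFAULTS = ("object_store_conf.xml", "object_store_conf.yml")
FILE_SOURCES_DEFAULTS = ("file_sources_conf.yml",)
```

With an empty configuration directory both lookups return `None`, which corresponds to no global configuration file being picked up.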
File sources and object stores configured with the above files are essentially available to all users of your Galaxy instance - hence this document describes them as "global" file sources and object stores. File source configurations do allow some templating that lets a global file source be materialized differently for different users. For instance, you as an admin may set up a Dropbox file source and explicitly add custom user properties that allow that single Dropbox file source to read from a user's preferences. Since there is just one Dropbox service and most people only have a single Dropbox account, this use case can be somewhat adequately addressed by the global file source and the global user preferences file. For a use case like Amazon S3 buckets, though, a single bucket file source that is parameterized one way is more clearly inadequate. For instance, users would very likely want to attach different buckets for different projects. Additionally, the Galaxy user interface doesn't tie the user preferences to the particular file source, and so this method introduces a huge education burden on your Galaxy instance. Finally, the templating available to file sources is not available for object stores - and allowing users to describe how they would like datasets stored and to pay for their own dataset storage are important use cases.
This implements Galaxy configuration template libraries that allow the administrator to set up templates for file sources and object stores that users may instantiate as they see fit. Users can instantiate multiple instances of any template, the template concept applies to both file source and object store plugins, and the user interface is unified from the template configuration file.
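As a rough mental model of "one template, many user instances", here is a minimal sketch. Everything in it is hypothetical: the `Template` and `Instance` classes, their fields, and the bucket names are invented for illustration, and plain `str.format` is used as a stand-in for the real templating.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Template:
    """Hypothetical stand-in for an admin-defined catalog entry."""

    id: str
    version: int
    variables: List[str]           # names the user must supply
    configuration: Dict[str, str]  # values may reference variables as {name}


@dataclass
class Instance:
    """Hypothetical stand-in for one user-created instance of a template."""

    name: str
    template_id: str
    template_version: int
    variables: Dict[str, Any] = field(default_factory=dict)


def instantiate(template: Template, name: str, variables: Dict[str, Any]) -> Instance:
    """Create an instance after checking that every declared variable is set."""
    missing = [v for v in template.variables if v not in variables]
    if missing:
        raise ValueError(f"missing variables: {missing}")
    return Instance(name, template.id, template.version, dict(variables))


def resolve(template: Template, instance: Instance) -> Dict[str, str]:
    """Bind the instance's variables into the template's configuration."""
    return {k: v.format(**instance.variables) for k, v in template.configuration.items()}


# One admin template, two user instances pointing at different buckets.
s3 = Template(
    id="s3_bucket",
    version=0,
    variables=["bucket"],
    configuration={"type": "s3fs", "bucket": "{bucket}"},
)
project_a = instantiate(s3, "Project A data", {"bucket": "lab-project-a"})
project_b = instantiate(s3, "Project B data", {"bucket": "lab-project-b"})
```

Resolving `project_a` and `project_b` against the same template yields two different concrete configurations, which is the S3 "different buckets for different projects" case described above.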
Implementation
File Source and Object Store Templates
Admins can define a set of object store "templates" and a set of file source "templates" - for either type, the configured set of these templates is called the "catalog" of (either object store or file source) templates. These are currently defined in `object_store_templates.yml` and `file_source_templates.yml` in the `config` directory or can be directly embedded into Galaxy's main configuration (`galaxy.yml`) with `object_store_templates` and `file_source_templates`. The documentation includes extensive details on how to define the templates, what the syntax looks like, a description of use cases and examples of many such templates, and a set of production-ready templates that are completely generic and would feel at home in any general purpose Galaxy instance.
Very strict Pydantic models are included for the templates and for the resulting object store and file source configurations that they would yield when bound to user supplied "variables" and "secrets". The models are the source of truth for the documentation, and we generate various images from them and include them in the documentation here to be explicit about this. We are going to store configurations of object stores and file sources in the database, so the JSON blobs we define should be extremely well defined and well tested so that old blobs continue to work as the interface to object stores and file sources evolves over time. Proposed configurations for `disk`, `boto3`, and `azure_blob` object storage have been included - in addition to `aws_s3` and `generic_s3` legacy configurations that target the older object store. These expose the relevant knobs currently available in our object store configurations and should be adapted as we migrate the object store code. Proposed configurations for `posix`, `s3fs`, `ftp`, and `azure` file sources have been included as well. The documentation includes details on all these object store and file source types, syntax descriptions, and various examples where they make sense.
The templates in the catalog can be hidden, and new versions can be appended, in which case the old ones are automatically hidden - but admins should be warned that older definitions should remain for existing defined plugins. The UI has the ability to let users "upgrade" their instances of plugins as new template versions become available. The semantics are similar to tool re-running with newer versions - but much more structure is applied at every level.
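The hiding and upgrade rules for catalog versions can be sketched like this. It is a simplified illustration under stated assumptions, not the code in this PR: `CatalogEntry` and the three helper functions are hypothetical names, and treating "newest version per id, unless explicitly hidden" as the rule for what users may select is my reading of the paragraph above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical, minimal view of one template version in a catalog."""

    id: str
    version: int
    hidden: bool = False


def selectable(catalog: List[CatalogEntry]) -> List[CatalogEntry]:
    """Entries offered for new instances: newest version per id, unless hidden."""
    latest: Dict[str, CatalogEntry] = {}
    for entry in catalog:
        current = latest.get(entry.id)
        if current is None or entry.version > current.version:
            latest[entry.id] = entry
    return [e for e in latest.values() if not e.hidden]


def find(catalog: List[CatalogEntry], template_id: str, version: int) -> Optional[CatalogEntry]:
    """Existing instances look up the exact version they were created from."""
    for entry in catalog:
        if entry.id == template_id and entry.version == version:
            return entry
    return None


def upgrade_target(catalog: List[CatalogEntry], template_id: str, version: int) -> Optional[CatalogEntry]:
    """Newest version of the same template, if it is newer than `version`."""
    newer = [e for e in catalog if e.id == template_id and e.version > version]
    return max(newer, key=lambda e: e.version) if newer else None


catalog = [
    CatalogEntry("home_dir", 0),
    CatalogEntry("home_dir", 1),  # appended later; version 0 is no longer offered
    CatalogEntry("scratch", 0, hidden=True),
]
```

The sketch also shows why old definitions need to stay in the catalog: `find(catalog, "home_dir", 0)` only succeeds for an existing version-0 instance if that entry has not been deleted.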
The templates are parameterized with variables and secrets - and can include admin supplied fields and admin injected secrets that can be injected via global Vault values or environment variables.
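A minimal sketch of how the different kinds of inputs might be gathered before a template is expanded. The function name `build_context`, the `variables`/`secrets`/`environment` grouping, and the shape of the admin declarations (`type`, `name`, `vault_key`) are all assumptions made for this example, and the vault is modelled as a plain dictionary.

```python
import os
from typing import Any, Dict, Mapping


def build_context(
    user_variables: Mapping[str, Any],
    user_secrets: Mapping[str, str],
    admin_environment: Mapping[str, Mapping[str, str]],
    vault: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, Dict[str, Any]]:
    """Collect everything a template expansion would need into one context.

    `admin_environment` is a hypothetical declaration of admin-injected
    values: each entry is read either from an environment variable or from
    a global vault key.
    """
    injected: Dict[str, Any] = {}
    for name, source in admin_environment.items():
        if source["type"] == "variable":
            injected[name] = environ[source["name"]]
        elif source["type"] == "secret":
            injected[name] = vault[source["vault_key"]]
        else:
            raise ValueError(f"unknown source type for {name!r}: {source['type']!r}")
    return {
        "variables": dict(user_variables),
        "secrets": dict(user_secrets),
        "environment": injected,
    }


context = build_context(
    user_variables={"bucket": "lab-project-a"},
    user_secrets={"access_key": "user-supplied"},
    admin_environment={
        "region": {"type": "variable", "name": "EXAMPLE_REGION"},
        "shared_key": {"type": "secret", "vault_key": "example/shared_key"},
    },
    vault={"example/shared_key": "admin-supplied"},
    environ={"EXAMPLE_REGION": "us-east-1"},
)
```

Keeping user-supplied and admin-injected values in separate groups makes it obvious in a template which values the user controls and which the admin provides.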
I'm using Jinja templating as opposed to Python string templating or Mako templating. Various Galaxy plugins have used all three approaches, and I've gone back and forth on this, but Jinja seems the best fit because it preserves type information (in this implementation) - which seems to be where Galaxy is heading and dovetails well with the level of structure and typing we're using throughout this implementation (from the database to the TypeScript schema consumed by the frontend).
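To make the type-information point concrete, here is a standard-library-only illustration. It is deliberately not Jinja and not the implementation in this PR: `expand` is a hypothetical helper that mimics the idea of returning the underlying value when a whole setting is a single placeholder, while Python's `string.Template` always yields text.

```python
import re
from string import Template
from typing import Any, Mapping

# Python string templating always produces a string, so the boolean is lost.
as_text = Template("$public").substitute(public=False)

_WHOLE_PLACEHOLDER = re.compile(r"^\{\{\s*(\w+)\s*\}\}$")
_ANY_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def expand(value: str, variables: Mapping[str, Any]) -> Any:
    """Expand {{ name }} placeholders, keeping the type of whole-value ones."""
    whole = _WHOLE_PLACEHOLDER.match(value)
    if whole:
        # The setting is exactly one placeholder: hand back the raw value.
        return variables[whole.group(1)]
    # Placeholders embedded in longer text are necessarily rendered as text.
    return _ANY_PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), value)


typed_flag = expand("{{ public }}", {"public": False})
typed_port = expand("{{ port }}", {"port": 21})
url = expand("ftp://{{ host }}:{{ port }}", {"host": "ftp.example.org", "port": 21})
```

Here `as_text` is the string `"False"`, which a strict configuration model would have to coerce or reject, whereas `typed_flag` stays a real boolean. Jinja itself ships a `jinja2.nativetypes.NativeEnvironment` that behaves along these lines; this sketch does not claim that is exactly how the PR wires it up.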
The documentation contains reference details on everything injected into the Jinja environment, lots of examples for admins to develop against, and a whole section on how to manage Jinja templates with Ansible - in case admins want to attempt to template the templates (there is no recommendation they do that - but I suspect they may want to and I’ve thought through and documented how to do that).
Database Models
The templates can be used by users to create `UserObjectStore` and `UserFileSource` model instances. I've used a prototype to separate the implementation from the object store code so an object store library consumer could store these on disk or in some other persistence store, but for the purposes of the Galaxy application, instantiations of templates created by users are stored in the database in `UserObjectStore` instances and are called "object store instances" in the API. Likewise, instantiations of `UserFileSource` objects are called "file source instances". (I think not calling them User Object Stores in the API makes sense because one can easily imagine group or role implementations of these things in the database, and one would expect the API to work with all of those.)
The UserObjectStore model is:
The UserFileSource model can be found in galaxy.models and looks very similar.
Backend Plumbing
The catalog and related models are so far all decoupled from the rest of Galaxy outside the object store. The layer above that is `lib/galaxy/managers/object_store_instances.py`, which ties together the database objects, the vault, templates, and object store factory methods to implement most of the target functionality. Code for creating, updating, and upgrading user object stores from one template version to the next is all defined in this file, as well as some relevant CRUD code. Likewise, `file_source_instances.py` plays a similar role to establish the decoupling and isolation of user file sources.
API endpoints have been added to:
Alternatives
The PR write up of #14073 describes in detail how it provided several abstractions that would be needed to address limitations of the work proposed in #14073. In addition to the limitations described there, this work will be implemented with a keen eye toward implementation efficiency and will be usable with essentially any concrete object store implementation as opposed to being tightly coupled to the cloud object store. I am confident the result of this will allow admins to address a greater number of potential scenarios.
The Docs
Here are some screenshots of the docs rendered. Here is the table of contents:
This demonstrates examples and docs of particular plugins:
along with this:
Hands-On Production Examples
Open `config/file_source_templates.yml` and place this in the contents to load up production-quality examples:
Open `config/object_store_templates.yml` and place this in the contents to load up production-quality examples:
These production-quality templates are described in a bit more detail in the documentation and have all been tested (I think). Many have screenshots in the documentation added as part of this PR as well.
After updating those configuration files and restarting Galaxy, you can now go to User Preferences and see the new options for creating personal file sources and object stores.
Warning: You will need to have a distributed object store for any of this to work and nearly every example also requires a Vault (the public AWS file source is a good example that does not require a Vault).
Clicking the file source option should show the available options for file source templates. Hovering over an option will show the detailed Markdown description for that template.
The same is possible for
Some screenshots generated from Selenium test cases that exercise these examples include:
Public AWS Bucket File Source
Azure Blob File Source
FTP File Source
Azure Blob Object Store
AWS S3 Object Store
Google Storage via S3 Interop Object Store
Older Example (Detailed)
**This is the older example from the original PR, which used a custom MinIO server. It is much more contrived but is also a complete walk-through of the concepts as applied to object stores. Past @jmchilton put some time into these.**
My notes on setting up MinIO for this example - need a bucket to attach:
Next, set up a sophisticated distributed object store - I'm going to build on the MSI example I used for #14073.
Next, add an object_store_templates.yml file to `config/`:
This sets up three templates users can create object stores from in the UI.
The first two just allow the user to set up folders under shared project directories. This example makes sense when you really trust your users and you've got a variety of disk options mounted on Galaxy servers with different properties.
The third template allows the user to attach buckets from the MinIO server we set up - using access keys, bucket names, and secret keys we've communicated to the user in some way.
The User Preferences menu now has a "Manage Your Object Stores" option:
Clicking "Create" will show the templates the user can create object stores from:
Let's build one of each of these:
As they are built we see them in the index:
Object store badges communicate information about the object store, its properties, and free Markdown populated by the admin for different object stores:
Some information about the type of object store is displayed also:
When you edit the object stores, regular settings (metadata and admin defined variables) are presented in a different way than secrets stored in Galaxy's vault:
Workflows, tools, histories will all now allow these object stores to be selected as the "preferred" object store. The user can also select this as their preference for all analyses:
These two new user-bound object stores are now available right alongside the admin defined ones in object_store_conf.xml.
Here I've set the history default and ran a job, and we can see the result is in MinIO because the path is the path to the object store cache:
Looking in the object store management window:
We can see the file that was created.
How to test the changes?
(Select all options that apply)
License