Pluggable URI handling across upload components. #9888
Tried again today and it's looking great! Didn't need to specify the ftp options, and the posix path option worked very nicely. Very intuitive to use too. Some minor issues I ran into:
Some enhancements:
All in all, this is a massive improvement to usability, and should be a major reason to rush 20.09 out :-) (I confess to having a vested interest in this - this will allow webdav handling to be much better in general over the tool that I've been working on)
I'm pulling this out of WIP because I think what is here represents an atomic first pass despite being sprawling at this point. But I would still like to address these issues. @nuwang - I put a lot of the GUI issues you mentioned over into #9942. I think I agree with most of the requests, but I think they are iteration 2 sorts of things that it would be easier to address or delegate after this is merged.
So this can be done using the rule builder now - this isn't in the screenshots above yet but you can select a whole directory and load it in to be worked on using the rule builder to parse out metadata. But it would be both good to have a way to select all the contents and to start with a directory and use something simpler than the rule builder to pull out the metadata. I did mention this on the linked issue.
Obviously totally agree and I'm definitely on board - this would be a game changer. I've created an issue here #9948.
I don't know how we would do that - but it sounds fun. Want to create an issue describing that in more detail? I can't really imagine it. I will also keep working on the webdav that isn't tied to a root URL. My first task is to determine if the problem is with the framework or the plugin.
@jmchilton this is awesome! How do you envision saving the user credentials? Storing them in a safe way would help all kinds of data exporters. The ones we should focus on, imho, are the Zenodo exporter and the SRA/ENA ones.
@bgruening This plugin approach has many applications that don't involve needing to capture and store secrets from users but I'm sure that is the right question. I haven't done any research so I wouldn't want to be the one to pick a best practice right now but I think this approach could work with many different sources of private information. The example above uses user preferences - which could be made slightly more secure with #9876 but I think we should research and invest in an external secret manager ideally. Storing this stuff unencrypted in our dataset isn't something we want to do long term. That said - I do think user preferences and such are a big step forward from the tool framework which we know people are using currently so I don't hate it as a medium term thing.
This is on the roadmap for the Custos project and will be integrated into Galaxy once the service there becomes available.
Fully agree! I was just wondering if you have a master plan already for secrets :)
@afgane very nice, thanks for sharing! Is there more information about this and the plans you have?
I believe this is as far as implementation on that topic got: apache/airavata-custos#68. The idea is to use Vault behind the scenes and provide an API to science gateways to consume. There's a team meeting scheduled tomorrow at 11am ET where I'll bring up this secrets service. Anyone interested is welcome to join (https://iu.zoom.us/j/788176034).
@jmchilton sorry, this needs a rebase.
*Overview*

This work defines an interface for interacting with "filesystem"-like entities during "upload". In addition to being a pluggable framework adding important new capabilities to Galaxy, this is a generalization and formalization of existing file sources (e.g. the directories described by `library_import_dir`, `user_library_import_dir`, and `ftp_upload_dir`).

*Plugin Infrastructure*

This introduces a new plugin type, `FilesSource`, to represent sources of directories and files during "upload". A `FilesSource` plugin should be able to index directories and download (called 'realize' to be generic) files to local posix directories. Indexing is used by the remote_files API to provide the client with hierarchies to navigate and to build URIs for the files. The 'realize' operation is used by the 'upload1' and '__DATA_FETCH__' tools during upload to bring the files into Galaxy as datasets.

An instance of the `ConfiguredFileSources` class is responsible for managing individual instances of `FilesSource` plugins. It has methods to map URIs to the appropriate plugin instance. The `ConfiguredFileSources` class tracks the loaded plugins and reuses the go-to `galaxy.util.plugin_config` module for loading YAML (or XML) definitions of plugins (the same way dependency resolvers, job metrics, auth backends, etc. do).

A `ConfiguredFileSources` object can serialize itself to a file and be re-materialized during job execution, which allows this abstraction to be used during uploads. When operating within the Galaxy app, the `ConfiguredFileSources` uses an adapter pattern to parse user-level information from Galaxy's `trans` object. During serialization, the `ConfiguredFileSources` object is expected to encode all the required information about the user into the output JSON description of the file sources. This is because the web transaction won't be available remotely during the upload job.
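To make the moving parts above concrete, here is a minimal sketch of the index/realize interface, the URI-to-plugin mapping, and the JSON round trip. It is an illustration under stated assumptions, not the code in this PR: the method names (`owns`, `to_relative_path`, `realize_to`, `to_json`, `from_json`), the `PLUGIN_TYPES` registry, and the entry dictionaries are all hypothetical, and the posix example deliberately leaves out the symlink checks a real plugin needs.

```python
import json
import os
import shutil
from abc import ABC, abstractmethod


class FilesSource(ABC):
    """Illustrative interface only; names and signatures are assumptions."""

    def __init__(self, id, scheme="gxfiles"):
        self.id = id
        self.scheme = scheme

    @property
    def uri_root(self):
        return f"{self.scheme}://{self.id}"

    def owns(self, uri):
        # Match the root exactly or as a path prefix, so "lab" never claims "lab2".
        return uri == self.uri_root or uri.startswith(self.uri_root + "/")

    def to_relative_path(self, uri):
        return uri[len(self.uri_root):].lstrip("/")

    @abstractmethod
    def list(self, path=""):
        """Index a directory, returning entries that carry their own URI."""

    @abstractmethod
    def realize_to(self, path, native_path):
        """Download ('realize') a file to a local posix path."""

    @abstractmethod
    def to_dict(self):
        """Describe this source as plain JSON-serializable data."""


class PosixFilesSource(FilesSource):
    # Simplified: omits the symlink and allowlist checks a real plugin needs.
    def __init__(self, id, root, scheme="gxfiles"):
        super().__init__(id, scheme)
        self.root = root

    def list(self, path=""):
        directory = os.path.join(self.root, path)
        entries = []
        for name in sorted(os.listdir(directory)):
            rel = f"{path}/{name}" if path else name
            is_dir = os.path.isdir(os.path.join(directory, name))
            entries.append({
                "class": "Directory" if is_dir else "File",
                "name": name,
                "uri": f"{self.uri_root}/{rel}",
            })
        return entries

    def realize_to(self, path, native_path):
        shutil.copyfile(os.path.join(self.root, path), native_path)

    def to_dict(self):
        return {"type": "posix", "id": self.id, "root": self.root, "scheme": self.scheme}


# Hypothetical registry mapping the "type" key to a plugin class.
PLUGIN_TYPES = {"posix": PosixFilesSource}


class ConfiguredFileSources:
    def __init__(self, sources):
        self.sources = list(sources)

    def find(self, uri):
        for source in self.sources:
            if source.owns(uri):
                return source
        raise KeyError(f"no file source configured for {uri}")

    def realize(self, uri, native_path):
        source = self.find(uri)
        source.realize_to(source.to_relative_path(uri), native_path)

    def to_json(self):
        # Everything the remote job needs must be in this description.
        return json.dumps([source.to_dict() for source in self.sources])

    @classmethod
    def from_json(cls, text):
        sources = []
        for description in json.loads(text):
            description = dict(description)
            plugin_class = PLUGIN_TYPES[description.pop("type")]
            sources.append(plugin_class(**description))
        return cls(sources)
```

In this sketch the Galaxy process would call `to_json()` when it sets up the upload job, and the job would call `from_json()` and then `realize()` without ever seeing the web transaction.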
That `ConfiguredFileSources` works so differently in the Galaxy process and in the remote job is mildly jarring - so unit tests have been written to ensure this all functions properly.

*Plugin Implementations*

The `FilesSource` interface has a helper implementation base class `BaseFilesSource` that provides some assistance for plugin development. Additionally, the base class `PyFilesystem2FilesSource` extends `BaseFilesSource` but assumes a PyFilesystem2 implementation exists to target the file source of interest - so the plugin author need only provide a PyFilesystem2 `FS` object describing the target.

This commit includes three concrete implementations - posix, webdav, and dropbox. `posix` extends `BaseFilesSource` while the others are light-weight extensions of `PyFilesystem2FilesSource`.

**posix**

While one could imagine a very lightweight implementation based on `PyFilesystem2FilesSource`, this fully worked-through plugin is implemented directly to ensure we respect Galaxy's strong security checks on paths containing symlinks and preserve the semantics of `user_library_import_symlink_allowlist`.

**webdav**

Galaxy tools for integrating OwnCloud exist - see https://github.com/shiltemann/Galaxy-Owncloud-Integration. Part of the driver for this work was extending that idea to provide a more integrated UX for uploading that data. So this work includes a WebDAV plugin (and associated test cases) that could potentially target OwnCloud.

This plugin was a good exercise in fleshing out and testing the PyFilesystem2 interface, but the PyFilesystem2 WebDAV implementation seems a bit fragile... we might want to replace it with more direct APIs but we can take a wait-and-see approach.
The config YAML for a webdav plugin that lets users target their own OwnCloud servers, configured via user preferences, might look something like:

```
- type: webdav
  id: owncloud1
  label: OwnCloud
  doc: User-configured OwnCloud files
  url: ${user.preferences['owncloud|url']}
  login: ${user.preferences['owncloud|username']}
  password: ${user.preferences['webdav|password']}
```

The configuration would provide a user's OwnCloud files at `gxfiles://owncloud1/`.

If instead a big centralized WebDAV server is made available with public data for all users (mirroring use cases of `library_import_dir`), a simpler configuration not requiring user preferences might be something like:

```
- type: webdav
  id: lab
  label: Lab WebDAV server
  doc: Our lab's research files managed at ourlab.org.
  url: http://ourlab.org:7083
  login: ${environ.get('WEBDAV_LOGIN')}
  password: ${environ.get('WEBDAV_PASSWORD')}
```

The configuration would provide these WebDAV files at `gxfiles://lab/`.

These two examples demonstrate that basic templating is allowed inside the YAML configuration. These are Cheetah templates exposing very specific views of the 'user', 'config', and the whole 'environ' available to the Galaxy server.

**dropbox**

The Dropbox PyFilesystem2 plugin is even easier to configure. All that is needed is a Dropbox access token (this can be configured from the settings menu and may be isolated to an app-specific folder for added security on the user's part).

An example of such a plugin might be:

```
- type: dropbox
  id: dropbox1
  label: Dropbox Files
  doc: Your Dropbox files - configure an access token via the user preferences
  accessToken: ${user.preferences['dropbox|access_token']}
```

The configuration would provide a user's Dropbox files at `gxfiles://dropbox1/`.

**gxftp**

This is an automatically populated plugin (if `ftp_upload_dir` is configured in Galaxy) that provides the user's FTP files at `gxftp://`.
**gximport**

This is an automatically populated plugin (if `library_import_dir` is configured in Galaxy) that provides Galaxy's library import files at `gximport://`.

**gxuserimport**

This is an automatically populated plugin (if `user_library_import_dir` is configured in Galaxy) that provides the requesting user's library import files at `gxuserimport://`.

*Why not a tool?*

One could imagine a tool - but the upload dialog has many advanced options for selecting how to ingest files (convert tabs and newlines, select format vs. detect, select dbkey, organize into collections, organize via rules, etc.). It would be next to impossible to provide all these same options via a normal tool and the user experience would be very different from using the upload components in Galaxy - which have been optimized and designed for this task.

That said - one future direction I would like to take this is to be able to mark plugins as writable and implement a new tool form input type "export_directory" or something like that. This could then be used to write data export tools, including generalizations of the cloud send tool.

*`ObjectStore` vs `FilesSource`*

ObjectStores provide datasets, not files; the files are organized logically in a very flat way around a dataset. `FilesSource`s instead provide files and directories, not datasets. A `FilesSource` is meant to be browsed in hierarchical fashion - and also has no concept of extra files, etc.

*Future Work*

- This is hopefully going to serve as the basis of a first pass at Terra integration with Galaxy using the FISS lib. Having an implementation based on `PyFilesystem2` means we could potentially integrate support for S3, Basespace, Google Drive, OneDrive, etc.
- Tool form support for selecting files for import and directories for export.
- Allow writing collection archives, history export, etc. to the `FilesSource` - I think this could really enhance the UI around getting big stuff out of Galaxy.

Rebase into galaxy.files...
- Extend FileDialog to allow selection of directories.
- Add new rule source that is a remote directory, pre-loaded into the rule builder with URL column assigned from the metadata.
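As a rough illustration of how the templated values in the webdav and dropbox configurations could be expanded into a plain description, here is a small stand-in. Galaxy uses Cheetah for this; the sketch below does not, and instead evaluates each `${...}` expression with Python's `eval` against a namespace holding only `user` and `environ`. The function names `resolve` and `resolve_plugin` are hypothetical, and removing the builtins is a convenience for the example, not a security boundary.

```python
import re
from types import SimpleNamespace

# Matches ${...} placeholders such as ${environ.get('WEBDAV_LOGIN')}.
_TEMPLATE = re.compile(r"\$\{(.+?)\}")


def resolve(value, user_preferences, environ):
    """Expand ${...} expressions in a single configuration value."""
    if not isinstance(value, str):
        return value
    namespace = {
        "user": SimpleNamespace(preferences=user_preferences),
        "environ": environ,
    }

    def expand(match):
        # eval() is a stand-in for Cheetah; only `user` and `environ` are exposed.
        return str(eval(match.group(1), {"__builtins__": {}}, namespace))

    return _TEMPLATE.sub(expand, value)


def resolve_plugin(description, user_preferences, environ):
    """Return a copy of a plugin description with every value expanded."""
    return {
        key: resolve(value, user_preferences, environ)
        for key, value in description.items()
    }


if __name__ == "__main__":
    lab = {
        "type": "webdav",
        "id": "lab",
        "url": "http://ourlab.org:7083",
        "login": "${environ.get('WEBDAV_LOGIN')}",
        "password": "${environ.get('WEBDAV_PASSWORD')}",
    }
    fake_environ = {"WEBDAV_LOGIN": "alice", "WEBDAV_PASSWORD": "example"}
    print(resolve_plugin(lab, {}, fake_environ))
```

A fully expanded description like this one is the kind of thing that would need to be written into the serialized JSON, since the user's preferences are not reachable from the remote upload job.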
Ah, awesome sauce! This is amazing! Hopefully, we will see many plugins in the future.
Screenshots
Here is my Dropbox where I created an access token tied to an App folder called galaxytest.
Here is the slightly modified upload dialog box - it now says select remote files instead of select FTP files.
I will readily admit the new UI is more utilitarian (in the worst way) than the previous FTP popover code - so I kept all that code in place - we need to figure out how to best present these new selection options, it may not be this dialog.
My Dropbox files!
The times aren't there because they aren't exposed in the API we are using, but if I navigate a WebDAV source or the existing Galaxy directories such as my FTP directory (pictured below) these times are available: