Pluggable URI handling across upload components.

*Overview*

This work defines an interface for interacting with "filesystem"-like entities during "upload". In addition to being a pluggable framework that adds important new capabilities to Galaxy, this is a generalization and formalization of existing file sources (e.g. the directories described by `library_import_dir`, `user_library_import_dir`, and `ftp_upload_dir`).

*Plugin Infrastructure*

This introduces a new plugin type, `FilesSource`, to represent sources of directories and files during "upload". A `FilesSource` plugin should be able to index directories and download (called 'realize' to be generic) files to local posix directories. Indexing is used by the remote_files API to provide the client with hierarchies to navigate and to build URIs for the files. The 'realize' operation is used by the 'upload1' and '__DATA_FETCH__' tools during upload to bring the files into Galaxy as datasets.

An instance of the `ConfiguredFileSources` class is responsible for managing individual instances of `FilesSource` plugins. It has methods to map URIs to the appropriate plugin instance. The `ConfiguredFileSources` class tracks the loaded plugins and reuses the go-to `galaxy.util.plugin_config` module for loading YAML (or XML) definitions of plugins (the same module that dependency resolvers, job metrics, auth backends, etc. use).

A `ConfiguredFileSources` object can serialize itself to a file and be re-materialized from it during job execution, which allows this abstraction to be used during uploads. When operating within the Galaxy app, the `ConfiguredFileSources` uses an adapter pattern to parse user-level information from Galaxy's `trans` object. During serialization, the `ConfiguredFileSources` object is expected to encode all the required information about the user into the output JSON description of the file sources. This is because the web transaction won't be available remotely during the upload job.
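
As a rough illustration of the index / realize / serialize flow described above, here is a minimal Python sketch using only the standard library. The class names `PosixSketchSource` and `ConfiguredSourcesSketch`, their method names, and the JSON layout are hypothetical stand-ins chosen for this example, not Galaxy's actual `FilesSource` or `ConfiguredFileSources` API. It also skips the symlink and path security checks a real plugin needs.

```python
import json
import os
import shutil
import tempfile
from urllib.parse import urlparse


class PosixSketchSource:
    """Hypothetical stand-in for a FilesSource plugin backed by a local directory."""

    def __init__(self, source_id, root, user=None):
        self.id = source_id
        self.root = root
        self.user = user  # user details captured up front; no `trans` in a job

    def list(self, path="/"):
        """Index one directory, returning entries with URIs the client can reuse."""
        rel_dir = path.strip("/")
        entries = []
        for name in sorted(os.listdir(os.path.join(self.root, rel_dir))):
            full = os.path.join(self.root, rel_dir, name)
            rel = "/".join(part for part in (rel_dir, name) if part)
            entries.append({
                "class": "Directory" if os.path.isdir(full) else "File",
                "name": name,
                "uri": f"gxfiles://{self.id}/{rel}",
            })
        return entries

    def realize_to(self, source_path, native_path):
        """'Realize' (download) one file to a local posix path."""
        # NOTE: no symlink or path-escape checks here; this is only a sketch.
        shutil.copyfile(os.path.join(self.root, source_path.strip("/")), native_path)

    def to_dict(self):
        return {"type": "posix", "id": self.id, "root": self.root, "user": self.user}


class ConfiguredSourcesSketch:
    """Hypothetical manager that maps URIs to plugin instances."""

    def __init__(self, sources):
        self._by_id = {source.id: source for source in sources}

    def find(self, uri):
        """Return (plugin, path inside the plugin) for a gxfiles:// URI."""
        parsed = urlparse(uri)
        if parsed.scheme != "gxfiles" or parsed.netloc not in self._by_id:
            raise ValueError(f"no file source configured for {uri!r}")
        return self._by_id[parsed.netloc], parsed.path

    def list(self, uri):
        source, path = self.find(uri)
        return source.list(path)

    def realize_to(self, uri, native_path):
        source, path = self.find(uri)
        source.realize_to(path, native_path)

    def to_json(self):
        """Serialize everything the remote job needs, including user details."""
        return json.dumps([source.to_dict() for source in self._by_id.values()])

    @classmethod
    def from_json(cls, text):
        """Re-materialize on the job side, where no web transaction exists."""
        return cls([PosixSketchSource(d["id"], d["root"], d["user"]) for d in json.loads(text)])


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "root")
        os.makedirs(os.path.join(root, "data"))
        with open(os.path.join(root, "data", "a.txt"), "w") as handle:
            handle.write("hello")

        # "App side": index a directory to build URIs for the client.
        app_side = ConfiguredSourcesSketch(
            [PosixSketchSource("lab", root, user={"username": "alice"})]
        )
        uris = [entry["uri"] for entry in app_side.list("gxfiles://lab/data")]

        # "Job side": rebuild from JSON alone, then realize the chosen file.
        job_side = ConfiguredSourcesSketch.from_json(app_side.to_json())
        target = os.path.join(tmp, "out.txt")
        job_side.realize_to(uris[0], target)
        source, _ = job_side.find(uris[0])
        with open(target) as handle:
            return uris, handle.read(), source.user


if __name__ == "__main__":
    print(demo())
```

The point of the round trip in `demo()` is that the job-side object is built from the JSON text only, so anything user-specific has to be written into that JSON before the job starts.
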
That `ConfiguredFileSources` objects work so differently in the Galaxy process and in the remote job is mildly jarring, so unit tests have been written to ensure this all functions properly.

*Plugin Implementations*

The `FilesSource` interface has a helper implementation base class, `BaseFilesSource`, that provides some assistance for plugin development. Additionally, the base class `PyFilesystem2FilesSource` extends `BaseFilesSource` but assumes a PyFilesystem2 implementation exists for the file source of interest, so the plugin author need only provide a PyFilesystem `FS` object describing the target.

This commit includes three concrete implementations: posix, webdav, and dropbox. `posix` extends `BaseFilesSource` while the others are lightweight extensions of `PyFilesystem2FilesSource`.

**posix**

While one could imagine a very lightweight implementation based on `PyFilesystem2FilesSource`, this fully worked-through plugin is implemented directly to ensure we respect Galaxy's strong security checks on paths containing symlinks and preserve the semantics of `user_library_import_symlink_allowlist`.

**webdav**

Galaxy tools for integrating OwnCloud exist, see https://github.com/shiltemann/Galaxy-Owncloud-Integration. Part of the driver for this work was extending that idea to provide a more integrated UX for uploading that data. So this work includes a WebDAV plugin (and associated test cases) that could potentially target OwnCloud.

This plugin was a good exercise in fleshing out and testing the PyFilesystem2 interface, but the PyFilesystem2 WebDAV implementation seems a bit fragile. We might want to replace it with more direct APIs, but we can take a wait-and-see approach.

The config YAML for a webdav plugin that lets users target their own OwnCloud servers, configured via user preferences, might look something like:

```
- type: webdav
  id: owncloud1
  label: OwnCloud
  doc: User-configured OwnCloud files
  url: ${user.preferences['owncloud|url']}
  login: ${user.preferences['owncloud|username']}
  password: ${user.preferences['owncloud|password']}
```

The configuration would provide a user's OwnCloud files at `gxfiles://owncloud1/`.

If instead a big centralized WebDAV server is made available with public data for all users (mirroring use cases of `library_import_dir`), a simpler configuration not requiring user preferences might be something like:

```
- type: webdav
  id: lab
  label: Lab WebDAV server
  doc: Our lab's research files managed at ourlab.org.
  url: http://ourlab.org:7083
  login: ${environ.get('WEBDAV_LOGIN')}
  password: ${environ.get('WEBDAV_PASSWORD')}
```

The configuration would provide these WebDAV files at `gxfiles://lab/`.

These two examples demonstrate that basic templating is allowed inside the YAML configuration. These are Cheetah templates exposing very specific views of the 'user', 'config', and the whole 'environ' available to the Galaxy server.

**dropbox**

The Dropbox PyFilesystem2 plugin is even easier to configure; all that is needed is a Dropbox access token (this can be configured from the settings menu and may be isolated to an app-specific folder for added security on the user's part). An example of such a plugin might be:

```
- type: dropbox
  id: dropbox1
  label: Dropbox Files
  doc: Your Dropbox files - configure an access token via the user preferences
  accessToken: ${user.preferences['dropbox|access_token']}
```

The configuration would provide a user's Dropbox files at `gxfiles://dropbox1/`.

**gxftp**

This is an automatically populated plugin (if `ftp_upload_dir` is configured in Galaxy) that provides the user's FTP files at `gxftp://`.

**gximport**

This is an automatically populated plugin (if `library_import_dir` is configured in Galaxy) that provides Galaxy's library import files at `gximport://`.

**gxuserimport**

This is an automatically populated plugin (if `user_library_import_dir` is configured in Galaxy) that provides the requesting user's library import files at `gximportfiles://`.

*Why not a tool?*

One could imagine a tool, but the upload dialog has many advanced options for selecting how to ingest files (convert tabs and newlines, select format vs. detect, select dbkey, organize into collections, organize via rules, etc.). It would be next to impossible to provide all these same options via a normal tool, and the user experience would be very different from using the upload components in Galaxy, which have been optimized and designed for this task.

That said, one future direction I would like to take this is to be able to mark plugins as writable and implement a new tool form input type, "export_directory" or something like that. This could then be used to write data export tools, including generalizations of the cloud send tool.

*`ObjectStore` vs `FilesSource`*

ObjectStores provide datasets, not files; the files are organized logically in a very flat way around a dataset. `FilesSource`s instead provide files and directories, not datasets. A `FilesSource` is meant to be browsed in a hierarchical fashion and has no concept of extra files, etc.

*Future Work*

- This is hopefully going to serve as the basis of a first pass at Terra integration with Galaxy using the FISS lib. Having an implementation based on PyFilesystem2 means we could potentially integrate support for S3, Basespace, Google Drive, OneDrive, etc.
- Tool form support for selecting files for import and directories for export.
- Allow writing collection archives, history export, etc. to the `FilesSource`. I think this could really enhance the UI around getting big stuff out of Galaxy.

Rebase into galaxy.files...