Improved Upload Provenance and Correctness #12912
Merged
The Problem and Why It Is a Priority
This is a series of atomic commits aimed at improving upload transparency and correctness ahead of the larger bulk of my deferred data work.
When people describe deferring evaluation of URLs into datasets as seemingly simple, one of the many things that makes me uncomfortable about that assertion is that URLs do not map cleanly to datasets. Galaxy's upload configuration options, datatype configuration, and user-selected options can all affect how a URL is turned into literal bytes on disk.
So if Galaxy does not know what it did with a URI to produce a dataset, and that dataset is exported in a deferred fashion and re-imported on another Galaxy instance, how is that Galaxy supposed to know what to do with the URI?
As is frequently the case with these things, I think the fact that Galaxy doesn't track and expose this data has a user-facing cost in terms of transparency as well. If a user uploads a file and the spaces are converted to tabs, the newlines are transformed, or the BAM content is sorted, we don't expose any of that to the user. The user may believe we are operating directly on the file supplied. I think this is antithetical to our mission.
The Implemented Solution
Back in #7487 I started work on these ideas and added a DatasetSource table with a transform JSON column. The idea was to capture this provenance during upload of URIs and store it there. This PR finally implements that idea and utilizes that field.
The transform column is a JSON-typed column that now stores the list of actions that must be applied to the source URI to produce the dataset available in Galaxy. The actions are not just the upload parameters: things like newline conversion are tracked, and they are only recorded if they in fact modify the file contents. This, I think, is much more useful information to supply to users.
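To make the idea concrete, here is a minimal sketch of how such a recorded action list could be re-applied to source bytes. The action names (to_posix_lines, spaces_to_tabs) mirror Galaxy's upload options, but the exact JSON schema stored in the transform column and the apply_transforms helper are illustrative assumptions, not Galaxy's real code:

```python
import json
import re

def apply_transforms(content: bytes, transform: list) -> bytes:
    """Re-apply recorded upload transformations, in order, to raw source bytes.

    Sketch only: the action schema here is assumed, not Galaxy's real one.
    """
    for action in transform:
        name = action["action"]
        if name == "to_posix_lines":
            # normalize Windows/Mac line endings to POSIX newlines
            content = content.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
        elif name == "spaces_to_tabs":
            # collapse runs of spaces/tabs into single tabs
            content = re.sub(rb"[ \t]+", b"\t", content)
        else:
            raise ValueError(f"unrecognized transform action: {name}")
    return content

# A stored transform column value might look like:
recorded = json.loads('[{"action": "to_posix_lines"}, {"action": "spaces_to_tabs"}]')
print(apply_transforms(b"a b\r\nc  d\r\n", recorded))  # b'a\tb\nc\td\n'
```

The point of recording only actions that actually modified the bytes is that replaying exactly this list, in order, reproduces the dataset from the source URI.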
In addition to tracking this in the database, this PR also provides UI elements on the "Dataset Information" page to display both the source URI and the list of transformations applied to the data. For externally available URIs this component includes a link, and for all URIs (including File Source URIs) the page includes a button to copy the URI along with the list of transformations applied to the data.
The following screenshots demonstrate this component and were generated using the included Selenium tests:
In addition to these broad strokes around tracking dataset transformations, additional provenance is now displayed as part of dataset information - including components for the created_from_basename field and information about attached DatasetHashes.
Smaller code cleanups, refactoring, and added types for upload-related code are included from the deferred data branch as well. This PR also contains an important upload bugfix: the spaces_to_tabs upload configuration parameter was ignored unless to_posix_lines was also enabled.
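The shape of that bug is easy to illustrate. This is a hedged sketch rather than Galaxy's actual code, but it shows how nesting the spaces_to_tabs conversion inside the to_posix_lines branch silently skips it whenever to_posix_lines is off:

```python
import re

def convert_buggy(text: str, to_posix_lines: bool, spaces_to_tabs: bool) -> str:
    if to_posix_lines:
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        if spaces_to_tabs:  # bug: unreachable when to_posix_lines is False
            text = re.sub(r"[ \t]+", "\t", text)
    return text

def convert_fixed(text: str, to_posix_lines: bool, spaces_to_tabs: bool) -> str:
    if to_posix_lines:
        text = text.replace("\r\n", "\n").replace("\r", "\n")
    if spaces_to_tabs:  # fix: each option applies independently
        text = re.sub(r"[ \t]+", "\t", text)
    return text

print(convert_buggy("a b", False, True))  # 'a b'  (option silently ignored)
print(convert_fixed("a b", False, True))  # 'a\tb' (option now honored)
```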
Downstream Context - Deferred Data
This PR defines and tracks these transformations and makes them useful by exposing them to the user, providing greater provenance and traceability. In my downstream work on deferred data, this tracking is also functionally utilized by Galaxy. Model store exports (histories, datasets, invocations, etc.) that are exported without including the dataset files will still include dataset sources and the transformations applied.
When those datasets are imported, they will be put in a new "deferred" state since Galaxy knows how to fetch them and knows what transformations need to be applied to the data to recreate it in the form it was used by the original Galaxy they were exported from.
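The materialization flow described above can be sketched roughly as follows. This is a hypothetical illustration; the real implementation lives in the linked deferred-data PR, and the fetch callback, function name, and transform schema here are all assumptions:

```python
from typing import Callable, Dict, List

def materialize(fetch: Callable[[str], bytes], source_uri: str,
                transform: List[Dict]) -> bytes:
    """Recreate a deferred dataset: fetch the source URI, then re-apply the
    recorded transform actions in order to reproduce the original bytes."""
    content = fetch(source_uri)
    for action in transform:
        if action["action"] == "to_posix_lines":
            content = content.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
        # ...other recorded actions would be handled here...
    return content

# Usage with a stub fetcher standing in for a real URI download:
data = materialize(lambda uri: b"x\ty\r\n", "https://example.org/data.tsv",
                   [{"action": "to_posix_lines"}])
print(data)  # b'x\ty\n'
```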
The code for "materializing" deferred datasets using these actions can be found at https://github.com/galaxyproject/galaxy/pull/12533/files#diff-8640d91ef47bca302b00039012979f4b1b79f5dbffbe2431bc9a05f19fb4c7d0R149. Additionally, that branch contains multiple tests of exactly this: that datasets exported without files, re-imported, and materialized do in fact preserve their contents. These tests can be found at https://github.com/galaxyproject/galaxy/pull/12533/files#diff-df885087d5bec2a965f2f3c043ec963421c3565b3d0c9fa874bcd0f26ebe8493R146.