Improved Upload Provenance and Correctness #12912

Merged: 9 commits, Nov 17, 2021

jmchilton
Member

The Problem and Why It is A Priority

This is a series of atomic commits aimed at improving upload transparency and correctness ahead of the larger bulk of my deferred data work.

When people describe deferring evaluation of URLs into datasets as seemingly simple, one of the many things that makes me uncomfortable about that assertion is that URLs do not map cleanly to datasets. Galaxy's upload configuration options, datatype configuration, and user-selected options can all affect how a URL is turned into literal bytes on disk.

Galaxy does not record what it did with a URI to produce a dataset. If that dataset is exported in a deferred fashion and re-imported on another Galaxy, how is that Galaxy supposed to know what to do with the URI?

As is frequently the case with these things, I think the fact that Galaxy doesn't track and expose this data also has a user-facing cost in terms of transparency. If a user uploads a file and the spaces are converted to tabs, the newlines are transformed, or the BAM content is sorted, we don't expose any of that to the user. The user may believe we are operating directly on the file supplied. I think this is antithetical to our mission.

The Implemented Solution

Back in #7487 I started work on these ideas and added a DatasetSource table with a transform JSON column. The idea was to capture this provenance during upload of URIs and store it there. This PR finally implements that idea and utilizes that field.

The transform column is a JSON column and will now store a list of actions that must be applied to the source URI to produce the dataset available in Galaxy. The actions are not just a copy of the upload parameters: steps like newline conversion are tracked and recorded only if they in fact modify the file contents. I think this is much more useful information to supply to users.
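As a rough illustration of the "record only if it changed the contents" rule, here is a minimal Python sketch. The function names, the action names, and the `{"action": ...}` dictionary shape are assumptions made for this example and are not taken from this PR's code.

```python
import json
import re
from typing import Callable, List, Tuple


# Hypothetical transformations; the real upload code may behave differently.
def to_posix_lines(data: bytes) -> bytes:
    """Normalize CRLF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def spaces_to_tabs(data: bytes) -> bytes:
    """Collapse each run of spaces into a single tab."""
    return re.sub(rb" +", b"\t", data)


def apply_and_record(
    data: bytes, requested: List[Tuple[str, Callable[[bytes], bytes]]]
) -> Tuple[bytes, List[dict]]:
    """Apply each requested step, recording it only if it altered the bytes."""
    transform: List[dict] = []
    for name, func in requested:
        converted = func(data)
        if converted != data:
            # Assumed record shape for the JSON transform column.
            transform.append({"action": name})
        data = converted
    return data, transform


steps = [("to_posix_lines", to_posix_lines), ("spaces_to_tabs", spaces_to_tabs)]
content, transform = apply_and_record(b"a b\r\nc d\r\n", steps)
print(json.dumps(transform))
# [{"action": "to_posix_lines"}, {"action": "spaces_to_tabs"}]
```

With input that is already tab-separated and uses LF line endings, the same call returns an empty list even though both options were requested, which is the distinction between upload parameters and recorded actions.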

In addition to tracking this in the database, this PR also provides UI elements on the "Dataset Information" page to display both the source URI and the list of transformations applied to the data. For externally available URIs this component includes a link, and for all URIs (including File Source URIs) the page includes a copy-link button for the URI and the list of transformations to apply to the data.

The following screenshots demonstrate this component and were generated using the included Selenium tests:

[Screenshot: dataset_details_source_transform_bam_grooming]
[Screenshot: dataset_details_source_transform_spaces_to_tabs]

Beyond tracking dataset transformations, additional provenance is now displayed as part of the dataset information, including components for the created_from_basename field and information about attached DatasetHashes.

Smaller code cleanups, refactoring, and added types for upload-related code are included from the deferred data branch as well. This PR also contains an important upload bugfix: the spaces_to_tabs upload configuration parameter was ignored if to_posix_lines was not also enabled.
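To make this class of bug concrete, here is a hypothetical Python sketch of how one option can end up silently depending on another. It is not the actual Galaxy upload code; the function names and the nesting shown are assumptions chosen only to illustrate the symptom described above.

```python
import re


def to_posix_lines(data: bytes) -> bytes:
    """Normalize CRLF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def spaces_to_tabs(data: bytes) -> bytes:
    """Collapse each run of spaces into a single tab."""
    return re.sub(rb" +", b"\t", data)


def convert_nested(data: bytes, posix: bool, tabs: bool) -> bytes:
    """Buggy shape: the tabs option is only consulted inside the posix branch."""
    if posix:
        data = to_posix_lines(data)
        if tabs:
            data = spaces_to_tabs(data)
    return data


def convert_independent(data: bytes, posix: bool, tabs: bool) -> bytes:
    """Fixed shape: each option is honored on its own."""
    if posix:
        data = to_posix_lines(data)
    if tabs:
        data = spaces_to_tabs(data)
    return data


# With posix=False and tabs=True, only the independent version converts spaces.
print(convert_nested(b"a b\n", posix=False, tabs=True))
print(convert_independent(b"a b\n", posix=False, tabs=True))
```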

Downstream Context - Deferred Data

This PR defines and tracks these transformations and makes them useful by exposing them to the user, providing greater provenance and traceability. In my downstream work on deferred data, this tracking is also used functionally by Galaxy. Model store exports (histories, datasets, invocations, etc.) that are exported without the dataset files will still include dataset sources and the transformations applied.

When those datasets are imported, they will be put in a new "deferred" state, since Galaxy knows how to fetch them and knows what transformations need to be applied to the data to recreate it in the form in which it was used by the Galaxy it was exported from.

The code for "materializing" deferred datasets that uses these actions can be found at https://github.com/galaxyproject/galaxy/pull/12533/files#diff-8640d91ef47bca302b00039012979f4b1b79f5dbffbe2431bc9a05f19fb4c7d0R149. Additionally, that branch contains multiple tests of exactly this: datasets exported without files, then re-imported and materialized, do in fact preserve dataset contents. These tests can be found at https://github.com/galaxyproject/galaxy/pull/12533/files#diff-df885087d5bec2a965f2f3c043ec963421c3565b3d0c9fa874bcd0f26ebe8493R146.
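For readers who do not want to follow the links, the general idea of replaying a stored transform list can be sketched as below. This is a simplified illustration under stated assumptions, not the code in the linked branch: the `materialize` function, the `ACTIONS` registry, the action names, and the `{"action": ...}` record shape are all hypothetical, and fetching the bytes from the source URI is left out.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry mapping recorded action names to byte transformations.
ACTIONS: Dict[str, Callable[[bytes], bytes]] = {
    "to_posix_lines": lambda d: d.replace(b"\r\n", b"\n").replace(b"\r", b"\n"),
    "spaces_to_tabs": lambda d: re.sub(rb" +", b"\t", d),
}


def materialize(raw: bytes, transform: List[dict]) -> bytes:
    """Replay recorded actions, in order, on bytes fetched from the source URI."""
    for step in transform:
        name = step["action"]
        if name not in ACTIONS:
            # Refuse to guess: an unknown action means the contents cannot
            # be reproduced faithfully.
            raise ValueError(f"unknown transform action: {name}")
        raw = ACTIONS[name](raw)
    return raw


recorded = [{"action": "to_posix_lines"}, {"action": "spaces_to_tabs"}]
print(materialize(b"a b\r\nc d\r\n", recorded))
# b'a\tb\nc\td\n'
```

Failing loudly on an unrecognized action is a design choice of this sketch: silently skipping a step would produce a dataset that differs from the one originally used.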

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.

License

  • I agree to license these contributions under Galaxy's current license.
  • I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].

@github-actions github-actions bot added this to the 22.01 milestone Nov 12, 2021
@jmchilton jmchilton added the highlight Included in user-facing release notes at the top label Nov 16, 2021
Member

@mvdbeek mvdbeek left a comment

This is awesome, thanks a lot @jmchilton!

@mvdbeek mvdbeek merged commit c89bcad into galaxyproject:dev Nov 17, 2021
@astrovsky01 astrovsky01 mentioned this pull request Feb 7, 2022
40 tasks
@nsoranzo nsoranzo deleted the upload_provenance_2 branch November 4, 2024 17:05