Improved Upload Provenance and Correctness #12912
Merged
The Problem and Why It Is a Priority
This is a series of atomic commits aimed at improving upload transparency and correctness ahead of the larger bulk of my deferred data work.
When people describe deferring evaluation of URLs into datasets as seemingly simple, one of the many things that makes me uncomfortable about that assertion is that URLs do not map cleanly to datasets. Galaxy's upload configuration options, datatype configuration, and user-selected options can all affect how a URL is turned into literal bytes on disk.
So if Galaxy does not know what it did with a URI to produce a dataset, and that dataset is exported in a deferred fashion and re-imported on another Galaxy instance, how is that Galaxy supposed to know what to do with the URI?
As is frequently the case with these things, I think the fact that Galaxy doesn't track and expose this data has a user-facing cost in terms of transparency as well. If a user uploads a file and the spaces are converted to tabs, the newlines are transformed, or the BAM content is sorted, we don't expose any of that to the user. The user may believe we are operating directly on the file supplied. I think this is antithetical to our mission.
The Implemented Solution
Back in #7487 I started work on these ideas and added a DatasetSource table with a transform JSON column. The idea was to capture this provenance during upload of URIs and store it there. This PR finally implements that idea and utilizes that field.
The transform column is a JSON-typed column that now stores the list of actions that must be applied to the source URI to produce the dataset available in Galaxy. The actions are not just the upload parameters: things like newline conversion are tracked, and they are only recorded if they in fact modify the file contents. This, I think, is much more useful information to supply to users.
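To make the idea concrete, here is a minimal sketch of how such a recorded action list could be re-applied to source bytes. The action names (to_posix_lines, spaces_to_tabs) mirror Galaxy's upload options, but the exact JSON schema stored in the transform column and the apply_transforms helper are illustrative assumptions, not Galaxy's real code:

```python
import json
import re

def apply_transforms(content: bytes, transform: list) -> bytes:
    """Re-apply recorded upload transformations, in order, to raw source bytes.

    Sketch only: the action schema here is assumed, not Galaxy's real one.
    """
    for action in transform:
        name = action["action"]
        if name == "to_posix_lines":
            # normalize Windows/Mac line endings to POSIX newlines
            content = content.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
        elif name == "spaces_to_tabs":
            # collapse runs of spaces/tabs into single tabs
            content = re.sub(rb"[ \t]+", b"\t", content)
        else:
            raise ValueError(f"unrecognized transform action: {name}")
    return content

# A stored transform column value might look like:
recorded = json.loads('[{"action": "to_posix_lines"}, {"action": "spaces_to_tabs"}]')
print(apply_transforms(b"a b\r\nc  d\r\n", recorded))  # b'a\tb\nc\td\n'
```

The point of recording only actions that actually modified the bytes is that replaying exactly this list, in order, reproduces the dataset from the source URI.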
In addition to tracking this in the database, this PR also provides UI elements on the "Dataset Information" page to display both the source URI and the list of transformations applied to the data. For externally available URIs this component includes a link, and for all URIs (including File Source URIs) the page includes a button to copy the URI along with the list of transformations applied to the data.
The following screenshots demonstrate this component and were generated using the included Selenium tests:
In addition to these broad strokes around tracking dataset transformations, additional provenance is now displayed as part of dataset information - including components for the created_from_basename field and information about attached DatasetHashes.
Smaller code cleanups, refactoring, and added types for upload-related code are included from the deferred data branch as well. This PR also contains an important upload bugfix: the spaces_to_tabs upload configuration parameter was ignored unless to_posix_lines was also enabled.
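The shape of that bug is easy to illustrate. This is a hedged sketch rather than Galaxy's actual code, but it shows how nesting the spaces_to_tabs conversion inside the to_posix_lines branch silently skips it whenever to_posix_lines is off:

```python
import re

def convert_buggy(text: str, to_posix_lines: bool, spaces_to_tabs: bool) -> str:
    if to_posix_lines:
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        if spaces_to_tabs:  # bug: unreachable when to_posix_lines is False
            text = re.sub(r"[ \t]+", "\t", text)
    return text

def convert_fixed(text: str, to_posix_lines: bool, spaces_to_tabs: bool) -> str:
    if to_posix_lines:
        text = text.replace("\r\n", "\n").replace("\r", "\n")
    if spaces_to_tabs:  # fix: each option applies independently
        text = re.sub(r"[ \t]+", "\t", text)
    return text

print(convert_buggy("a b", False, True))  # 'a b'  (option silently ignored)
print(convert_fixed("a b", False, True))  # 'a\tb' (option now honored)
```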
Downstream Context - Deferred Data
This PR defines and tracks these transformations and makes them useful by exposing them to the user, providing greater provenance and traceability. In my downstream work on deferred data, this tracking is also functionally utilized by Galaxy. Model store exports (histories, datasets, invocations, etc.) that are exported without including the dataset files will still include dataset sources and the transformations applied.
When those datasets are imported, they will be put in a new "deferred" state since Galaxy knows how to fetch them and knows what transformations need to be applied to the data to recreate it in the form it was used by the original Galaxy they were exported from.
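The materialization flow described above can be sketched roughly as follows. This is a hypothetical illustration; the real implementation lives in the linked deferred-data PR, and the fetch callback, function name, and transform schema here are all assumptions:

```python
from typing import Callable, Dict, List

def materialize(fetch: Callable[[str], bytes], source_uri: str,
                transform: List[Dict]) -> bytes:
    """Recreate a deferred dataset: fetch the source URI, then re-apply the
    recorded transform actions in order to reproduce the original bytes."""
    content = fetch(source_uri)
    for action in transform:
        if action["action"] == "to_posix_lines":
            content = content.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
        # ...other recorded actions would be handled here...
    return content

# Usage with a stub fetcher standing in for a real URI download:
data = materialize(lambda uri: b"x\ty\r\n", "https://example.org/data.tsv",
                   [{"action": "to_posix_lines"}])
print(data)  # b'x\ty\n'
```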
The code for "materializing" deferred datasets using these actions can be found at https://github.com/galaxyproject/galaxy/pull/12533/files#diff-8640d91ef47bca302b00039012979f4b1b79f5dbffbe2431bc9a05f19fb4c7d0R149. Additionally, that branch contains multiple tests of exactly this: that datasets exported without files, re-imported, and materialized do in fact preserve their contents. These tests can be found at https://github.com/galaxyproject/galaxy/pull/12533/files#diff-df885087d5bec2a965f2f3c043ec963421c3565b3d0c9fa874bcd0f26ebe8493R146.