AnVIL workspace import abstraction #93

daniaki · 2021-10-12T03:36:38Z

This is a high-level proposal for generalizing the AnVIL workspace import feature to allow:

1. Support for importing data from other sources such as a GCP project (and potentially Azure in the future).
2. Customizable form flows allowing flexibility for different use cases when validating pedigree and metadata files.
3. Co-existence of multiple workspace import forms, with the current AnVIL form remaining as the default strategy.

@hanars, we'd love to hear whether (a) the proposed implementation is the right way forward and (b) this is something you'd find worth upstreaming.

Motivation

The main motivation is to save our team's time and resources: we'd like to remove as much hands-on guidance as possible when onboarding collaborators and loading their data into seqr.

Context

The onboarding process is kicked off when a collaborator submits an application expressing interest in uploading data into seqr for validation and analysis. The CPG team holds an internal meeting to discuss whether to start the collaboration. Assuming acceptance, CPG creates a project on Google Cloud to which the collaborator uploads their sample files for ingestion. This is a manual, time-consuming process that involves a lot of back and forth between the two parties to get the required pedigree and metadata files into the correct format without any issues. The goal of this feature is to automate much of the manual validation and communication that currently takes place between a collaborator and the CPG team. We would like this onboarding step to fail fast, so that problems are caught before time and resources are spent on expensive validation computations.

The data loading process we require is conceptually similar to loading data from an AnVIL workspace, but diverges in a number of key areas:

1. It can load data from other sources such as GCP projects (or other cloud providers).
2. It validates but does not create a new project. Instead, it notifies the team when a collaborator’s data is ready to be ingested.
3. It provides immediate feedback when individual IDs do not map to a collaborator’s uploaded SAM/BAM/CRAM files, or when there are errors in the pedigree and metadata files.

Point (2) addresses a need to decouple validation from the data analysis pipeline provided by the current AnVIL loading implementation, since that pipeline diverges from the one used by the CPG. For example, individual IDs in the pedigree do not always match those in the jointly called GVCF file, which prevents those samples from being uploaded into seqr. Additionally, metadata fields can be tricky to validate (e.g. some columns require very specific types and values), so having this validation directly in seqr would let us check these fields and give users immediate feedback. Because of these difficulties, the analysis pipeline is often handled manually by someone on our engineering team.
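To make the kind of check described above concrete, here is a minimal sketch of a fail-fast pedigree validator. The column names, the allowed sex codes and the function name `validate_pedigree` are all assumptions made for illustration; they are not taken from seqr or from the CPG pipeline.

```python
import csv
import io

# Hypothetical column names and allowed values, for illustration only.
REQUIRED_COLUMNS = ("family_id", "individual_id", "paternal_id",
                    "maternal_id", "sex", "affected")
ALLOWED_SEX = {"M", "F", "U"}


def validate_pedigree(text, sample_ids):
    """Return a list of human-readable errors for a tab-separated pedigree.

    text: contents of the pedigree file, with a header row.
    sample_ids: set of IDs for which sample data has been uploaded.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        # Fail fast: nothing else can be checked without the expected columns.
        return [f"Missing columns: {', '.join(missing)}"]

    rows = list(reader)
    known_ids = {row["individual_id"] for row in rows}
    errors = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        individual = row["individual_id"]
        if row["sex"] not in ALLOWED_SEX:
            errors.append(f"Line {line_no}: invalid sex {row['sex']!r} for {individual}")
        for parent_col in ("paternal_id", "maternal_id"):
            parent = row[parent_col]
            # An empty parent field means "unknown" and is accepted.
            if parent and parent not in known_ids:
                errors.append(f"Line {line_no}: {parent_col} {parent!r} is not in the pedigree")
        if individual not in sample_ids:
            errors.append(f"Line {line_no}: no sample found for individual {individual}")
    return errors
```

Returning every error at once, rather than raising on the first, would let the form show the collaborator a complete list of problems in a single pass.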

Workflow

Assuming acceptance, the workflow can be summarized by the following steps:

1. A collaborator loads sample data into a GCP bucket.
2. The collaborator logs into seqr and is asked to complete the onboarding form, which involves uploading metadata and pedigree files for validation, and specifying the mapping of individual IDs to sample files.
3. The software team is notified of successful completion of onboarding and can begin QC and validation of samples.
4. Any issues found during QC are displayed to the collaborator.

Implementation

Given the similarity to importing an AnVIL workspace, we could abstract the current implementation to support different workspace providers, form structures and submission callback configurations. The current implementation of loading data from an AnVIL workspace would be a concrete implementation of this abstraction and remain unchanged.

Proposed changes:

1. Abstract the component LoadWorkspaceDataForm with initial concrete implementations for:
   1a. AnVIL workspace.
   1b. GCP cloud storage workspace.
2. Dispatch the correct form from LoadWorkspaceData according to a workspace strategy parameter. For example, we could update the create project from workspace route to /create_project_from_workspace/:strategy/:namespace/:name where :strategy can take one of a fixed set of values such as anvil, gcp-cloud, azure-cloud, etc.
3. Implementation of a multi-step GCP cloud storage workspace import form addressing the validation and subsequent storage of the family metadata file, pedigree file, individual metadata and individual ID to sample file mapping. Please see the following sub-section for detailed requirements of this form.
4. Implementation of helper utilities for client-side validation of:
   4a. Pedigree files.
   4b. Metadata files.
   4c. Family metadata files.
   4d. Individual ID to sample mapping.
5. Implementation of Django configuration settings so that deployments can customize the default import strategy and other required information such as authorization keys.
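As a rough illustration of the strategy dispatch and configurable default described in the list above, the sketch below keeps a small registry keyed by strategy name. Everything here is hypothetical: `ImportStrategy`, `register`, `resolve`, `DEFAULT_IMPORT_STRATEGY` and the form name `GcpWorkspaceImportForm` are invented for the example, and the submit callbacks are placeholders. In the actual proposal the form dispatch would happen client side in LoadWorkspaceData; this only shows the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a Django setting chosen per deployment.
DEFAULT_IMPORT_STRATEGY = "anvil"


@dataclass(frozen=True)
class ImportStrategy:
    """Bundles what varies between workspace providers."""
    name: str                          # value of the :strategy route parameter
    form_component: str                # which form the client should render
    on_submit: Callable[[dict], str]   # provider-specific submit behaviour


_REGISTRY: Dict[str, ImportStrategy] = {}


def register(strategy: ImportStrategy) -> None:
    _REGISTRY[strategy.name] = strategy


def resolve(name: Optional[str] = None) -> ImportStrategy:
    """Look up a strategy, falling back to the configured default."""
    key = name or DEFAULT_IMPORT_STRATEGY
    if key not in _REGISTRY:
        raise ValueError(f"Unknown workspace import strategy: {key!r}")
    return _REGISTRY[key]


# The existing AnVIL flow stays the default and creates a project on submit.
register(ImportStrategy("anvil", "LoadWorkspaceDataForm",
                        lambda data: "create_project"))
# The GCP flow validates only, then notifies the team (placeholder callback).
register(ImportStrategy("gcp-cloud", "GcpWorkspaceImportForm",
                        lambda data: "notify_team"))
```

With this shape, `resolve()` returns the AnVIL strategy, `resolve("gcp-cloud")` returns the GCP one, and an unregistered name such as `azure-cloud` raises a `ValueError` until a strategy is registered for it.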

Form Requirements

1. Pedigree upload and validation, with table display and pedigree diagram for the parsed file.
2. Family/individual metadata upload and validation, with table display for the parsed file.
3. Editable individual ID to sample input file mapping and validation.
4. Accepts a configurable API client to interface with a workspace provider to facilitate:
   4a. Checking existence of SAM/BAM/CRAM files.
   4b. Uploading validated pedigree and metadata files to the collaborator's workspace.
   4c. Retrieving QC errors, software team comments, etc.
5. Makes use of the BulkUploadForm for each individual step where appropriate.
6. Configurable onSubmit functionality.
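One possible shape for the configurable API client in requirement 4 is a small interface that each workspace provider implements. The sketch below is an assumption-laden illustration: `WorkspaceClient`, `InMemoryWorkspaceClient`, `missing_sample_files` and the method names are hypothetical, and the in-memory class is only a test double standing in for a real provider client.

```python
from typing import Dict, List, Optional, Protocol


class WorkspaceClient(Protocol):
    """Hypothetical interface a workspace provider client would satisfy."""

    def file_exists(self, path: str) -> bool: ...

    def upload_text(self, path: str, content: str) -> None: ...


class InMemoryWorkspaceClient:
    """Test double; a real client would call the provider's storage API."""

    def __init__(self, files: Optional[Dict[str, str]] = None) -> None:
        self.files: Dict[str, str] = dict(files or {})

    def file_exists(self, path: str) -> bool:
        return path in self.files

    def upload_text(self, path: str, content: str) -> None:
        self.files[path] = content


def missing_sample_files(client: WorkspaceClient,
                         id_to_path: Dict[str, str]) -> List[str]:
    """Return the individual IDs whose mapped sample file does not exist."""
    return sorted(
        individual
        for individual, path in id_to_path.items()
        if not client.file_exists(path)
    )
```

Because the form would depend only on the interface, the same mapping check could run against any provider, and tests could use the in-memory double without touching cloud storage.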

Mock UI

High-fidelity mock-ups of what we imagine our GCP workspace import form will look like:


hanars · 2021-10-13T15:38:47Z

So while this looks like awesome work for you guys, I don't think we would be interested in upstreaming it. Our process for onboarding non-AnVIL groups looks very different from yours, and in fact I think your workflow is so specific to your team that it would not be very extensible to other open source users either. As such, I think the maintenance overhead for us to keep our AnVIL flow working the way we need to without breaking your workflow would not really be worth it for us.

Given that, it's really up to you to determine whether you want to "abstract" the AnVIL workflow or just change it to meet your needs. Were I you I would probably choose the latter, as I think it would be more maintainable. If you do decide to go the abstracting route, I'm not really sure how valuable a lot of the abstraction would be. For instance, the LoadWorkspaceDataForm is just a lightweight wrapper around ReduxFormWrapper - since you want to show different fields and have different submit functionality, it seems like it would be much easier to create your own form with the desired behavior, rather than create an abstraction on an abstraction. If we were upstreaming it the abstraction would make sense, but given that we aren't I'm not sure it does.

Let me know if you have any other questions or advice as you work on this project though!

daniaki · 2021-10-15T01:55:31Z

Thanks for your suggestions! In light of your feedback, we have decided to implement this feature as an isolated component within our seqr fork so we can continue to merge upstream updates without any major conflicts.

daniaki added the enhancement New feature or request label Oct 12, 2021

daniaki changed the title ~~seqr workspace import abstraction~~ AnVIL workspace import abstraction Oct 12, 2021

daniaki mentioned this issue Oct 27, 2021

Seqr loading wizard #101

Draft
