Improve ICA Pipeline Robustness Against Failures #1118

Open
wants to merge 72 commits into
base: main


michael-harper
Contributor

@michael-harper michael-harper commented Jan 24, 2025

This PR addresses challenges in improving the robustness and usability of the DRAGEN ICA pipeline, particularly when managing stage dependencies and output file checks. The current system’s reliance on the existence of all outputs from a previous stage makes it difficult to handle scenarios like canceling a pipeline run when only partial outputs (e.g., pipeline ID) are available.

Slack thread here

Problem

When monitoring the status of an ICA pipeline run, the Monitor stage wrote its own output. If the pipeline failed in ICA, the Monitor stage still wrote that output, so when we tried to re-run the stage, the output already existed and the stage was not triggered again. This left us having to manually delete a stage's outputs in GCP just to re-run a pipeline, which is not ideal even if failure rates are low. We therefore merged the Monitor and align+genotype stages into a single stage, with the two steps expressed as job dependencies, and gave the combined stage two expected_outputs. However, this caused a similar issue with the cancellation stage (a stage designed to cancel a pipeline run if the user wants to), detailed as follows:

For example, the CancelIcaPipelineRun stage depends on the AlignGenotypeWithDragen stage. This means it won’t execute unless all outputs from AlignGenotypeWithDragen—including _success.json and _pipeline_id.json—exist. This behavior prevents the cancellation of a pipeline when only the pipeline ID is available.
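
To make both failure modes concrete, here is a minimal sketch of the output-existence checks described above. It is a toy model, not the actual workflow framework: the `Stage` class, `should_run`, `can_start`, and the `_monitor.json` file name are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stand-in for a workflow stage; not the real pipeline API."""

    name: str
    expected_outputs: list[str]
    requires: list["Stage"] = field(default_factory=list)


def should_run(stage: Stage, existing: set[str]) -> bool:
    """A stage is skipped when every one of its expected outputs already exists."""
    return not all(path in existing for path in stage.expected_outputs)


def can_start(stage: Stage, existing: set[str]) -> bool:
    """A stage can start only when ALL outputs of every upstream stage exist."""
    return all(
        path in existing
        for upstream in stage.requires
        for path in upstream.expected_outputs
    )


# Case 1: the Monitor stage wrote its output even though the ICA run failed,
# so a re-run is skipped until the file is deleted by hand.
# "_monitor.json" is a made-up file name for this sketch.
monitor = Stage("Monitor", ["_monitor.json"])
monitor_reruns = should_run(monitor, existing={"_monitor.json"})

# Case 2: only the pipeline ID exists, so the dependent cancel stage never starts.
align = Stage("AlignGenotypeWithDragen", ["_success.json", "_pipeline_id.json"])
cancel = Stage("CancelIcaPipelineRun", ["_cancelled.json"], requires=[align])
cancel_can_start = can_start(cancel, existing={"_pipeline_id.json"})

print(monitor_reruns, cancel_can_start)
```

In this model the cancel stage is blocked precisely when it is needed most: while the run is still in progress and `_success.json` has not been written yet.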

Proposed Solutions

This PR implements a solution which involves merging CancelIcaPipelineRun into AlignGenotypeWithDragen (and renaming the stage to ManageDragenPipeline). The combined stage now:
1. Handles pipeline aligning+genotyping, monitoring and cancellation together.
2. Triggers cancellation only if the configuration enables it via the [ica.pipelines.cancel_cohort_run] setting (e.g., [ica.pipelines.cancel_cohort_run] = false leaves cancellation off).
3. Simplifies dependencies, converting the former stage relationship into job dependencies, making it easier to manage file existence checks.
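
The three points above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this PR: the `Job` class, the `manage_dragen_pipeline_jobs` function, and the job names are hypothetical, and the config is modelled as a plain nested dict with the `cancel_cohort_run` key read from under `ica.pipelines`.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for a batch job; not the real job API."""

    name: str
    depends_on: list["Job"] = field(default_factory=list)


def manage_dragen_pipeline_jobs(config: dict) -> list[Job]:
    """Return the jobs the combined stage would queue for one cohort.

    Assumption: cancel_cohort_run = true means "cancel the existing run".
    """
    pipelines_cfg = config.get("ica", {}).get("pipelines", {})

    # Return early when cancellation is requested, so that no align or
    # monitor jobs are generated alongside the cancel job.
    if pipelines_cfg.get("cancel_cohort_run", False):
        return [Job("cancel_ica_pipeline_run")]

    # Otherwise launch the run and monitor it. Monitoring is a job-level
    # dependency of the launch rather than a separate stage.
    align = Job("align_genotype")
    monitor = Job("monitor_pipeline_run", depends_on=[align])
    return [align, monitor]


normal_jobs = manage_dragen_pipeline_jobs(
    {"ica": {"pipelines": {"cancel_cohort_run": False}}}
)
cancel_jobs = manage_dragen_pipeline_jobs(
    {"ica": {"pipelines": {"cancel_cohort_run": True}}}
)
print([job.name for job in normal_jobs], [job.name for job in cancel_jobs])
```

Because ordering is expressed between jobs inside one stage, the cancel path in this sketch no longer waits on another stage's full set of expected outputs.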

MattWellie and others added 29 commits January 14, 2025 10:30
…ts passed a PythonResult or read in from previous json file as a string
…ts passed a PythonResult or read in from previous json file as a string
…t to Path object instead of string so that can be used in Cancel stage
…eWithDragen if we want to resume a monitoring stage
…urn if requested to cancel via config as we don't want to generate Align or Monitor jobs if so
…capsulate how the stage is a catch-all for launching, monitoring, and cancelling pipeline runs
…'status' of the file no longer seem to be nested inside 'id'

3 participants