CZID-7790: Expand cov viz threshold for ONT workflow #307

vincent-czi · 2023-11-21T07:05:38Z

Expands the creation of coverage visualization files in the ONT workflow to happening for any number of NT reads, not just for when a taxon has NT contigs present.

Testing

Most of the work around this ticket was in verifying the results of the change. Here was my process.

I uploaded a long-read sample as a starting place. From here on out, let's call this the OriginalRun The actual coverage viz files that get created are two flavors: the overall summary file coverage_viz_summary.json, and the specific viz files for each accession that we might show to the user ${ACCESSION_ID}coverage_viz.json (eg CP065714.1_coverage_viz.json). The cov viz files are generated in the WDL task GenerateCoverageViz.
I went through the sample report for OriginalRun and identified six different taxa -- let's call these taxa bellwethers -- to use for testing purposes. I chose three bellwethers that had NT contigs, and three bellwethers that had NT reads, but no contigs.
I went through the logs for OriginalRun and pulled the inputs JSON for the GenerateCoverageViz task. I then re-ran that task by itself on my own EC2 instance. Let's call this VanillaRun. I verified that for the three bellwethers with NT contigs, the cov viz JSON files were identical between OriginalRun and VanillaRun. [This was mostly a sanity check: it would have been surprising for them not to match up.]
I made modifications to the code (what you see in this PR, by and large) for the workflow, and re-ran it. Let's call that ModdedRun.
I verified that for the three NT Contig bellwethers, the cov viz JSON files were identical between OriginalRun/VanillaRun and ModdedRun. This shows that the code change had no impact on current functionality, as expected.
I compared the total quantity of cov viz JSONs before and after the change. For this sample, BEFORE was 192 cov viz files, and AFTER was 555 cov viz files. This was mildly surprising: I had expected more in the BEFORE category, since each taxon with a contig can have up to 10 accessions each with its own cov viz, and I didn't expect there to be quite so large a delta moving to AFTER. But these numbers seem sane, so I think everything is working as expected.
For the three NT loose-read only bellwethers (i.e., the non-contig bellwethers), I identified each accession appearing in those taxa, then downloaded all the cov viz JSONs for those accessions. Talking to Katrina, I learned that the source of truth for our cov viz data comes from the gsnap.blast.top.m8 file, so I downloaded that file from the OriginalRun as well. I then manually went through reading the m8 file for those accessions and comparing it to what was present in the corresponding cov viz JSON files. Everything matched as expected. (It's conceivable I missed something; manually reading those files is a bit difficult, but as far as I can tell everything matched as expected. Plus, from what I see in the way the code works, I think this part should either work or fail in a pretty obvious way.)
[Not really testing, per se] I also checked if producing all the additional cov viz for the non-contig taxa added any additional runtime to the GenerateCoverageViz task: it did not. I did three identical input VanillaRuns, and later did three identical input ModdedRuns: the ModdedRuns took no additional time to run. There is some variance within each set, but the average of the ModdedRuns actually wound up being slightly under the average of the VanillaRuns, so we can take that the additional cov viz production adds no time to the process. (This isn't too surprising: the gsnap.blast.top.m8 file gets produced the same either way in a previous task, and the meat of GenerateCoverageViz is pretty much just reading those results out and formatting them to how we structure our cov viz JSON files, no heavy algorithmic stuff in there.)

…iz-tweak

rzlim08

This looks good and ran successfully on staging. Just a nit on the new attribute name. Otherwise looks good to me.

How was this reviewed?

Read through the code
Ran the code and examined output files
Stepped through the code

Are there files that were not reviewed?

Ignored whitespace changes in run.wdl

Are there tests included

Yes
No - and why? - I'm not sure it's possible to test the idseq-dag functions anymore?

rzlim08 · 2023-11-22T20:10:57Z

lib/idseq-dag/idseq_dag/steps/generate_coverage_viz.py

@@ -35,6 +35,7 @@ def run(self):
        max_num_bins_coverage = self.additional_attributes.get("max_num_bins_coverage", MAX_NUM_BINS_COVERAGE)
        num_accessions_per_taxon = self.additional_attributes.get("num_accessions_per_taxon", NUM_ACCESSIONS_PER_TAXON)
        min_contig_size = self.additional_attributes.get("min_contig_size", MIN_CONTIG_SIZE)
+        is_long_read_run = self.additional_attributes.get("is_long_read_run", False)


It might be more straightforward to name this to directly describe what the flag does. i.e. keep_taxons_with_only_reads or something

Good point!

…iz-tweak

vincent-czi added 3 commits November 1, 2023 17:06

WIP: long read runs skip over taxon removal

fb5888b

Finalize changes to cov viz for ONT

d89029f

Merge remote-tracking branch 'origin/main' into vince/czid-7790-ont-v…

8c60d19

…iz-tweak

vincent-czi requested review from morsecodist and rzlim08 November 21, 2023 07:05

morsecodist approved these changes Nov 22, 2023

View reviewed changes

rzlim08 approved these changes Nov 22, 2023

View reviewed changes

vincent-czi added 4 commits November 22, 2023 14:51

Use more descriptive var name

d25e131

Merge remote-tracking branch 'origin/main' into vince/czid-7790-ont-v…

b7c6929

…iz-tweak

Update changelog

5159039

Changelog, fix version to be next one coming out

40288e1

vincent-czi merged commit a46430b into main Dec 1, 2023

vincent-czi deleted the vince/czid-7790-ont-viz-tweak branch December 1, 2023 17:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CZID-7790: Expand cov viz threshold for ONT workflow #307

CZID-7790: Expand cov viz threshold for ONT workflow #307

vincent-czi commented Nov 21, 2023 •

edited by jira bot

Loading

rzlim08 left a comment

rzlim08 Nov 22, 2023

vincent-czi Nov 22, 2023

CZID-7790: Expand cov viz threshold for ONT workflow #307

CZID-7790: Expand cov viz threshold for ONT workflow #307

Conversation

vincent-czi commented Nov 21, 2023 • edited by jira bot Loading

Testing

rzlim08 left a comment

Choose a reason for hiding this comment

How was this reviewed?

Are there files that were not reviewed?

Are there tests included

rzlim08 Nov 22, 2023

Choose a reason for hiding this comment

vincent-czi Nov 22, 2023

Choose a reason for hiding this comment

vincent-czi commented Nov 21, 2023 •

edited by jira bot

Loading