Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CZID-7790: Expand cov viz threshold for ONT workflow #307

Merged
merged 7 commits into from
Dec 1, 2023

Conversation

vincent-czi
Copy link
Contributor

@vincent-czi vincent-czi commented Nov 21, 2023

CZID-7790

Expands the creation of coverage visualization files in the ONT workflow to happening for any number of NT reads, not just for when a taxon has NT contigs present.

Testing

Most of the work around this ticket was in verifying the results of the change. Here was my process.

  1. I uploaded a long-read sample as a starting place. From here on out, let's call this the OriginalRun The actual coverage viz files that get created are two flavors: the overall summary file coverage_viz_summary.json, and the specific viz files for each accession that we might show to the user ${ACCESSION_ID}coverage_viz.json (eg CP065714.1_coverage_viz.json). The cov viz files are generated in the WDL task GenerateCoverageViz.
  2. I went through the sample report for OriginalRun and identified six different taxa -- let's call these taxa bellwethers -- to use for testing purposes. I chose three bellwethers that had NT contigs, and three bellwethers that had NT reads, but no contigs.
  3. I went through the logs for OriginalRun and pulled the inputs JSON for the GenerateCoverageViz task. I then re-ran that task by itself on my own EC2 instance. Let's call this VanillaRun. I verified that for the three bellwethers with NT contigs, the cov viz JSON files were identical between OriginalRun and VanillaRun. [This was mostly a sanity check: it would have been surprising for them not to match up.]
  4. I made modifications to the code (what you see in this PR, by and large) for the workflow, and re-ran it. Let's call that ModdedRun.
  5. I verified that for the three NT Contig bellwethers, the cov viz JSON files were identical between OriginalRun/VanillaRun and ModdedRun. This shows that the code change had no impact on current functionality, as expected.
  6. I compared the total quantity of cov viz JSONs before and after the change. For this sample, BEFORE was 192 cov viz files, and AFTER was 555 cov viz files. This was mildly surprising: I had expected more in the BEFORE category, since each taxon with a contig can have up to 10 accessions each with its own cov viz, and I didn't expect there to be quite so large a delta moving to AFTER. But these numbers seem sane, so I think everything is working as expected.
  7. For the three NT loose-read only bellwethers (i.e., the non-contig bellwethers), I identified each accession appearing in those taxa, then downloaded all the cov viz JSONs for those accessions. Talking to Katrina, I learned that the source of truth for our cov viz data comes from the gsnap.blast.top.m8 file, so I downloaded that file from the OriginalRun as well. I then manually went through reading the m8 file for those accessions and comparing it to what was present in the corresponding cov viz JSON files. Everything matched as expected. (It's conceivable I missed something; manually reading those files is a bit difficult, but as far as I can tell everything matched as expected. Plus, from what I see in the way the code works, I think this part should either work or fail in a pretty obvious way.)
  8. [Not really testing, per se] I also checked if producing all the additional cov viz for the non-contig taxa added any additional runtime to the GenerateCoverageViz task: it did not. I did three identical input VanillaRuns, and later did three identical input ModdedRuns: the ModdedRuns took no additional time to run. There is some variance within each set, but the average of the ModdedRuns actually wound up being slightly under the average of the VanillaRuns, so we can take that the additional cov viz production adds no time to the process. (This isn't too surprising: the gsnap.blast.top.m8 file gets produced the same either way in a previous task, and the meat of GenerateCoverageViz is pretty much just reading those results out and formatting them to how we structure our cov viz JSON files, no heavy algorithmic stuff in there.)

Copy link
Collaborator

@rzlim08 rzlim08 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and ran successfully on staging. Just a nit on the new attribute name. Otherwise looks good to me.

How was this reviewed?

  • Read through the code
  • Ran the code and examined output files
  • Stepped through the code

Are there files that were not reviewed?

  • Ignored whitespace changes in run.wdl

Are there tests included

  • Yes
  • No - and why? - I'm not sure it's possible to test the idseq-dag functions anymore?

@@ -35,6 +35,7 @@ def run(self):
max_num_bins_coverage = self.additional_attributes.get("max_num_bins_coverage", MAX_NUM_BINS_COVERAGE)
num_accessions_per_taxon = self.additional_attributes.get("num_accessions_per_taxon", NUM_ACCESSIONS_PER_TAXON)
min_contig_size = self.additional_attributes.get("min_contig_size", MIN_CONTIG_SIZE)
is_long_read_run = self.additional_attributes.get("is_long_read_run", False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be more straightforward to name this to directly describe what the flag does. i.e. keep_taxons_with_only_reads or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

@vincent-czi vincent-czi merged commit a46430b into main Dec 1, 2023
@vincent-czi vincent-czi deleted the vince/czid-7790-ont-viz-tweak branch December 1, 2023 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants