
GATK 4.1.8.1 - PostprocessGermlineCNVCalls - Error "Records were not strictly sorted in dictionary order" #6924

Open
ruslan-abasov opened this issue Oct 29, 2020 · 10 comments


@ruslan-abasov

ruslan-abasov commented Oct 29, 2020

#6055 Bug Report

Affected tool(s) or class(es)

PostprocessGermlineCNVCalls

Affected version(s)

4.1.8.1

Description

The process of calling copy number segments and consolidating sample results with PostprocessGermlineCNVCalls aborts with the error "Records were not strictly sorted in dictionary order."
I tried to detect germline copy number variants in cohort mode on 73 WGS samples following this tutorial. To do this, I split the genome into 10 parts. At the last stage, PostprocessGermlineCNVCalls aborts with the error "Records were not strictly sorted in dictionary order" when I use all parts.
scattered.1-10.interval_list.txt
scattered.2-10.interval_list.txt
scattered.3-10.interval_list.txt
scattered.4-10.interval_list.txt
hg19.dict.txt

But if only the first three shards are used, the tool completes successfully:
12:43:27.310 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so Oct 29, 2020 12:43:27 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 12:43:27.439 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:43:27.439 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.1.8.1 12:43:27.439 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/ 12:43:27.439 INFO PostprocessGermlineCNVCalls - Executing as lmbs02@Lmbs01 on Linux v5.4.0-48-generic amd64 12:43:27.439 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 12:43:27.439 INFO PostprocessGermlineCNVCalls - Start Date/Time: October 29, 2020 12:43:27 PM MSK 12:43:27.439 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:43:27.439 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 2.23.0 12:43:27.440 INFO PostprocessGermlineCNVCalls - Picard Version: 2.22.8 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 12:43:27.440 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater 12:43:27.440 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater 12:43:27.440 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20 12:43:27.440 INFO PostprocessGermlineCNVCalls - Requester pays: disabled 12:43:27.440 INFO PostprocessGermlineCNVCalls - Initializing engine 12:43:31.326 INFO PostprocessGermlineCNVCalls - Done initializing engine 12:43:34.892 INFO ProgressMeter - Starting traversal 12:43:34.892 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute 12:43:34.893 INFO ProgressMeter - unmapped 0.0 0 NaN 12:43:34.893 INFO ProgressMeter - Traversal complete. Processed 0 total records in 0.0 minutes. 12:43:34.893 INFO PostprocessGermlineCNVCalls - Generating intervals VCF file... 12:43:34.915 INFO PostprocessGermlineCNVCalls - Writing intervals VCF file to /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19/vcfs/sample.vcf... 12:43:34.916 INFO PostprocessGermlineCNVCalls - Analyzing shard 1 / 3... 12:43:37.965 INFO PostprocessGermlineCNVCalls - Analyzing shard 2 / 3... 12:43:40.679 INFO PostprocessGermlineCNVCalls - Analyzing shard 3 / 3... 12:43:43.248 INFO PostprocessGermlineCNVCalls - Generating segments VCF file... 12:44:50.045 INFO PostprocessGermlineCNVCalls - Writing segments VCF file to /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19/vcfs/sample.segments.vcf... 12:44:50.129 INFO PostprocessGermlineCNVCalls - Generating denoised copy ratios... 12:44:50.928 INFO PostprocessGermlineCNVCalls - Writing denoised copy ratios to /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19/vcfs/sample.copy_ratios.tsv... 12:44:51.493 INFO PostprocessGermlineCNVCalls - PostprocessGermlineCNVCalls complete. 
12:44:51.493 INFO PostprocessGermlineCNVCalls - Shutting down engine [October 29, 2020 12:44:51 PM MSK] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 1.40 minutes. Runtime.totalMemory()=4294443008 Using GATK jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar PostprocessGermlineCNVCalls --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-model/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-calls/ --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_ploidy//cohort-calls/ --sample-index 16 --output-genotyped-intervals /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.intervals.vcf --output-genotyped-segments /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.segments.vcf --sequence-dictionary /home/lmbs02/bio/databases/referenses/hg19_37/ucsc/hg19.dict --output-denoised-copy-ratios /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.copy_ratios.tsv .

However, if the first four shards are used, or the third and fourth together, the error appears:
12:49:08.552 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so Oct 29, 2020 12:49:08 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 12:49:08.687 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:49:08.687 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.1.8.1 12:49:08.687 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/ 12:49:08.687 INFO PostprocessGermlineCNVCalls - Executing as lmbs02@Lmbs01 on Linux v5.4.0-48-generic amd64 12:49:08.687 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 12:49:08.687 INFO PostprocessGermlineCNVCalls - Start Date/Time: October 29, 2020 12:49:08 PM MSK 12:49:08.687 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:49:08.687 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 2.23.0 12:49:08.688 INFO PostprocessGermlineCNVCalls - Picard Version: 2.22.8 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 12:49:08.688 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater 12:49:08.688 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater 12:49:08.688 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20 12:49:08.688 INFO PostprocessGermlineCNVCalls - Requester pays: disabled 12:49:08.688 INFO PostprocessGermlineCNVCalls - Initializing engine 12:49:12.598 INFO PostprocessGermlineCNVCalls - Done initializing engine 12:49:15.678 INFO PostprocessGermlineCNVCalls - Shutting down engine [October 29, 2020 12:49:15 PM MSK] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 0.12 minutes. Runtime.totalMemory()=2457862144 java.lang.IllegalArgumentException: Records were not strictly sorted in dictionary order. 
at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateIntervals(CopyNumberArgumentValidationUtils.java:60) at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.getShardedCollectionSortOrder(AbstractLocatableCollection.java:142) at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.onTraversalStart(PostprocessGermlineCNVCalls.java:297) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1047) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) Using GATK jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar PostprocessGermlineCNVCalls --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0004_of_10/cohort-model/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0004_of_10/cohort-calls/ --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_ploidy//cohort-calls/ --sample-index 16 --output-genotyped-intervals /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.intervals.vcf --output-genotyped-segments /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.segments.vcf --sequence-dictionary /home/lmbs02/bio/databases/referenses/hg19_37/ucsc/hg19.dict --output-denoised-copy-ratios /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.copy_ratios.tsv

@samuelklee
Contributor

samuelklee commented Oct 29, 2020

Hi @ruslan-abasov,

The dictionary file you attached (hg19.dict.txt) is in an idiosyncratic order (chr10-19, chr1, chr20-22, etc.) as are the scattered interval lists. It appears that this dictionary may have been sorted using bash sort or a similar command at some point. Can you confirm that this is the exact same dictionary you provide to the PostprocessGermlineCNVCalls --sequence-dictionary argument (/home/lmbs02/bio/databases/referenses/hg19_37/ucsc/hg19.dict)?

I'm guessing that there is a discrepancy in dictionaries and that you might use one in the canonical order (chr1-22, etc.) elsewhere in the workflow, especially since the failing shard 4 includes chr18, chr19, and chr1. If /home/lmbs02/bio/databases/referenses/hg19_37/ucsc/hg19.dict is in canonical order, I would expect that running shards 1-3 would be fine, since they contain intervals chr10-18 in the canonical order. You should ensure that the dictionaries and interval lists you use in the entire workflow are consistent, including the PreprocessIntervals and CollectReadCounts steps.
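For a quick check (just a sketch; the file names simply mirror your attachments), you can print the contig order of the dictionary and of a scattered interval list from the command line:

    # Contig order in the sequence dictionary (@SQ header lines are tab-delimited)
    grep '^@SQ' hg19.dict | cut -f2

    # Contig order implied by a scattered interval list (first column of the non-header lines)
    grep -v '^@' scattered.4-10.interval_list.txt | cut -f1 | uniq

All of these should agree with the canonical dictionary used for CollectReadCounts.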

If not, there may indeed be a bug lurking somewhere!

@ruslan-abasov
Author

Hi @samuelklee! Thanks for the quick response! I confirm that I used the same dictionary file everywhere, except possibly at the CollectReadCounts step. How can I verify the contig order in the *.hdf5 files?

@samuelklee
Contributor

You should be able to open them with a program like HDFView to reveal a directory-like structure; navigating to locatable_metadata/sequence_dictionary will give you the answer.
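If a GUI viewer is inconvenient, the h5dump utility from the HDF5 command-line tools should show the same dataset (the file name below is just illustrative):

    h5dump -d /locatable_metadata/sequence_dictionary sample.counts.hdf5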

Now that I look at the history of code changes, it's possible that allowing the use of an inconsistent dictionary was introduced in #6330. Before this, I believe the sequence dictionary was pulled from files in the model/call directories, but it looks like the ability to override this dictionary with the one provided by --sequence-dictionary was added in that PR.

What happens if you try running without the optional --sequence-dictionary argument? (It's possible that this still won't fix anything if you used an inconsistent dictionary in CollectReadCounts, but I'm just curious to see what happens.)

@ldgauthier can you comment? I think I might've been out on leave when this PR went in, so I'm not sure of the context of that work or if we need to allow external dictionaries. Would it be possible to revert and just get the dictionary from the model/call directories as before (and perhaps check that --sequence-dictionary is not provided), but keep the indexing?

@ruslan-abasov
Author

I couldn't open the HDF5 files.
Screenshot_2020-10-29_17-20-17
Running without the optional --sequence-dictionary argument also causes an error.
17:00:00.556 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/lmbs02/bio/biosoft/gatk/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so Oct 29, 2020 5:00:00 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 17:00:00.683 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 17:00:00.684 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.1.9.0 17:00:00.684 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/ 17:00:00.684 INFO PostprocessGermlineCNVCalls - Executing as lmbs02@Lmbs01 on Linux v5.4.0-48-generic amd64 17:00:00.684 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 17:00:00.684 INFO PostprocessGermlineCNVCalls - Start Date/Time: October 29, 2020 5:00:00 PM MSK 17:00:00.684 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 17:00:00.684 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 17:00:00.684 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 2.23.0 17:00:00.684 INFO PostprocessGermlineCNVCalls - Picard Version: 2.23.3 17:00:00.684 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2 17:00:00.685 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 17:00:00.685 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 17:00:00.685 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 17:00:00.685 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater 17:00:00.685 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater 17:00:00.685 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20 17:00:00.685 INFO PostprocessGermlineCNVCalls - Requester pays: disabled 17:00:00.685 INFO PostprocessGermlineCNVCalls - Initializing engine 17:00:04.480 INFO PostprocessGermlineCNVCalls - Done initializing engine 17:00:07.582 INFO PostprocessGermlineCNVCalls - Shutting down engine [October 29, 2020 5:00:07 PM MSK] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 0.12 minutes. Runtime.totalMemory()=2468347904 java.lang.IllegalArgumentException: Records were not strictly sorted in dictionary order. 
at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateIntervals(CopyNumberArgumentValidationUtils.java:60) at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.getShardedCollectionSortOrder(AbstractLocatableCollection.java:142) at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.onTraversalStart(PostprocessGermlineCNVCalls.java:297) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1047) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) Using GATK jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar PostprocessGermlineCNVCalls --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0004_of_10/cohort-model/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0004_of_10/cohort-calls/ --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_ploidy//cohort-calls/ --sample-index 16 --output-genotyped-intervals /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.intervals.vcf --output-genotyped-segments /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.segments.vcf --output-denoised-copy-ratios /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.copy_ratios.tsv

@samuelklee
Contributor

samuelklee commented Oct 29, 2020

Hmm, it looks like we could perform more checks upstream in GermlineCNVCaller as well. I think it might also be possible to introduce inconsistencies there by using read counts and sharded interval lists with different dictionaries or contig orders (which might have happened here). The canonical dictionary is taken from the first read-count file, but the intervals pass through the -L machinery, and I don't think any dictionary information is checked (probably because it will not be present if Picard interval lists were not used as input); intervals are also just (re)sorted by the read-count dictionary before being used to subset the counts.

EDIT: Seems like checking intervals for the dictionary may be something we should properly add at the engine level. See e.g. the following TODO:

   /**
     * Returns the "best available" sequence dictionary or {@code null} if there is no single best dictionary.
     *
     * The algorithm for selecting the best dictionary is as follows:
     * 1) If a master sequence dictionary was specified, use that dictionary
     * 2) if there is a reference, then the best dictionary is the reference sequence dictionary
     * 3) Otherwise, if there are reads, then the best dictionary is the sequence dictionary constructed from the reads.
     * 4) Otherwise, if there are features and the feature data source has only one dictionary, then that one is the best dictionary.
     * 5) Otherwise, the result is {@code null}.
     *
     * TODO: check interval file(s) as well for a sequence dictionary
     *     ...
     * @return best available sequence dictionary given our inputs or {@code null} if no one dictionary is the best one.
     */
    public SAMSequenceDictionary getBestAvailableSequenceDictionary() {

We could also just require a sequence dictionary for the relevant tools and/or somehow try to reconcile with dictionaries from other inputs, but I think it's better to check and fail.

@samuelklee
Contributor

samuelklee commented Oct 29, 2020

Sorry, @ruslan-abasov, can you try another program, such as HDF Compass? I vaguely remember a previous user also having trouble with hdfview, perhaps due to the specific version of Java used, but I can't find the post. But see e.g. https://bugs.launchpad.net/ubuntu/+source/hdf5/+bug/1749448.

@ruslan-abasov
Author

As expected, the HDF5 file has a different contig order.
HDF5 "sample.counts.hdf5" { GROUP "/locatable_metadata/" { DATASET "sequence_dictionary" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( 1 ) / ( 1 ) } DATA { (0): "@HD VN:1.6 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @SQ SN:chrM LN:16571 " } } } }

@samuelklee
Contributor

samuelklee commented Oct 30, 2020

Thanks for confirming that @ruslan-abasov. Unfortunately, I would guess that the inconsistency was introduced in the GermlineCNVCaller step.

It’s possible that you could edit the files manually so that you don’t have to rerun all GermlineCNVCaller shards; for example, you could check that all dictionaries in the output of the good shards (i.e., those that contain intervals that are correctly ordered with respect to either dictionary) are the correct dictionary used to generate the count files, reshard/reorder the intervals in the failing shards and rerun GermlineCNVCaller, then stitch everything back together with PostprocessGermlineCNVCalls.

However, I think this will be a rather delicate surgery and it may be easy to mess up. I would just recommend fresh runs of GermlineCNVCaller with the correct dictionary and an appropriately ordered interval list. I would go so far as to recommend you delete and/or never use that dictionary again—such incorrectly ordered dictionaries are a frequent source of heartbreak!
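If you do rerun, something along these lines should produce a consistently ordered dictionary and shards (a sketch only; file names and the scatter settings are placeholders to adapt to your setup):

    # Regenerate the dictionary directly from the reference FASTA so the contig order matches it
    gatk CreateSequenceDictionary -R hg19.fa -O hg19.dict

    # Re-shard the preprocessed interval list into consistently ordered scatter shards
    gatk IntervalListTools \
        --INPUT cohort.preprocessed.interval_list \
        --SUBDIVISION_MODE INTERVAL_COUNT \
        --SCATTER_COUNT 10 \
        --OUTPUT scatter_dir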

I would say that the code is working as intended and that the error message is sufficiently informative. However, we could certainly fail earlier, before the expensive GermlineCNVCaller step. As mentioned above, we will need to do some work to enable this; I would suggest:

  1. we enable passing of dictionaries from -L Picard interval lists at the engine level (and I would add consistency checks if multiple interval lists are provided here as well),
  2. we add checks to all relevant gCNV tools of read-count dictionaries against the intervals dictionary,
  3. we change the behavior of CopyNumberArgumentValidationUtils.resolveIntervals so that it fails if provided an unsorted IAC, rather than sorting the contained intervals w.r.t. the count dictionary upon creation of the returned SimpleIntervalCollection (this can be done independently of the first two items and would have caused the failing shard to fail earlier; however, the other two items are required to cause all shards to fail earlier),
  4. we revert the change made to PostprocessGermlineCNVCalls, so that external dictionaries provided via --sequence-dictionary do not override those in the count files, and perhaps fail if one is provided for any of the tools (I don’t recall exactly how VCF indexing is triggered by providing one, as seems to be indicated by the tutorial, but hopefully we can disallow external dictionaries while still taking advantage of the relevant engine features for VCF writing). EDIT: Went digging in Slack to try to remind myself of the context of these changes, and found the following PR comment from 1/7 (although it seems to have mysteriously disappeared from GitHub):

Just so I understand, are we allowing overriding of the sequence dictionary in the shards (and skipping the consistency check) by allowing the parameter --sequence-dictionary to be specified? If so, we might want to document. Otherwise, I'd be inclined to enforce using the sequence dictionary in the shards (and ensuring the consistency check across shards is performed) by changing the null check in getBestAvailableSequenceDictionary to a check that the dictionary has not been set via the command line.

EDIT^2: I think I misremembered the details of how #6330 hooked up the sequence dictionary and how getBestAvailableSequenceDictionary in GATKTool works (which probably explains why that comment was deleted...). Now that I actually go back and look, the --sequence-dictionary is not hooked up at all, so there is no change to revert in point 4!

Note that after all of this, it will still be possible to get into trouble at the gCNV step if you make funky shards (e.g., you could have shard 1 contain intervals from chr1 and chr3, and shard 2 contain intervals from chr2). I don't think it is possible to check for this case early, but you would still fail at PostprocessGermlineCNVCalls as above.

Of course, all of these possibilities can be avoided by simply using the WDL, but it will be good to harden checks for those still working at the command line.

@ldgauthier @droazen @mwalker174 what do you think? Happy to review later, but OK if I pass this off to you all?

@ruslan-abasov
Author

Can I replace the dictionaries in the *.hdf5 files? Would this avoid rerunning the GermlineCNVCaller step?
It's just that this step took quite a long time on my machine, about 42 hours for each shard.
This is why it seems to me that reporting the error at an early stage would be more useful.

@samuelklee
Contributor

samuelklee commented Nov 1, 2020

Hi @ruslan-abasov,

I believe your GermlineCNVCaller results should have inherited the correct dictionary from the count files. The issue is that you created some GermlineCNVCaller shards (e.g., shard 4) with inappropriately ordered intervals (since these were instead ordered w.r.t. the idiosyncratic dictionary you attached). However, I think if you just reshard and rerun GermlineCNVCaller for any such shards, you may be able to reuse most of your results.

For example, you could take your shard 4 interval list, which contains intervals from chr18, chr19, and chr1, and reshard these intervals into two shards: 4a containing the chr18-19 intervals, and 4b containing the chr1 intervals. After rerunning 4a and 4b through GermlineCNVCaller, you should be able to use PostprocessGermlineCNVCalls to stitch together shards 4b, 1, 2, 3, and 4a, since these will be ordered w.r.t. the correct dictionary from the count files (i.e., they will contain intervals in the order chr1, chr10-19). Of course, you will not want to perform this exact procedure verbatim; you'll want to generalize it to whatever yields the correct order for all 10 of your shards across all contigs.
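A minimal sketch of that split, assuming Picard-style interval lists (non-header lines are tab-delimited with the contig in the first column; file names just mirror the attachment above):

    # Keep the @ header lines plus only the intervals for the desired contigs
    awk '/^@/ || $1 == "chr18" || $1 == "chr19"' scattered.4-10.interval_list.txt > shard4a.interval_list
    awk '/^@/ || $1 == "chr1"' scattered.4-10.interval_list.txt > shard4b.interval_list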

Again, this may be error prone and I can't guarantee that it will be successful, since I haven't tried it myself. I would personally just rerun the pipeline. You might be able to cut down on runtime by using smaller shards (I believe we typically shard the entire genome into far more than 10 shards, which we usually run in parallel) and making sure you set parameters appropriately for WGS. @mwalker174 has the most experience running on WGS and should be able to provide you the latest recommendations, or you might be able to find them by searching GitHub or the GATK Forums.

Your point is well taken about failing earlier, and I think I've outlined the best strategy above. It is impossible to catch all possible errors early, but for some we can certainly fail before the GermlineCNVCaller step.
