-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GAKT 4.1.8.1 - PostprocessGermlineCNVCalls - Error "Records were not strictly sorted in dictionary order" #6924
Comments
Hi @ruslan-abasov, The dictionary file you attached (hg19.dict.txt) is in an idiosyncratic order (chr10-19, chr1, chr20-22, etc.) as are the scattered interval lists. It appears that this dictionary may have been sorted using I'm guessing that there is a discrepancy in dictionaries and that you might use one in the canonical order (chr1-22, etc.) elsewhere in the workflow, especially since the failing shard 4 includes chr18, chr19, and chr1. If If not, there may indeed be a bug lurking somewhere! |
Hi, @samuelklee ! Thanks for the quick response ! I confirm that I used the same dictionary file except maybe CollectReadCounts step. How can i verify contigs order in *.hdf5 files? |
You should be able to open them with a program like HDFView to reveal a directory-like structure; navigating to Now that I look at the history of code changes, it's possible that allowing the use of an inconsistent dictionary might have been introduced in #6330. Previous to this, I believe the sequence dictionary was pulled from files in the model/call directories, but it looks like the ability to override this dictionary with that provided by What happens if you try running without the optional @ldgauthier can you comment? I think I might've been out on leave when this PR went in, so I'm not sure of the context of that work or if we need to allow external dictionaries. Would it be possible to revert and just get the dictionary from the model/call directories as before (and perhaps check that |
Hmm, it looks like we could perform more checks upstream in GermlineCNVCaller as well. I think it might be also possible to introduce inconsistencies there by using read counts and sharded interval lists with different dictionaries or contig orders (which might have happened here). The canonical dictionary is taken from the first read count file, but the intervals pass through the EDIT: Seems like checking intervals for the dictionary may be something we should properly add at the engine level. See e.g. the following
We could also just require a sequence dictionary for the relevant tools and/or somehow try to reconcile with dictionaries from other inputs, but I think it's better to check and fail. |
Sorry, @ruslan-abasov, can you try another program, such as HDF Compass? I vaguely remember a previous user also having trouble with hdfview, perhaps due to the specific version of Java used, but I can't find the post. But see e.g. https://bugs.launchpad.net/ubuntu/+source/hdf5/+bug/1749448. |
As expected, the hdf5 file has a different contigs order. |
Thanks for confirming that @ruslan-abasov. Unfortunately, I would guess that the inconsistency was introduced in the GermlineCNVCaller step. It’s possible that you could edit the files manually so that you don’t have to rerun all GermlineCNVCaller shards; for example, you could check that all dictionaries in the output of the good shards (i.e., those that contain intervals that are correctly ordered with respect to either dictionary) are the correct dictionary used to generate the count files, reshard/reorder the intervals in the failing shards and rerun GermlineCNVCaller, then stitch everything back together with PostprocessGermlineCNVCalls. However, I think this will be a rather delicate surgery and it may be easy to mess up. I would just recommend fresh runs of GermlineCNVCaller with the correct dictionary and an appropriately ordered interval list. I would go so far as to recommend you delete and/or never use that dictionary again—such incorrectly ordered dictionaries are a frequent source of heartbreak! I would say that the code is working as intended and that the error message is sufficiently informative. However, we could certainly fail earlier, before the expensive GermlineCNVCaller step. As mentioned above, we will need to do some work to enable this; I would suggest:
EDIT^2: I think I misremembered the details of how #6330 hooked up the sequence dictionary and how getBestAvailableSequenceDictionary in GATKTool works (which probably explains why that comment was deleted...). Now that I actually go back and look, the Note that after all of this, it will still be possible to get into trouble at the gCNV step if you make funky shards (e.g., you could have shard 1 contain intervals from chr1 and chr3, and shard 2 contain intervals from chr2). I don't think it is possible to check for this case early, but you would still fail at PostprocessGermlineCNVCalls as above. Of course, all of these possibilities can be avoided by simply using the WDL, but it will be good to harden checks for those still working at the command line. @ldgauthier @droazen @mwalker174 what do you think? Happy to review later, but OK if I pass this off to you all? |
Can I replace dictionaries in *hdf5 files ? Will this avoid reruning GermlineCNVCaller step? |
Hi @ruslan-abasov, I believe your GermlineCNVCaller results should have inherited the correct dictionary from the count files. The issue is you created some GermlineCNVCaller shards (e.g., shard 4) with inappropriately ordered intervals (since these were instead ordered w.r.t. to the idiosyncratic dictionary you attached). However, I think if you just reshard and rerun GermlineCNVCaller for any such shards, you may be able to reuse most of your results. For example, you could take your shard 4 interval list, which contains intervals from chr18, chr19, and chr1, and reshard these intervals into two shards: 4a containing chr18-19 intervals, and 4b containing chr1 intervals. After rerunning 4a and 4b through GermlineCNVCaller, you should be able to use PostprocessGermlineCNVCalls to stitch together shards 4b, 1, 2, 3, and 4a, since these will be ordered w.r.t. the correct dictionary from the count files (i.e, they will contain intervals in the order chr1, chr10-19). Of course, you will want not want to perform this exact procedure; you'll want to generalize it to whatever will yield the correct order for all 10 of your shards across all contigs. Again, this may be error prone and I can't guarantee that it will be successful, since I haven't tried it myself. I would personally just rerun the pipeline. You might be able to cut down on runtime by using smaller shards (I believe we typically shard the entire genome into far more than 10 shards, which we usually run in parallel) and making sure you set parameters appropriately for WGS. @mwalker174 has the most experience running on WGS and should be able to provide you the latest recommendations, or you might be able to find them by searching GitHub or the GATK Forums. Your point is well taken about failing earlier, and I think I've outlined the best strategy above. It is impossible to catch all possible errors early, but for some we can certainly fail before the GermlineCNVCaller step. |
#6055 Bug Report
Affected tool(s) or class(es)
PostprocessGermlineCNVCalls
Affected version(s)
4.1.8.1
Description
Process of calling copy number segments and consolidate sample results with PostprocessGermlineCNVCalls aborts with error : "Records were not strictly sorted in dictionary order."
I tried to detect germline copy number in cohort mode on 73 wgs samples by this tutorial.To do this, i split genome into 10 parts.At the last stage PostprocessGermlineCNVCalls aborts with error : "Records were not strictly sorted in dictionary order", when i use all parts.
scattered.1-10.interval_list.txt
scattered.2-10.interval_list.txt
scattered.3-10.interval_list.txt
scattered.4-10.interval_list.txt
hg19.dict.txt
But if you use the first three, the program works out. -
12:43:27.310 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so Oct 29, 2020 12:43:27 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 12:43:27.439 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:43:27.439 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.1.8.1 12:43:27.439 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/ 12:43:27.439 INFO PostprocessGermlineCNVCalls - Executing as lmbs02@Lmbs01 on Linux v5.4.0-48-generic amd64 12:43:27.439 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 12:43:27.439 INFO PostprocessGermlineCNVCalls - Start Date/Time: October 29, 2020 12:43:27 PM MSK 12:43:27.439 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:43:27.439 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 2.23.0 12:43:27.440 INFO PostprocessGermlineCNVCalls - Picard Version: 2.22.8 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 12:43:27.440 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 12:43:27.440 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater 12:43:27.440 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater 12:43:27.440 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20 12:43:27.440 INFO PostprocessGermlineCNVCalls - Requester pays: disabled 12:43:27.440 INFO PostprocessGermlineCNVCalls - Initializing engine 12:43:31.326 INFO PostprocessGermlineCNVCalls - Done initializing engine 12:43:34.892 INFO ProgressMeter - Starting traversal 12:43:34.892 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute 12:43:34.893 INFO ProgressMeter - unmapped 0.0 0 NaN 12:43:34.893 INFO ProgressMeter - Traversal complete. Processed 0 total records in 0.0 minutes. 12:43:34.893 INFO PostprocessGermlineCNVCalls - Generating intervals VCF file... 12:43:34.915 INFO PostprocessGermlineCNVCalls - Writing intervals VCF file to /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19/vcfs/sample.vcf... 12:43:34.916 INFO PostprocessGermlineCNVCalls - Analyzing shard 1 / 3... 12:43:37.965 INFO PostprocessGermlineCNVCalls - Analyzing shard 2 / 3... 12:43:40.679 INFO PostprocessGermlineCNVCalls - Analyzing shard 3 / 3... 12:43:43.248 INFO PostprocessGermlineCNVCalls - Generating segments VCF file... 12:44:50.045 INFO PostprocessGermlineCNVCalls - Writing segments VCF file to /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19/vcfs/sample.segments.vcf... 12:44:50.129 INFO PostprocessGermlineCNVCalls - Generating denoised copy ratios... 12:44:50.928 INFO PostprocessGermlineCNVCalls - Writing denoised copy ratios to /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19/vcfs/sample.copy_ratios.tsv... 12:44:51.493 INFO PostprocessGermlineCNVCalls - PostprocessGermlineCNVCalls complete. 12:44:51.493 INFO PostprocessGermlineCNVCalls - Shutting down engine [October 29, 2020 12:44:51 PM MSK] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 1.40 minutes. Runtime.totalMemory()=4294443008 Using GATK jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar PostprocessGermlineCNVCalls --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-model/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-calls/ --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_ploidy//cohort-calls/ --sample-index 16 --output-genotyped-intervals /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.intervals.vcf --output-genotyped-segments /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.segments.vcf --sequence-dictionary /home/lmbs02/bio/databases/referenses/hg19_37/ucsc/hg19.dict --output-denoised-copy-ratios /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.copy_ratios.tsv
.In the same time, if you use the first 4 or the third and 4 at the same time, an error pops up.
12:49:08.552 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so Oct 29, 2020 12:49:08 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 12:49:08.687 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:49:08.687 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.1.8.1 12:49:08.687 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/ 12:49:08.687 INFO PostprocessGermlineCNVCalls - Executing as lmbs02@Lmbs01 on Linux v5.4.0-48-generic amd64 12:49:08.687 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 12:49:08.687 INFO PostprocessGermlineCNVCalls - Start Date/Time: October 29, 2020 12:49:08 PM MSK 12:49:08.687 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:49:08.687 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------ 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 2.23.0 12:49:08.688 INFO PostprocessGermlineCNVCalls - Picard Version: 2.22.8 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 12:49:08.688 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 12:49:08.688 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater 12:49:08.688 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater 12:49:08.688 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20 12:49:08.688 INFO PostprocessGermlineCNVCalls - Requester pays: disabled 12:49:08.688 INFO PostprocessGermlineCNVCalls - Initializing engine 12:49:12.598 INFO PostprocessGermlineCNVCalls - Done initializing engine 12:49:15.678 INFO PostprocessGermlineCNVCalls - Shutting down engine [October 29, 2020 12:49:15 PM MSK] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 0.12 minutes. Runtime.totalMemory()=2457862144 java.lang.IllegalArgumentException: Records were not strictly sorted in dictionary order. at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateIntervals(CopyNumberArgumentValidationUtils.java:60) at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.getShardedCollectionSortOrder(AbstractLocatableCollection.java:142) at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.onTraversalStart(PostprocessGermlineCNVCalls.java:297) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1047) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) Using GATK jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/lmbs02/bio/biosoft/gatk/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar PostprocessGermlineCNVCalls --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-model/ --model-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0004_of_10/cohort-model/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0001_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0002_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0003_of_10/cohort-calls/ --calls-shard-path /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_calls//temp_0004_of_10/cohort-calls/ --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//cohort_ploidy//cohort-calls/ --sample-index 16 --output-genotyped-intervals /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.intervals.vcf --output-genotyped-segments /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.segments.vcf --sequence-dictionary /home/lmbs02/bio/databases/referenses/hg19_37/ucsc/hg19.dict --output-denoised-copy-ratios /home/lmbs02/bio/work/gatk_cnv/genomed_genome_hg19//vcfs//sample.copy_ratios.tsv
The text was updated successfully, but these errors were encountered: