GenomicsDB: max # of alleles should be configurable #2687

droazen · 2017-05-09T18:38:52Z

We're seeing messages like the following when running GenomicsDBImport:

Column 948660 has too many alleles in the combined VCF record : 61 : current limit : 50. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Column 948710 has too many alleles in the combined VCF record : 83 : current limit : 50. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.

Is this limit of 50 configurable, if we wanted to raise it, and if not, could it be made configurable?
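
For context on why a limit exists at all: the warning says that fields such as PL have length equal to the number of genotypes, and that number grows quadratically with the allele count for diploid samples. The sketch below uses the standard multiset formula C(n + p - 1, p). The helper name `num_genotypes` is made up for illustration, and it assumes the count in the log message is the total number of alleles including the reference, which this thread does not confirm.

```python
from math import comb


def num_genotypes(num_alleles: int, ploidy: int = 2) -> int:
    """Count unordered genotypes for `num_alleles` alleles at the given ploidy.

    Assumption: `num_alleles` includes the reference allele.
    """
    return comb(num_alleles + ploidy - 1, ploidy)


# Allele counts taken from the warning messages above (50 is the limit).
for n in (50, 61, 83):
    print(n, num_genotypes(n))
# prints:
# 50 1275
# 61 1891
# 83 3486
```

Under that assumption, a single diploid PL field at the 83-allele site would carry 3486 values per sample, compared with 1275 at the current limit.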

droazen · 2017-05-09T18:39:07Z

Assigning to @kgururaj and @kdatta for an answer.

droazen · 2017-05-09T19:10:14Z

@lbergelson and @ldgauthier have confirmed that GATK CombineGVCFs (the predecessor to GenomicsDB) had this same limit, so GenomicsDB is not doing anything radically new here. This ticket is just to ensure that the limit is configurable if it isn't already.

kgururaj · 2017-05-09T20:56:45Z

Will add a variable to our Protobuf configuration object - the JSON already has an option to set this.

… sites-only query support, and bug fixes (#4645)

This PR addresses required changes in order to use latest version of GenomicsDB which exposes new functionality such as:

* Multi interval import and query support:
  * We create multiple arrays (directories) in a single workspace - one per interval. So, if you wish to import intervals ("chr1", [ 1, 100M ]) and ("chr2", [ 1, 100M ]), you end up with 2 directories/arrays in the workspace with names chr1$1$100M and chr2$1$100M. The array names depend on the partition bounds.
  * During the read phase, the user only supplies the workspace. The array names are obtained by scanning the entries in the workspace and reading the right arrays. For example, if you wish to read ("chr2", [ 50, 50M] ), then only the second array is queried. In the previous version of the tool, the array name was a constant - genomicsdb_array. The new version will be backward compatible with respect to reads. Hence, if a directory named genomicsdb_array is found in the workspace directory, it's passed as the array for the GenomicsDBFeatureReader; otherwise the array names are generated from the directory entry names.
  * Parallel import based on chromosome intervals. The number of threads to use can be specified as an integer argument to the executeImport call. If no argument is specified, the number of threads is determined by Java's ForkJoinPool (typically equal to the #cores in the system).
  * The max number of intervals to import in parallel can be controlled by the command line argument --max-num-intervals-to-import-in-parallel (default 1). Note that increasing parallelism increases the number of FeatureReaders opened to feed data to the importer. So, if you are using N threads and your batch size is B, you will have N*B feature readers open.
* Protobuf based API for import and read #3688 #2687
  * Option to produce GT field
  * Option to produce GT for spanning deletion based on min PL value
  * Doesn't support #4541 or #3689 yet - next version
* Bug fixes
  * Fix for #4716
  * More error messages
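
The read-side array selection described in the commit message can be sketched roughly as follows. This is an illustration only, not the GenomicsDB or GATK implementation: the function names `parse_array_name`, `arrays_for_query` and `demo` are hypothetical, and the sketch assumes array directories are named `contig$start$end` with integer bounds (the "100M" in the commit message is taken to be shorthand for 100000000).

```python
import os
import tempfile

LEGACY_ARRAY = "genomicsdb_array"  # constant array name used by the previous version


def parse_array_name(name):
    """Split 'contig$start$end' into (contig, start, end); assumes integer bounds."""
    contig, start, end = name.rsplit("$", 2)
    return contig, int(start), int(end)


def arrays_for_query(workspace, contig, start, end):
    """Return the array directories a query on contig:[start, end] would read."""
    dirs = sorted(
        d for d in os.listdir(workspace)
        if os.path.isdir(os.path.join(workspace, d))
    )
    # Backward compatibility: a legacy workspace holds one fixed-name array.
    if LEGACY_ARRAY in dirs:
        return [LEGACY_ARRAY]
    selected = []
    for d in dirs:
        if d.count("$") < 2:
            continue  # not named like an interval array
        a_contig, a_start, a_end = parse_array_name(d)
        # Keep arrays on the same contig whose bounds overlap the query.
        if a_contig == contig and a_start <= end and start <= a_end:
            selected.append(d)
    return selected


def demo():
    """Build one new-style and one legacy workspace and query both."""
    with tempfile.TemporaryDirectory() as ws:
        for name in ("chr1$1$100000000", "chr2$1$100000000"):
            os.mkdir(os.path.join(ws, name))
        new_layout = arrays_for_query(ws, "chr2", 50, 50_000_000)
    with tempfile.TemporaryDirectory() as ws:
        os.mkdir(os.path.join(ws, LEGACY_ARRAY))
        legacy_layout = arrays_for_query(ws, "chr2", 50, 50_000_000)
    return new_layout, legacy_layout
```

Calling `demo()` queries ("chr2", [50, 50M]) against both layouts: the new-style workspace should yield only the chr2 array, and the legacy workspace should yield `genomicsdb_array`.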

droazen · 2018-07-06T17:36:45Z

Implemented in #4645

droazen assigned kgururaj and unassigned kgururaj May 9, 2017

droazen added this to the beta milestone May 9, 2017

droazen added GenomicsDB question labels May 9, 2017

droazen mentioned this issue May 9, 2017

Too many alleles warning in GenotypeGVCFs #2688

Closed

droazen modified the milestones: 4.0 release, beta May 30, 2017

droazen removed this from the Engine-4.0 milestone Oct 17, 2017

francares mentioned this issue May 4, 2018

GenomicsDBImport: Modifications in order to address GATK issue #3269 #4645

Merged

droazen closed this as completed Jul 6, 2018
