Skip to content
This repository has been archived by the owner on Jul 17, 2023. It is now read-only.

invalid base (char): '=' / erroneous "Failed to populate reference' condition when calling from CRAM files #109

Closed
bw2 opened this issue Nov 10, 2017 · 13 comments

Comments

@bw2
Copy link

bw2 commented Nov 10, 2017

I've tried running Manta with default settings on multiple GRCh38-aligned WGS .crams, and am always getting this error:

 [ERROR] [generateCandidateSV_0248] Error Message:
 [ERROR] [generateCandidateSV_0248] Last 20 stderr lines from task (of 2482 total lines):
 [ERROR] [2017-11-10T00:27:13.548825] [manta.c.seqr-project.internal] [7185_1] [generateCandidateSV_0248] Failed to populate reference for id 1692
 [ERROR] [2017-11-10T00:27:13.754959] [manta.c.seqr-project.internal] [7185_1] [generateCandidateSV_0248] Failed to populate reference for id 1693
...
 [ERROR] [2017-11-10T00:27:16.842511] [manta.c.seqr-project.internal] [7185_1] [generateCandidateSV_0248] ERROR:: Invalid base in comp_base.
 [ERROR] [2017-11-10T00:27:16.848535] [manta.c.seqr-project.internal] [7185_1] [generateCandidateSV_0248]                 invalid base (char): '='
 [ERROR] [2017-11-10T00:27:16.861463] [manta.c.seqr-project.internal] [7185_1] [generateCandidateSV_0248]                 invalid base (int): 61

Glancing at the source code, it looks like Manta only supports ACGTN bases

comp_base(char a)

?

When I grep through the GRCh38 fasta though, I find a few other bases like W, M, B, etc.:

~$  grep '[^ACGTN]' Homo_sapiens_assembly38.fasta

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
CACACACCAAACAMCCCACACAACACACACACACCACACCACACAAACACAAACACACCACATCATGAACACACACATCACACACACACCACACACCCCA
AGGTCAGGACTTGACCCAGGGRCACCTAGCAGAGAAACAGAGCCTGGGAGTGACCCCATAGCAGGACCCAGGGCCGTGCTCAACACCTCTGTGGGTAAGA
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
GATGTGTGTGTCTGATATGTGGTGCGGTGTGTAGTGTGGCATGTGGTGTGTTTGTATGTGGTATATATCGTGTAYGTGTACAGGTGTGTTTGTGTGTGTG
CCACACACACTGCAAACRCACACCCCACACATGCTGCAAACACACACCATACACACACATATTATACAAACACATCACACACCATACACACACCACACAC
GGAGTCCTCCACATCCTGGCCTCTACCCATGAGATGCCATTAGCATCCCCCAGCTGTGACAACTGGAAAYGTCTCAGACATTGCCAAATGTCCCCTGGAG
TGWGATGGYGGAGTGATGGTTATTGGGAYGGGGAATGGTGTGGTGTCTAAGGGTGGTAGTGATGGTGATGACAGAGGTGGTCATGATCATGGGAGGGCTG
GCTGATGACTCTGGTGGTGCTGTTGAAGGGGATCCTATTGATCATGKCAGTGCCATTGATCATTGTGCAGAGGCAGCAGTGGTAATGGAGGTGGTGATAG
CTGCCTGGTGATAACCCTGGTTCCATTCTGCTTTGTATTCACATTAGCCATACATGTGTCTGATTTTCTCATTCCACTTCTTATTWTGAACTTAATAAGG
ACCACACATATCACACACCACACATGTCACACATTACACATACACACCACACATCACACAGATACACCACACACAACACAGATGCACACAMCACACATCA
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
AAGGAGAAATTGACTCTACCTCTTGATAAGAGAAGTAGCATGAATATTCACAGARGGAAGGAATTGATGGTGGCCATATTTGGAGACTTTCTACCACAGT
ATGCATAGCACTTCTGTGAGCATCTTTCCTGCGCAGAAAATCACACCTTAAAATAAGTGGGCAYTGATTTCTCTTGCTCCCATCTGTGTTTATCTTAGTT
TTTTTATTTTWAAAAAAATGGTATGTAGGACTTAATTTACATTTTCTGAATATGTCTTTTGTGATATATATATATATTCAAGTTGTTTATATGGTTTAAT
TGGTAAACAGAGCTACACTCCATCTCAAAAAAAAAAAAAAAAAAATGCAGTGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGBGCGCATGTGCATGCTTT
ACACCTTGAAACTGACTGACCTCCCTGCTGCTGAACTTGACCTTTATCTTATYCCTCAACCTCACCTTAATCATCACTCTGACCCTGACCAGCACTCTGA
CCTGGGCAATGTGGCGAAACYCCATCTCTACAAAAATACAAAAAACTAGCCAGGTATGGTAACACACACCTGTAGTTCCAGCTACTCAGGAGTCTGAGGC
...

Either way, I'm not seeing '=' chars in the reference, and would appreciate any help with debugging this.

@ctsa
Copy link
Contributor

ctsa commented Nov 10, 2017

The error message should refer to an '=' base in one of the input reads. The ambiguous bases in the reference should not trigger an error (any non [ACGT] base in the reference is just converted to N).

Although '=' in the reads is in the BAM spec, we do not support it, in part because we've never encountered this in any real world workflow. This is the second report we have of this error occurring fromk CRAM input. Can you describe how this cram file was created? (ie. what was the mapper, and what software was used to convert SAM/BAM to CRAM). If the use of '=' in the reads is going to become a common side-effect of CRAM compression we should prioritize supporting this.

@bw2
Copy link
Author

bw2 commented Nov 10, 2017

I believe it's samtools, but I'm double checking.

@ctsa
Copy link
Contributor

ctsa commented Nov 13, 2017

Thanks, we haven't observed this yet but will try to reproduce with the latest samtools/htslib release.

@ctsa
Copy link
Contributor

ctsa commented Nov 17, 2017

Hi @bw2 - We haven't been able to reproduce the problem, if you can identify the mapper/cram compression tool that is producing this error we'll look further into it. -c

@ctsa ctsa closed this as completed Nov 17, 2017
@anthonycor
Copy link

I've also run into this same error.

[WorkflowRunner] [ERROR] [generateCandidateSV_0253] Last 8 stderr lines from task (of 8 total lines):
[WorkflowRunner] [ERROR] [2017-12-21T18:33:37.183198] [bhc0128.private] [6232_1] [generateCandidateSV_0253] Failed to populate reference for id 59
[WorkflowRunner] [ERROR] [2017-12-21T18:33:37.383449] [bhc0128.private] [6232_1] [generateCandidateSV_0253] Failed to populate reference for id 59
[WorkflowRunner] [ERROR] [2017-12-21T18:33:37.641829] [bhc0128.private] [6232_1] [generateCandidateSV_0253] Failed to populate reference for id 59
[WorkflowRunner] [ERROR] [2017-12-21T18:34:11.364342] [bhc0128.private] [6232_1] [generateCandidateSV_0253] Failed to populate reference for id 42
[WorkflowRunner] [ERROR] [2017-12-21T18:34:11.577000] [bhc0128.private] [6232_1] [generateCandidateSV_0253] Failed to populate reference for id 42
[WorkflowRunner] [ERROR] [2017-12-21T18:34:11.588452] [bhc0128.private] [6232_1] [generateCandidateSV_0253] ERROR:: Invalid base in comp_base.
[WorkflowRunner] [ERROR] [2017-12-21T18:34:11.592149] [bhc0128.private] [6232_1] [generateCandidateSV_0253]                 invalid base (char): '='
[WorkflowRunner] [ERROR] [2017-12-21T18:34:11.593081] [bhc0128.private] [6232_1] [generateCandidateSV_0253]                 invalid base (int): 61

The steps from FASTQ to CRAM that read/write bam or cram are from the Broad Institute's wgs-germline-snps-indels-PairedEndSingleSampleWorkflow.wdl:

bwa mem -> picard MergeBamAlignments -> picard MarkDuplicates -> picard SortSam -> gatk4 ApplyBQSR -> picard GatherBamFiles -> samtools

samtools command:

    samtools view -C -T ${ref_fasta} ${input_bam} | \
    tee ${output_basename}.cram | \
    md5sum | awk '{print $1}' > ${output_basename}.cram.md5

Running picard's ValidateSamFile in verbose mode (only ignoring MIISING_TAG_NM) returns no errors in the cram file produced by samtools.

All software was installed from bioconda:

bwa                       0.7.16
gatk4                     4.0b5
picard                    2.13
samtools                  1.3.1

binary release of manta 1.2.2 was downloaded from the release page on github.

System:
Linux bhc0128.private 3.10.0-693.el7.x86_64 x86_64 x86_64 x86_64 GNU/Linux

@ctsa
Copy link
Contributor

ctsa commented Jan 4, 2018

Thanks Anthony, Sorry just getting to this now. Any chance you can share this CRAM file?

@joonan30
Copy link

Also got this error.

[2018-01-24T09:24:07.141169] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [generateCandidateSV_0198] Error Message:
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [generateCandidateSV_0198] Last 20 stderr lines from task (of 1356 total lines):
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:24.303050] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 770
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:24.480527] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 771
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:24.663024] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 773
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:24.847834] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 774
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:25.156973] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 775
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:25.514701] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 776
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:25.705616] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 777
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:25.989233] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 779
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:26.163991] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 783
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:26.341956] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 784
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:26.506627] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 785
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:26.671895] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 786
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:26.845769] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 787
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:27.022411] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 788
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:27.188595] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 789
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:27.353499] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 790
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:27.523337] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] Failed to populate reference for id 791
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:27.524966] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198] ERROR:: Invalid base in comp_base.
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:27.526635] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198]                 invalid base (char): '='
[2018-01-24T09:24:07.145539] [ip-private.ec2.internal] [2957_1] [WorkflowRunner] [ERROR] [2018-01-24T09:01:27.528132] [ip-private.ec2.internal] [2957_1] [generateCandidateSV_0198]                 invalid base (int): 61

I have used

manta 1.2.2
samtools 1.6.1
Amazon Linux AMI 2017.09

@ctsa
Copy link
Contributor

ctsa commented Jan 24, 2018

Hi Joon, Thanks for reporting this. Can you share any data that would reproduce this issue?

Your error message helpfully points out that there is an internal htslib error proceeding this issue ("Failed to populate reference for id XXX"). We already have some improved exception messages and an update to htslib 1.6 in the next manta release, but I will check if we can do more to appropriately capture this particular error from htslib. In the meantime, being able to reproduce this issue from the alignment files would be the most effective way to create a reliable fix if you can share.

@ctsa
Copy link
Contributor

ctsa commented Jan 29, 2018

Thanks to additional details from @joonan30 and others on this thread, I have been able to confirm that the issue arises from an htslib behavior. This does not reflect the use of the '='/ANY symbol in the source BAM/CRAM file. I have described the issue for htslib here samtools/htslib#654 and it has been reproduced outside of manta there.

I will reopen this issue in recognition of the linked htslib issue, and close this once it is resolved in htslib and the update is forwarded into manta's htslib copy.

@ctsa ctsa reopened this Jan 29, 2018
@ctsa ctsa changed the title invalid base (char): '=' invalid base (char): '=' / erroneous "Failed to populate reference' condition when calling from CRAM files Jan 29, 2018
@joonan30
Copy link

joonan30 commented Feb 6, 2018

Hi Chris,

Thanks for checking this. I noticed the new release, which I guess .. not for this issue?
I was able to do quick run with the latest Manta version but the error still happened.

@ctsa
Copy link
Contributor

ctsa commented Feb 6, 2018

The fix is in process, but it is not in the Manta v1.3.0 release.

With your help I was able to generate htslib issue samtools/htslib#654 describing the error details in htslib, and the corresponding fixes to htslib just merged today (here: samtools/htslib#655). I'd like us to update manta's htslib from this merge point and issue a patch update to Manta, but it will take at least a few days to fully turn the crank on this process.

@ctsa
Copy link
Contributor

ctsa commented Feb 8, 2018

The fix for this issue is now in manta's development branch. A preview of the fix is pushed to github here:

https://github.com/Illumina/manta/tree/issue109fixPreview

It will be included in a patch update to v1.3.0, coming soon.

@ctsa
Copy link
Contributor

ctsa commented Feb 19, 2018

Released updated htslib with v1.3.1 today, this should close the CRAM stability issues described here.

@ctsa ctsa closed this as completed Feb 19, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants