replace_reads.py only writing out one read of each pair for paired end. #212

BrianOSullivanGit · 2022-08-22T20:04:43Z

I'm having an issue with release 1.4 where only one read of each pair seems to be written out by replace_reads.

In the test case below I have a simple setup with a subsetted bam etc.,

[bosullivan@lugh TEST_SMALL_TC3]$ ls -ltr
total 16
-rw-rw-r-- 1 bosullivan bosullivan 23 Aug 22 20:01 test.spike
-rw-rw-r-- 1 bosullivan bosullivan 5412 Aug 22 20:01 tc.bam
-rw-rw-r-- 1 bosullivan bosullivan 736 Aug 22 20:01 tc.bam.bai
[bosullivan@lugh TEST_SMALL_TC3]$ cat test.spike
chr1 964961 964961 0.5

I run bamsurgeon with,
python3 ../../bamsurgeon-1.4/bin/addsnv.py --procs 16 --aligner mem --picardjar /home/bosullivan/bin/picard-tools-1.131/picard.jar -v test.spike -f tc.bam -r /data/bosullivan/XXXXXXXXX/X2_HG00110.coding_exons_hg38.fa -o tc.spike.bam 2>&1 | tee run.log

It runs without error, noting,

[bosullivan@lugh TEST_SMALL_TC3]$ tail -2 tc.spike.addsnv.test.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 964961 . T G 100 PASS SOMATIC;VAF=0.4827586206896552;DPR=28.0 GT 0/1

However, if you look at the spike in bam, only one of the reads from each pair in the original pile-up are there, ie.,

samtools sort tc.spike.bam > tc.spike.sort.bam; samtools index tc.spike.sort.bam

[bosullivan@lugh TEST_SMALL_TC3]$ samtools view tc.spike.sort.bam -r "chr1:964961-964961"| cut -f1 | sort | uniq -c
1 T_X2_chr1-105658166
1 T_X2_chr1-114050428
1 T_X2_chr1-118104706
1 T_X2_chr1-119021410
1 T_X2_chr1-131492998
1 T_X2_chr1-134105122
1 T_X2_chr1-134498332
1 T_X2_chr1-137335478
1 T_X2_chr1-17341428
1 T_X2_chr1-25397966
1 T_X2_chr1-27489216
1 T_X2_chr1-33319224
1 T_X2_chr1-33609412
1 T_X2_chr1-35208886
1 T_X2_chr1-37384776
1 T_X2_chr1-4360962
1 T_X2_chr1-44825172
1 T_X2_chr1-45622772
1 T_X2_chr1-46771388
1 T_X2_chr1-47331784
1 T_X2_chr1-51102978
1 T_X2_chr1-63417112
1 T_X2_chr1-68746918
1 T_X2_chr1-70114492
1 T_X2_chr1-72960300
1 T_X2_chr1-80234100
1 T_X2_chr1-8309364
1 T_X2_chr1-88295864
1 T_X2_chr1-98425500

Whereas the original, pre-spike-in bam is,
[bosullivan@lugh TEST_SMALL_TC3]$ samtools view tc.bam -r "chr1:964961-964961"| cut -f1 | sort | uniq -c
2 T_X2_chr1-105658166
2 T_X2_chr1-114050428
2 T_X2_chr1-118104706
2 T_X2_chr1-119021410
2 T_X2_chr1-131492998
2 T_X2_chr1-134105122
2 T_X2_chr1-134498332
2 T_X2_chr1-137335478
2 T_X2_chr1-17341428
2 T_X2_chr1-25397966
2 T_X2_chr1-27489216
2 T_X2_chr1-33319224
2 T_X2_chr1-33609412
2 T_X2_chr1-35208886
2 T_X2_chr1-37384776
2 T_X2_chr1-4360962
2 T_X2_chr1-44825172
2 T_X2_chr1-45622772
2 T_X2_chr1-46771388
2 T_X2_chr1-47331784
2 T_X2_chr1-51102978
2 T_X2_chr1-63417112
2 T_X2_chr1-68746918
2 T_X2_chr1-70114492
2 T_X2_chr1-72960300
2 T_X2_chr1-80234100
2 T_X2_chr1-8309364
2 T_X2_chr1-88295864
2 T_X2_chr1-98425500

The pileup is not as expected given the addsnv.test.vcf either,

[bosullivan@lugh TEST_SMALL_TC3]$ samtools mpileup -f /data/bosullivan/XXXXXXXXX/X2_HG00110.coding_exons_hg38.fa tc.spike.sort.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 T 2 g^], G@

While the pre-spikein bam was,

[bosullivan@lugh TEST_SMALL_TC3]$ samtools mpileup -f /data/bosullivan/XXXXXXXXX/X2_HG00110.coding_exons_hg38.fa tc.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 T 29 ............,..............^,^, JBFDFJJJIJIDJCJE3IHJJIHDFFFD@

Just wondering if you would have any ideas whats happening? I ran addsnv.py in this way previously for release 1.3 without seeing this issue. Unfortunately I do not have access to a python2.7 system to double check it at the moment.

Any help very much appreciated as I am at a complete loss. This is a simple test case however I am having issues with all snvs I try to spike in in this way. I've attached the output of the addsnv.py to this fyi.
out.log

Thanks & regards, Brian

adamewing · 2022-08-22T23:15:51Z

Apologies, there is a regression in 1.4 that causes read tracking information to be lost (ed35087#r81884432), will update when fixed.

Rapsssito · 2022-08-22T23:17:16Z

@BrianOSullivanGit, apologies again. I introduced this issue. We'll work on it and release a fix ASAP.

BrianOSullivanGit · 2022-08-23T08:00:48Z

No worries, fair enough. I'll get 1.3 going again and wait for the update before going back to 1.4. Thanks for letting me know.

BrianOSullivanGit · 2022-08-23T09:57:47Z

Hi,
Just to let you know, I applied the above patch (fix212). It does not resolve the issue.
The spiked-in output does not contain read pairs, but instead read duplicates.

ie.,

After patch applied, ie.,
[bosullivan@lugh bin]$ diff bamsurgeon/replace_reads.py bamsurgeon/replace_reads.py.old
159a160,161

    #check if this read has been processed already. If so, skip to the next read
    if extqname in used: continue

Pre-spike-in read pairs,

$ samtools view tc.bam -r "chr1:964961-964961"| sort -k1,1 |cut -f1-4|more
T_X2_chr1-105658166 163 chr1 964844
T_X2_chr1-105658166 83 chr1 964925
T_X2_chr1-114050428 163 chr1 964929
T_X2_chr1-114050428 83 chr1 965027
T_X2_chr1-118104706 147 chr1 965003
T_X2_chr1-118104706 99 chr1 964908
T_X2_chr1-119021410 163 chr1 964938
:
etc..

Post spike-in reads,

$ samtools view tc.spike.sort.bam -r "chr1:964961-964961"| sort -k1,1 |cut -f1-4|more

T_X2_chr1-105658166 83 chr1 964925
T_X2_chr1-105658166 83 chr1 964925
T_X2_chr1-114050428 83 chr1 965027
T_X2_chr1-114050428 83 chr1 965027
T_X2_chr1-118104706 147 chr1 965003
T_X2_chr1-118104706 147 chr1 965003
T_X2_chr1-119021410 83 chr1 965036
T_X2_chr1-119021410 83 chr1 965036
T_X2_chr1-131492998 83 chr1 964989
T_X2_chr1-131492998 83 chr1 964989
T_X2_chr1-134105122 83 chr1 965030
T_X2_chr1-134105122 83 chr1 965030
T_X2_chr1-134498332 83 chr1 965043
T_X2_chr1-134498332 83 chr1 965043
:
...duplicates introduced etc...

And there is nothing spiked in into the output etc.
Pre-spike-in
[bosullivan@lugh TEST_SMALL_TC2]$ samtools mpileup tc.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 N 29 TTTTTTTTTTTTtTTTTTTTTTTTTTT^~t^~t JBFDFJJJIJIDJCJE3IHJJIHDFFFD@

Post spike-in
[bosullivan@lugh TEST_SMALL_TC2]$ samtools mpileup tc.spike.sort.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 N 4 tt^]t^]t GJ@D
[bosullivan@lugh TEST_SMALL_TC2]$

Let me know if you need a simple test case to debug the issue. I think it should show up in all paired-end data however.

Thanks, Brian

Rapsssito · 2022-08-23T10:11:07Z

@BrianOSullivanGit, could you try the new updated version of the fix?

BrianOSullivanGit · 2022-08-23T11:34:40Z

Thanks for updating it. The last version of your fix appears to work. I am just running a larger test on a full bam now. Will let you know how it goes.

adamewing · 2022-08-23T13:52:03Z

Didn't mean to close, will leave open pending @BrianOSullivanGit to come back.

BrianOSullivanGit · 2022-08-24T11:52:28Z

Thanks guys, it seems to be working fine again now. Appreciate your help. Closing. Best regards, Brian

adamewing referenced this issue Aug 22, 2022

feat: Add support for CRAM output

ed35087

Rapsssito mentioned this issue Aug 23, 2022

fix: Fix replace_reads.py only writing out one read of each pair #213

Merged

adamewing closed this as completed in #213 Aug 23, 2022

adamewing reopened this Aug 23, 2022

BrianOSullivanGit closed this as completed Aug 24, 2022

adamewing mentioned this issue Aug 24, 2022

ValueError: quality and sequence mismatch: 98 != 97 #204

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

replace_reads.py only writing out one read of each pair for paired end. #212

replace_reads.py only writing out one read of each pair for paired end. #212

BrianOSullivanGit commented Aug 22, 2022

adamewing commented Aug 22, 2022

Rapsssito commented Aug 22, 2022

BrianOSullivanGit commented Aug 23, 2022

BrianOSullivanGit commented Aug 23, 2022

Rapsssito commented Aug 23, 2022

BrianOSullivanGit commented Aug 23, 2022

adamewing commented Aug 23, 2022

BrianOSullivanGit commented Aug 24, 2022

replace_reads.py only writing out one read of each pair for paired end. #212

replace_reads.py only writing out one read of each pair for paired end. #212

Comments

BrianOSullivanGit commented Aug 22, 2022

adamewing commented Aug 22, 2022

Rapsssito commented Aug 22, 2022

BrianOSullivanGit commented Aug 23, 2022

BrianOSullivanGit commented Aug 23, 2022

Rapsssito commented Aug 23, 2022

BrianOSullivanGit commented Aug 23, 2022

adamewing commented Aug 23, 2022

BrianOSullivanGit commented Aug 24, 2022