Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace_reads.py only writing out one read of each pair for paired end. #212

Closed
BrianOSullivanGit opened this issue Aug 22, 2022 · 8 comments · Fixed by #213
Closed

replace_reads.py only writing out one read of each pair for paired end. #212

BrianOSullivanGit opened this issue Aug 22, 2022 · 8 comments · Fixed by #213

Comments

@BrianOSullivanGit
Copy link

I'm having an issue with release 1.4 where only one read of each pair seems to be written out by replace_reads.

In the test case below I have a simple setup with a subsetted bam etc.,

[bosullivan@lugh TEST_SMALL_TC3]$ ls -ltr
total 16
-rw-rw-r-- 1 bosullivan bosullivan 23 Aug 22 20:01 test.spike
-rw-rw-r-- 1 bosullivan bosullivan 5412 Aug 22 20:01 tc.bam
-rw-rw-r-- 1 bosullivan bosullivan 736 Aug 22 20:01 tc.bam.bai
[bosullivan@lugh TEST_SMALL_TC3]$ cat test.spike
chr1 964961 964961 0.5

I run bamsurgeon with,
python3 ../../bamsurgeon-1.4/bin/addsnv.py --procs 16 --aligner mem --picardjar /home/bosullivan/bin/picard-tools-1.131/picard.jar -v test.spike -f tc.bam -r /data/bosullivan/XXXXXXXXX/X2_HG00110.coding_exons_hg38.fa -o tc.spike.bam 2>&1 | tee run.log

It runs without error, noting,

[bosullivan@lugh TEST_SMALL_TC3]$ tail -2 tc.spike.addsnv.test.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 964961 . T G 100 PASS SOMATIC;VAF=0.4827586206896552;DPR=28.0 GT 0/1

However, if you look at the spike in bam, only one of the reads from each pair in the original pile-up are there, ie.,

samtools sort tc.spike.bam > tc.spike.sort.bam; samtools index tc.spike.sort.bam

[bosullivan@lugh TEST_SMALL_TC3]$ samtools view tc.spike.sort.bam -r "chr1:964961-964961"| cut -f1 | sort | uniq -c
1 T_X2_chr1-105658166
1 T_X2_chr1-114050428
1 T_X2_chr1-118104706
1 T_X2_chr1-119021410
1 T_X2_chr1-131492998
1 T_X2_chr1-134105122
1 T_X2_chr1-134498332
1 T_X2_chr1-137335478
1 T_X2_chr1-17341428
1 T_X2_chr1-25397966
1 T_X2_chr1-27489216
1 T_X2_chr1-33319224
1 T_X2_chr1-33609412
1 T_X2_chr1-35208886
1 T_X2_chr1-37384776
1 T_X2_chr1-4360962
1 T_X2_chr1-44825172
1 T_X2_chr1-45622772
1 T_X2_chr1-46771388
1 T_X2_chr1-47331784
1 T_X2_chr1-51102978
1 T_X2_chr1-63417112
1 T_X2_chr1-68746918
1 T_X2_chr1-70114492
1 T_X2_chr1-72960300
1 T_X2_chr1-80234100
1 T_X2_chr1-8309364
1 T_X2_chr1-88295864
1 T_X2_chr1-98425500

Whereas the original, pre-spike-in bam is,
[bosullivan@lugh TEST_SMALL_TC3]$ samtools view tc.bam -r "chr1:964961-964961"| cut -f1 | sort | uniq -c
2 T_X2_chr1-105658166
2 T_X2_chr1-114050428
2 T_X2_chr1-118104706
2 T_X2_chr1-119021410
2 T_X2_chr1-131492998
2 T_X2_chr1-134105122
2 T_X2_chr1-134498332
2 T_X2_chr1-137335478
2 T_X2_chr1-17341428
2 T_X2_chr1-25397966
2 T_X2_chr1-27489216
2 T_X2_chr1-33319224
2 T_X2_chr1-33609412
2 T_X2_chr1-35208886
2 T_X2_chr1-37384776
2 T_X2_chr1-4360962
2 T_X2_chr1-44825172
2 T_X2_chr1-45622772
2 T_X2_chr1-46771388
2 T_X2_chr1-47331784
2 T_X2_chr1-51102978
2 T_X2_chr1-63417112
2 T_X2_chr1-68746918
2 T_X2_chr1-70114492
2 T_X2_chr1-72960300
2 T_X2_chr1-80234100
2 T_X2_chr1-8309364
2 T_X2_chr1-88295864
2 T_X2_chr1-98425500

The pileup is not as expected given the addsnv.test.vcf either,

[bosullivan@lugh TEST_SMALL_TC3]$ samtools mpileup -f /data/bosullivan/XXXXXXXXX/X2_HG00110.coding_exons_hg38.fa tc.spike.sort.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 T 2 g^], G@

While the pre-spikein bam was,

[bosullivan@lugh TEST_SMALL_TC3]$ samtools mpileup -f /data/bosullivan/XXXXXXXXX/X2_HG00110.coding_exons_hg38.fa tc.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 T 29 ............,..............^,^, JBFDFJJJIJIDJCJE3IHJJIHDFFFD@

Just wondering if you would have any ideas whats happening? I ran addsnv.py in this way previously for release 1.3 without seeing this issue. Unfortunately I do not have access to a python2.7 system to double check it at the moment.

Any help very much appreciated as I am at a complete loss. This is a simple test case however I am having issues with all snvs I try to spike in in this way. I've attached the output of the addsnv.py to this fyi.
out.log

Thanks & regards, Brian

@adamewing
Copy link
Owner

Apologies, there is a regression in 1.4 that causes read tracking information to be lost (ed35087#r81884432), will update when fixed.

@Rapsssito
Copy link
Collaborator

@BrianOSullivanGit, apologies again. I introduced this issue. We'll work on it and release a fix ASAP.

@BrianOSullivanGit
Copy link
Author

No worries, fair enough. I'll get 1.3 going again and wait for the update before going back to 1.4. Thanks for letting me know.

@BrianOSullivanGit
Copy link
Author

Hi,
Just to let you know, I applied the above patch (fix212). It does not resolve the issue.
The spiked-in output does not contain read pairs, but instead read duplicates.

ie.,

After patch applied, ie.,
[bosullivan@lugh bin]$ diff bamsurgeon/replace_reads.py bamsurgeon/replace_reads.py.old
159a160,161

    #check if this read has been processed already. If so, skip to the next read
    if extqname in used: continue

Pre-spike-in read pairs,

$ samtools view tc.bam -r "chr1:964961-964961"| sort -k1,1 |cut -f1-4|more
T_X2_chr1-105658166 163 chr1 964844
T_X2_chr1-105658166 83 chr1 964925
T_X2_chr1-114050428 163 chr1 964929
T_X2_chr1-114050428 83 chr1 965027
T_X2_chr1-118104706 147 chr1 965003
T_X2_chr1-118104706 99 chr1 964908
T_X2_chr1-119021410 163 chr1 964938
:
etc..

Post spike-in reads,

$ samtools view tc.spike.sort.bam -r "chr1:964961-964961"| sort -k1,1 |cut -f1-4|more

T_X2_chr1-105658166 83 chr1 964925
T_X2_chr1-105658166 83 chr1 964925
T_X2_chr1-114050428 83 chr1 965027
T_X2_chr1-114050428 83 chr1 965027
T_X2_chr1-118104706 147 chr1 965003
T_X2_chr1-118104706 147 chr1 965003
T_X2_chr1-119021410 83 chr1 965036
T_X2_chr1-119021410 83 chr1 965036
T_X2_chr1-131492998 83 chr1 964989
T_X2_chr1-131492998 83 chr1 964989
T_X2_chr1-134105122 83 chr1 965030
T_X2_chr1-134105122 83 chr1 965030
T_X2_chr1-134498332 83 chr1 965043
T_X2_chr1-134498332 83 chr1 965043
:
...duplicates introduced etc...

And there is nothing spiked in into the output etc.
Pre-spike-in
[bosullivan@lugh TEST_SMALL_TC2]$ samtools mpileup tc.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 N 29 TTTTTTTTTTTTtTTTTTTTTTTTTTT^~t^~t JBFDFJJJIJIDJCJE3IHJJIHDFFFD@

Post spike-in
[bosullivan@lugh TEST_SMALL_TC2]$ samtools mpileup tc.spike.sort.bam -r "chr1:964961-964961"
[mpileup] 1 samples in 1 input files
chr1 964961 N 4 tt^]t^]t GJ@D
[bosullivan@lugh TEST_SMALL_TC2]$

Let me know if you need a simple test case to debug the issue. I think it should show up in all paired-end data however.

Thanks, Brian

@Rapsssito
Copy link
Collaborator

@BrianOSullivanGit, could you try the new updated version of the fix?

@BrianOSullivanGit
Copy link
Author

Thanks for updating it. The last version of your fix appears to work. I am just running a larger test on a full bam now. Will let you know how it goes.

@adamewing
Copy link
Owner

Didn't mean to close, will leave open pending @BrianOSullivanGit to come back.

@BrianOSullivanGit
Copy link
Author

Thanks guys, it seems to be working fine again now. Appreciate your help. Closing. Best regards, Brian

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants