Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Longshot on Iso-seq data #106

Open
camelest opened this issue Aug 28, 2024 · 5 comments
Open

Longshot on Iso-seq data #106

camelest opened this issue Aug 28, 2024 · 5 comments

Comments

@camelest
Copy link

Hi, thank you so much for the great tool.

I'm trying to apply longshot on long-read RNA-seq data (Iso-seq data from PacBio HiFi reads).
I ran the pipeline with default options and got ~30% reads assigned with HP tags (Is is true to assume HP:i:1 represents reads with REF SNPs and HP:i:2 with ALT SNPs?)
Do you have any recommendations to recover more reads?

I was wondering whether I should loosen

-I, --max_cigar_indel <int>                Throw away a read-variant during allelotyping if there is a CIGAR indel
                                               (I/D/N) longer than this amount in its window. [default: 20]

since the RNA-seq reads CIGAR skips large introns.

Thank you so much for your kind help.

@vibansal
Copy link
Collaborator

vibansal commented Sep 5, 2024

One way to avoid excluding these reads would be to modify the bam/cram file to split the reads containing long 'N' cigar operations into multiple reads. See for example:

https://gatk.broadinstitute.org/hc/en-us/articles/360036858811-SplitNCigarReads

@camelest
Copy link
Author

camelest commented Oct 1, 2024

@vibansal Thank you so much for your reply. Sorry for my late reply, I somehow did not receive the notice.

I tried with SplitNCigarReads. For some regions, even though the reads look pretty nice as a SNP, it got filtered out at the final step. (included in 4.0.final_genotypes.vcf but not in the final result) Does it have something to do with other settings?

The output in 4.0.final_genotypes.vcf is

chrX    103482802       .       T       G       1       dp      DP=9850;AM=0;MC=0;MF=0.000;MB=0.000;AQ=25.62;GM=0;PH=0.00,37.78,37.78,37.78;SC=None;    GT:GQ:DP:AD:PS:UG:UQ    0/0:34:9850:5355,4495:.:0/1:127.63

@vibansal
Copy link
Collaborator

Sorry for the late reply. The variant is filtered since it has very high depth (9850). The "-max_cov" option can be used to increase the threshold.

@camelest
Copy link
Author

camelest commented Oct 20, 2024

Thank you so much for the reply. I tried with --max_cov 100000 and now the dp changed to PASS but I still don't find it in the final output.
Does the final quality filtering occur at this step?

(in 4.0.final_genotypes.vcf)

chrX    103482802       .       T       G       1       PASS    DP=9850;AM=0;MC=0;MF=0.000;MB=0.000;AQ=25.62;GM=0;PH=0.00,37.78,37.78,37.78;SC=None;    GT:GQ:DP:AD:PS:UG:UQ    0/0:34:9850:5355,4495:.:0/1:127.63

@vibansal
Copy link
Collaborator

vibansal commented Nov 7, 2024

The genotype of the variant is 0/0 and the final output contains only variants with non-ref genotypes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@vibansal @camelest and others