Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem when running Longshot with too few reads/too small coverage using -A #72

Open
Coppini opened this issue Aug 23, 2021 · 2 comments

Comments

@Coppini
Copy link
Contributor

Coppini commented Aug 23, 2021

I'm having problems with Longshot when running the Artic pipeline (even after updating Longshot to v0.4.3, since Artic still uses v0.4.1), in this command:

longshot \
    -P 0 \
    -F -A \
    --no_haps \
    --bam barcode91.primertrimmed.rg.sorted.bam \
    --ref primer_schemes/SARS-CoV-2/V1200/SARS-CoV-2.reference.fasta \
    --out barcode91.merged.vcf \
    --potential_variants barcode91.merged.vcf.gz

The BAM file I'm using there has only one read, with few bases, resulting in a very small coverage. This was expected, since these were negative samples, where we shouldn't be finding (Sars-Cov-2) reads.
Longshot ends up erroring with the following messages:

2021-08-23 10:03:32 Automatically determining max read coverage.
2021-08-23 10:03:32 Estimating mean read coverage...
2021-08-23 10:03:32 WARNING: Max coverage calculation is highly likely to be incorrect. The number of reference bases covered by the bam file (857) differs significantly from the expected number of positions in the reference (29903). If you are using a bam file that only covers part of the genome, please specify this region exactly with the --region argument so the number of reference bases is known. Alternatively, disable maximum coverage filtering by setting -C to a large number.
2021-08-23 10:03:32 Total reference positions: 29903
2021-08-23 10:03:32 Total bases in bam: 857
2021-08-23 10:03:32 Mean read coverage: 0.03
error: {} ERROR: Max read coverage set to 0. printing empty VCF file

It seems the problem comes from the -A flag:

-A, --auto_max_cov        Automatically calculate mean coverage for region and set max coverage to mean_coverage +
                              5*sqrt(mean_coverage). (SLOWER)

Since Mean read coverage = 0.03, it calculates Max coverage = 0.03+5*sqrt(0.03) = 0.89602540378, and rounds down to 0.

Would it be possible to have -A round the maximum coverage up, instead of down?

@vibansal
Copy link
Collaborator

Longshot uses the input reads to estimate sequencing error rates so it will not work well for files with very few reads. Can you not use the -A flag?

@Coppini
Copy link
Contributor Author

Coppini commented Aug 30, 2021

Yes, I believe that would be better as well. I'm using the existing Artic pipeline as mentioned, but already opened an issue with them concerning this problem. artic-network/fieldbioinformatics#91

Thank you either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants