elPrep v5.0.0 release
Binaries can be downloaded here:
https://www.imec-int.com/en/expertise/lifesciences/genomics/dna-sequence-analysis-software
The major new feature of elPrep 5 is the addition of variant calling, which means that elPrep can now do a full variant calling pipeline on its own, starting from an aligned BAM file, and producing a VCF file. We follow the haplotype caller algorithm.
There are a number of additional improvements and changes, some, but not all, of which are related to variant calling.
Functionality
- The new --haplotypecaller option enables variant calling.
Tool changes
- The previous --bqsr-reference option has been renamed to --reference because it is also used for the Haplotypecaller.
- Different tools implement different semantics for -L options, which filter reads based on genomic regions. The existing --remove-non-overlapping-reads option implements different semantics from the newly added --target-regions option, and the difference is especially relevant for the Haplotypecaller. If you use --remove-non-overlapping-reads, reads outside the regions of the given BED file are removed, but variant calling is not restricted to those regions, which may lead to surprisingly large VCF files. To restrict variant calling to those regions, use --target-regions instead. A peculiar corner case occurs when you use base quality score recalibration and --target-regions in the same pipeline: the reads outside the BED regions are then effectively removed as well (just a bit later in the pipeline than with --remove-non-overlapping-reads). There are other peculiar effects. For example, the --target-regions option does not restrict the variant caller exactly to the BED regions, but adds some padding around them, and so effectively processes reads outside these regions as well. We have carefully covered these corner cases to ensure that elPrep’s results match these semantics.
- Comparing reads by coordinate order is now more fine-grained.
- We have removed the original filter command, which was already deprecated and existed only for compatibility with very old versions of elPrep. This should not matter for the majority of end users.
- We have dropped the undocumented --deterministic and --mark-duplicates-deterministic options. Marking duplicates is now always deterministic. The --deterministic option has been replaced with compile-time options, which are rarely of interest to end users.
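
To make the padding effect described for --target-regions more concrete, here is a minimal Go sketch of padded interval overlap. It is an illustration only, not elPrep's implementation: the Region type, the helper names, and the padding value of 100 bases are all assumptions chosen for the example.

```go
package main

import "fmt"

// Region is a half-open interval [Start, End) on one contig, as in BED.
// Hypothetical type for illustration; not taken from the elPrep sources.
type Region struct {
	Contig     string
	Start, End int
}

// overlaps reports whether two regions on the same contig share at least one base.
func overlaps(a, b Region) bool {
	return a.Contig == b.Contig && a.Start < b.End && b.Start < a.End
}

// pad widens a region by n bases on both sides, clamping the start at 0.
func pad(r Region, n int) Region {
	start := r.Start - n
	if start < 0 {
		start = 0
	}
	return Region{Contig: r.Contig, Start: start, End: r.End + n}
}

// keepRead reports whether a read overlaps any target region after padding.
func keepRead(read Region, targets []Region, padding int) bool {
	for _, t := range targets {
		if overlaps(read, pad(t, padding)) {
			return true
		}
	}
	return false
}

func main() {
	targets := []Region{{Contig: "chr1", Start: 1000, End: 2000}}
	read := Region{Contig: "chr1", Start: 2050, End: 2200} // just outside the BED region

	fmt.Println(keepRead(read, targets, 0))   // false: strict overlap drops the read
	fmt.Println(keepRead(read, targets, 100)) // true: assumed padding of 100 keeps it
}
```

With zero padding the read is dropped, while with padding the same read is still processed, which is why reads slightly outside the BED regions can influence the output.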
File handling and formats
- We have improved VCF parsing and formatting to be in line with Haplotypecaller requirements. For example, the GT field is now explicitly supported, among other things.
- We do not require the presence of bcftools for parsing or formatting .vcf.gz files anymore, but now handle them completely ourselves. As a downside, the BCF format is not supported anymore in elPrep 5. If you need BCF support, consider converting between BCF and VCF separately, for example with bcftools.
- Whether input files are gzip-compressed (for example, BAM or .vcf.gz files) is no longer determined by file extensions, but by looking at the actual contents of the input files. This makes elPrep more robust against non-standard file extensions and, for example, also allows BAM files to be read from Unix pipes.
- We now support an --output-type parameter to select SAM or BAM format for output. This is useful, for example, if you want to send BAM files to Unix pipes.
- When parsing BED files, we do not process comment, track, or browser lines anymore, but simply ignore them now.
API
- We have dropped Go-style error handling with error return codes in most places of the elPrep source code, in favor of exception-style error handling. This makes no difference for end users; the change is primarily meant to make life easier for developers.
- We dropped support for concurrent access to ELFASTA files in favor of exclusively using memory-mapped access.
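
For developers unfamiliar with the distinction, the difference between the two error handling styles can be sketched in Go roughly as follows. This assumes that exception-style handling is expressed with panic and recover, which is the usual way to do it in Go; the function names are hypothetical and do not appear in the elPrep API.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePos panics on malformed input instead of returning an error,
// so callers need no per-call error checks. Hypothetical helper.
func parsePos(field string) int {
	pos, err := strconv.Atoi(field)
	if err != nil {
		panic(fmt.Errorf("invalid position %q: %w", field, err))
	}
	return pos
}

// sumPositions sits at an API boundary: it recovers panics raised further
// down the call chain and converts them back into an ordinary error value.
func sumPositions(fields []string) (total int, err error) {
	defer func() {
		if r := recover(); r != nil {
			e, ok := r.(error)
			if !ok {
				panic(r) // not one of ours: re-raise
			}
			total, err = 0, e
		}
	}()
	for _, f := range fields {
		total += parsePos(f)
	}
	return total, nil
}

func main() {
	fmt.Println(sumPositions([]string{"10", "20"})) // 30 <nil>

	_, err := sumPositions([]string{"10", "x"})
	fmt.Println(err != nil) // true
}
```

The inner code stays free of repeated error checks, while a single recover at the boundary keeps panics from escaping to callers.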
Performance
- Various other bug fixes and improvements.