Python IO libraries #671

hammer · 2021-09-14T14:04:49Z

hammer
Sep 14, 2021
Maintainer

Let's use this topic to collect useful libraries for reading and writing data stored in files with formats commonly used by the statistical genetics community.

PLINK 1.9 File format reference
PLINK 2.0 File format reference
Hail Export/Import
Glow Read and Write VCF, Plink, and BGEN with Spark
@eczech's Toolkit Comparison: cf. Import and Export rows

hammer · 2021-09-14T14:05:02Z

hammer
Sep 14, 2021
Maintainer Author

PySnpTools

GitHub
Docs

Efficiently read genetic PLINK formats including *.bed/bim/fam files. Also, efficiently read parts of files, read kernel data, and standardize data. New features include on-the-fly SNP generation, larger in-memory data, and cluster-ready BED data.

This library was initially authored by Microsoft Research and now appears to be maintained by Carl Kadie (@CarlKCarlK on GitHub), a former Microsoft Research employee who is now retired, according to his LinkedIn profile.

hammer · 2021-09-14T14:05:12Z

hammer
Sep 14, 2021
Maintainer Author

Pysam

GitHub
Docs

Pysam is a Python module for reading and manipulating SAM/BAM/VCF/BCF files. It's a lightweight wrapper of the htslib C-API, the same one that powers samtools, bcftools, and tabix.

Over a decade old! Andreas Heger (@AndreasHeger on GitHub) seems to be the original author and is still active in the repo. According to his LinkedIn profile, he's now at Genomics plc. John Marshall (@jmarshall) is also active and from his LinkedIn profile he appears to work in cancer genomics in Glasgow.

The workhorse module in this library appears to be libcbcf.pyx, a Cython wrapper for htslib.

hammer · 2021-09-14T14:05:20Z

hammer
Sep 14, 2021
Maintainer Author

bgen-reader

GitHub
Docs

A bgen file format reader.

Bgen is a file format for storing large genetic datasets. It supports both unphased genotypes and phased haplotype data with variable ploidy and number of alleles. It was designed to provide a compact data representation without sacrificing variant access performance.

This python package is a wrapper around the bgen library, a low-memory footprint reader that efficiently reads bgen files. It fully supports the bgen format specifications: 1.2 and 1.3; as well as their optional compressed formats.

A library from Oliver Stegle's group at the EMBL Heidelberg. Danilo Horta (@horta on GitHub) is the primary developer of both the C library and the Python wrapper.

hammer · 2021-09-14T14:05:30Z

hammer
Sep 14, 2021
Maintainer Author

(Post by @tomwhite)

BED (PySnpTools)

From the point of view of splitting (being able to read a file in parallel, by reading a chunk without having to read all the bytes in the file up to the start of a chunk), BED files are simple to read.

BED files are not compressed, and when stored in "SNP-major" form (the default) all the samples for each SNP are stored together, 4 genotypes per byte, aligned on byte boundaries. So if there are M samples, then each SNP (row) is encoded in exactly ceil(M/4) bytes. This makes it trivial to seek to any SNP in the file. In particular, we can read Dask chunks efficiently for the genotype counts/calls data representation.
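The seek arithmetic above can be sketched in plain Python. This is a minimal illustration, not how PySnpTools does it: the function names are hypothetical, the file is assumed to be SNP-major, and the 3-byte offset accounts for the PLINK 1 header (a two-byte magic number plus a mode byte), which the paragraph above does not mention. The function returns the raw 2-bit codes, lowest-order bits first, and leaves their interpretation to the caller.

```python
import math

# Assumption: PLINK 1 .bed files start with a 3-byte header
# (two magic bytes plus one mode byte) before the genotype data.
BED_HEADER_BYTES = 3


def snp_byte_range(snp_index, n_samples):
    """Return (start_offset, length) of one SNP's block in a SNP-major file."""
    bytes_per_snp = math.ceil(n_samples / 4)  # 4 genotypes per byte
    start = BED_HEADER_BYTES + snp_index * bytes_per_snp
    return start, bytes_per_snp


def read_snp_codes(path, snp_index, n_samples):
    """Seek straight to one SNP and return its raw 2-bit codes per sample."""
    start, length = snp_byte_range(snp_index, n_samples)
    with open(path, "rb") as f:
        f.seek(start)
        block = f.read(length)
    codes = []
    for byte in block:
        # Each byte packs four genotypes, lowest-order bit pair first.
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    # Drop the padding entries in the final byte.
    return codes[:n_samples]
```

Because every SNP occupies the same number of bytes, a contiguous range of SNPs maps to a contiguous byte range, which is what makes a chunked (for example Dask) reader straightforward.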

In terms of testing on real-world PLINK files, a "minitest" for basic reading and writing was added recently, which might be a good place to contribute to. Also, some generated (synthetic) tests were added at the end of last year.

BGEN (bgen-reader)

For splitting: BGEN is effectively "SNP-major", in that each variant is stored in turn in a (possibly compressed) data block. The offsets of these data blocks are not stored in one place in the file, so without an index it is not possible to easily split the file into chunks. The bgenix tool defines a BGEN index format that is stored as a sqlite file. (This index format is used by Glow, but not by Hail, which uses its own index format.)
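As a rough sketch of how such an index could drive splitting, the function below groups variants into contiguous partitions and returns the byte range of each. It assumes the sqlite index exposes a table named Variant with columns file_start_position and size_in_bytes, which is my understanding of the bgenix layout; check this against a real index before relying on it. The function name and the partitioning policy are hypothetical and are not taken from Glow, Hail, or bgen-reader.

```python
import sqlite3


def partition_byte_ranges(index_path, n_partitions):
    """Return a list of (start_offset, end_offset, n_variants) per partition.

    Assumes a table `Variant` with `file_start_position` and `size_in_bytes`
    columns (an assumption about the bgenix index schema).
    """
    con = sqlite3.connect(index_path)
    try:
        rows = con.execute(
            "SELECT file_start_position, size_in_bytes FROM Variant "
            "ORDER BY file_start_position"
        ).fetchall()
    finally:
        con.close()
    if not rows:
        return []
    per_part = -(-len(rows) // n_partitions)  # ceiling division
    ranges = []
    for i in range(0, len(rows), per_part):
        chunk = rows[i:i + per_part]
        start = chunk[0][0]
        # The partition ends where its last variant's data block ends.
        end = chunk[-1][0] + chunk[-1][1]
        ranges.append((start, end, len(chunk)))
    return ranges
```

Each returned range can then be handed to an independent reader task, since the data blocks inside it are contiguous in the file.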

The bgen library doesn’t use bgenix, but instead creates a “metafile” the first time the file is loaded, which divides the BGEN file into a set number of partitions, which can later be read independently. The bgen-reader-py library returns a dask dataframe, using dask delayed to load partitions. There would be some work to change this to load xarray instead.

For testing, bgen-reader-py stores example data on S3, which would make it feasible to contribute test samples at some point.

VCF (Pysam)

For splitting: since they are plain text, uncompressed VCF files can be trivially split on newline boundaries. However, VCF files are usually compressed using bgzip, and a bgzipped file can be split either by using heuristics or by using an index file, such as the .gzi index. I would recommend creating an index (a one-off task), since it uses existing tools and avoids introducing another source of hard-to-debug bugs. I'm not sure how easy it would be to use Pysam to read a bgzipped VCF in chunks to load into xarray/dask (i.e. I'm not sure if it's possible out-of-the-box, or if Pysam would need any modification).
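For the uncompressed case only, splitting on newline boundaries can be sketched as follows. This is an illustrative helper with a hypothetical name, not part of Pysam: a worker given the byte range [start, end) reads exactly the lines whose first byte falls in that range, so adjacent ranges never duplicate or drop a line. It does not handle bgzip, and header lines (those starting with "#") would still need to be filtered out or shared with every worker.

```python
def read_split_lines(path, start, end):
    """Return the lines whose first byte lies in the byte range [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and consume to the next newline, so that we
            # land on the first line that begins at or after `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line)
    return lines
```

A line that straddles the end of a range is read in full by the worker that owns its first byte, and the next worker skips over it.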

From a testing point of view, Pysam uses htslib, which is very popular and well-tested in general, although .gzi is fairly niche so might need some attention.

hammer · 2021-09-14T14:05:40Z

hammer
Sep 14, 2021
Maintainer Author

PyBGEN

GitHub
Docs
License: MIT

Louis-Philippe Lemieux Perreault (@lemieuxl) is the sole developer, working at the Beaulieu-Saucier Université de Montréal Pharmacogenomics Centre (@pgxcentre). He has also developed lemieuxl/pyplink and pgxcentre/geneparse, a library with a unified interface to parsing PLINK, BGEN, VCF, and other file formats.

Comments from @tomwhite's gwas-analysis#36 PR:

pure Python (so may need work to optimize for large BGEN files: we shall see). The advantage over bgen-reader is that PyBGEN uses BGEN index files, whereas bgen-reader uses its own 'metafile'. The main problem I saw with bgen-reader is that it opens a new file for every variant it reads, while PyBGEN opens a new file for each batch of variants that are being read (and uses the index to seek appropriately).

Additional comments from gwas-analysis#38:

PyBGEN is pure Python, bgen-reader is a Python wrapper around a C implementation.

bgen-reader doesn't use the bgenix index, it creates a metafile instead, which fulfils a similar role. It would be nice if it used a bgenix index since these are standard (e.g. UKBB ships the index with its BGEN files).

I mistakenly said (in #36) that bgen-reader opens the file every time it reads a variant. This is wrong - it simply seeks to the relevant point in the file, just like PyBGEN.

I found bgen-reader's API impossible to use in an efficient way since it seems geared toward random access, rather than sequential (parallel) access. In this PR I am using protected parts of bgen-reader. Ideally, we would improve its API if we decide to support this code.

I haven't been able to compare the performance of the two yet, but I would like to when we need to load bigger datasets.

bgen-reader seems to have better support for reading samples from a separate .sample side file.

hammer · 2021-09-14T14:06:08Z

hammer
Sep 14, 2021
Maintainer Author

GCP Variant Transforms

GitHub
Docs
License: Apache 2.0

This library, developed by Google, uses the Python SDK for Apache Beam to round trip variant data from VCF to BigQuery and back. It includes a VCF preprocessor that could be useful for us. The current maintainer is Saman Vaisipour (@samanvp).

Currently the code is in Python 2 (yeesh) but they claim to be converting it to Python 3 within the next few months. They also claim to be targeting Avro as the intermediate file format before loading into BigQuery.

hammer · 2021-09-14T14:06:15Z

hammer
Sep 14, 2021
Maintainer Author

PyVCF

GitHub
Docs
License: funky! Copyright assigned to a company (Population Genetics Technologies) using the 3-clause BSD license and an individual (John Dougherty) using an MIT license.

No commits since early 2017. Seems to have been primarily developed by Aaron Quinlan and Martijn Vermaat in 2012. It's unclear why James Casbon owns the repo, as I don't see any commits from him. PyVCF#314 points users to brentp/cyvcf2.

hammer · 2021-09-14T14:06:23Z

hammer
Sep 14, 2021
Maintainer Author

scikit-allel

GitHub
Docs
License: MIT

@alimanfoo nicely documents his allel.read_vcf in his blog post Extracting data from VCF files (2017).

Most of the code is in Cython at io_vcf_read.pyx. It appears to only rely on htslib for tabix support (when using the region parameter).

hammer · 2021-09-14T14:06:36Z

hammer
Sep 14, 2021
Maintainer Author

(Post by @alimanfoo)

Most of the code is in Cython at io_vcf_read.pyx. It appears to only rely on htslib for tabix support (when using the region parameter).

That's right. If region is provided, it will try to call out to a tabix binary if one is installed on the system; if that fails, it falls back to scanning through the file. So tabix is an optional dependency.

Btw there is an open PR that works towards a native Python tabix implementation. There is some good work there, but there are some outstanding questions and I haven't had the bandwidth to shepherd it to completion. One potential benefit of having that in Python would be the ability to combine it with fsspec, which could mean tabix reading of files from a variety of remote sources.

hammer · 2021-09-14T14:07:03Z

hammer
Sep 14, 2021
Maintainer Author

pygds

GitHub
License: GPLv3

CoreArray Genomic Data Structure (GDS) files are used by the GENESIS project. See Stephanie Gogarten's presentation Introduction to GDS (2018) for more.

hammer · 2021-09-14T14:07:16Z

hammer
Sep 14, 2021
Maintainer Author

(Post by @timothymillar)

Pyranges

GitHub
Docs
License: MIT

Pyranges is a library for fast querying of genomic intervals, based on a nested containment list data structure (package 'ncls'). It has readers for many interval formats including bam, bed, gtf and gff.

Pysam

Pysam can also be used for random access to arbitrary tabular formats that are block-gzipped and tabix-indexed, using the TabixFile class. TabixFile can take an optional parser, which is used to parse each line into a data structure. Pysam includes parsers for bed and gtf lines as well as a generic tuple parser.
Pysam also has a FastaFile class for random access to fasta files indexed with faidx.
