Python IO libraries #671
Replies: 11 comments
-
PySnpTools
This library was initially authored by Microsoft Research and now appears to be maintained by Carl Kadie (@CarlKCarlK on GitHub), a former Microsoft Research employee who is now retired, according to his LinkedIn profile.
-
Pysam
Over a decade old! Andreas Heger (@AndreasHeger on GitHub) seems to be the original author and is still active in the repo. According to his LinkedIn profile, he's now at Genomics plc. John Marshall (@jmarshall) is also active and from his LinkedIn profile he appears to work in cancer genomics in Glasgow. The workhorse module in this library appears to be libcbcf.pyx, a Cython wrapper for
-
bgen-reader
A library from Oliver Stegle's group at EMBL Heidelberg. Danilo Horta (@horta on GitHub) is the primary developer of both the C library and the Python wrapper.
-
(Post by @tomwhite)
BED (PySnpTools)
From the point of view of splitting (being able to read a file in parallel, by reading a chunk without having to read all the bytes in the file up to the start of that chunk), BED files are simple to read. BED files are not compressed, and when stored in "SNP-major" form (the default) all the samples for each SNP are stored together, 4 genotypes per byte, aligned on byte boundaries. So if there are M samples, then each SNP (row) is encoded in a fixed number of bytes that depends only on M.
In terms of testing on real-world PLINK files, a "minitest" for basic reading and writing was added recently, which might be a good place to contribute to. Also, some generated (synthetic) tests were added at the end of last year.
BGEN (bgen-reader)
For splitting: BGEN is effectively “SNP-major”, in that each variant is stored in turn in a (possibly compressed) data block. The offsets of these data blocks are not stored in one place in the file, so without an index it is not possible to easily split the file into chunks. The bgenix tool defines a BGEN index format that is stored as a SQLite file. (This index format is used by Glow, but not by Hail, which uses its own index format.) The bgen library doesn’t use bgenix, but instead creates a “metafile” the first time the file is loaded, which divides the BGEN file into a set number of partitions, which can later be read independently. The bgen-reader-py library returns a dask dataframe, using dask delayed to load partitions. There would be some work to change this to load xarray instead.
For testing, bgen-reader-py stores example data on S3, which would make it feasible to contribute test samples at some point.
VCF (Pysam)
For splitting: since they are plain text, VCF files can be trivially split on newline boundaries. However, VCF files are usually compressed using bgzip, which can be split using heuristics, or by using an index file, such as the .gzi index.
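To make the bgzip point a bit more concrete, here is a minimal sketch (not Pysam or htslib code) that walks a bgzipped file block by block. It assumes the standard BGZF layout: each block is a gzip member whose extra field carries a `BC` subfield with the block size minus one, and whose last 4 bytes give the uncompressed length. The function names `iter_bgzf_blocks` and `make_bgzf_block` are made up for this illustration, and the second one exists only so the example runs without external files. The (compressed offset, uncompressed offset) pairs it produces are roughly the kind of information a block index records, although this sketch does not read or write the actual .gzi format.

```python
import io
import struct
import zlib
from typing import BinaryIO, Iterator, Tuple


def iter_bgzf_blocks(f: BinaryIO) -> Iterator[Tuple[int, int]]:
    """Yield (compressed_offset, uncompressed_offset) for each BGZF block in f."""
    coffset = 0  # where the current block starts in the compressed file
    uoffset = 0  # where its data starts in the uncompressed stream
    while True:
        f.seek(coffset)
        header = f.read(12)
        if not header:
            return  # clean end of file
        if len(header) < 12:
            raise ValueError(f"truncated block header at offset {coffset}")
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<4BI2BH", header)
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not (flg & 4):
            raise ValueError(f"not a BGZF block at offset {coffset}")
        # Scan the gzip extra field for the 'BC' subfield (block size minus 1).
        extra = f.read(xlen)
        block_len = None
        pos = 0
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<2BH", extra, pos)
            if (si1, si2, slen) == (66, 67, 2):
                block_len = struct.unpack_from("<H", extra, pos + 4)[0] + 1
                break
            pos += 4 + slen
        if block_len is None:
            raise ValueError(f"no BC subfield in block at offset {coffset}")
        # The last 4 bytes of the block hold the uncompressed size (ISIZE).
        f.seek(coffset + block_len - 4)
        (isize,) = struct.unpack("<I", f.read(4))
        yield coffset, uoffset
        coffset += block_len
        uoffset += isize


def make_bgzf_block(data: bytes) -> bytes:
    """Build one BGZF block from data (hypothetical helper, for the demo only)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(data) + comp.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block length minus 1
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<2BHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer


# Demo: two data blocks followed by the empty end-of-file block.
blocks = [
    make_bgzf_block(b"line1\nline2\n"),
    make_bgzf_block(b"line3\n"),
    make_bgzf_block(b""),
]
offsets = list(iter_bgzf_blocks(io.BytesIO(b"".join(blocks))))
```

Because every block can be decompressed on its own, a reader that knows these offsets can start at any block boundary. A record may still straddle two blocks, so a chunked reader has to handle a partial first and last line.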
I would recommend creating an index (a one-off task), since it uses existing tools and avoids introducing another source of hard-to-debug bugs. I'm not sure how easy it would be to use Pysam to read a bgzipped VCF in chunks to load into xarray/dask (i.e. I'm not sure if it's possible out-of-the-box, or if Pysam would need any modification). From a testing point of view, Pysam uses htslib, which is very popular and well-tested in general, although .gzi is fairly niche so might need some attention.
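Going back to the BED layout described earlier in this post, the byte range of any run of SNPs can be computed without reading anything else. This is a minimal sketch and not PySnpTools code. It assumes the PLINK 1 layout as I understand it: a 3-byte header (two magic bytes plus a SNP-major flag) in front of the packed rows, and the first sample of each row in the low-order two bits of its first byte. The function names are made up for illustration, and the 2-bit codes are returned raw rather than interpreted as genotypes.

```python
BED_HEADER = bytes([0x6C, 0x1B, 0x01])  # assumed PLINK 1 magic bytes + SNP-major flag


def bytes_per_variant(n_samples: int) -> int:
    """4 genotypes per byte, padded to a byte boundary: ceil(n_samples / 4)."""
    return (n_samples + 3) // 4


def variant_byte_range(first_variant: int, n_variants: int, n_samples: int):
    """Return the (start, stop) byte offsets of a contiguous run of variants."""
    row = bytes_per_variant(n_samples)
    start = len(BED_HEADER) + first_variant * row
    return start, start + n_variants * row


def decode_variant(row: bytes, n_samples: int):
    """Unpack the raw 2-bit code of each sample from one variant row."""
    codes = []
    for i in range(n_samples):
        byte = row[i // 4]
        codes.append((byte >> (2 * (i % 4))) & 0b11)
    return codes


# Demo with 5 samples, so each variant occupies 2 bytes.
first_chunk = variant_byte_range(0, 1, n_samples=5)
later_chunk = variant_byte_range(2, 3, n_samples=5)
codes = decode_variant(bytes([0b11100100, 0b00000001]), n_samples=5)
```

Each worker can then seek to `start` and read `stop - start` bytes, which is what makes BED easy to read in parallel.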
-
PyBGEN
Louis-Philippe Lemieux Perreault (@lemieuxl) is the sole developer, working at the Beaulieu-Saucier Université de Montréal Pharmacogenomics Centre (@pgxcentre). He has also developed lemieuxl/pyplink and pgxcentre/geneparse, a library with a unified interface for parsing PLINK, BGEN, VCF, and other file formats.
Comments from @tomwhite's gwas-analysis#36 PR:
Additional comments from gwas-analysis#38:
-
GCP Variant Transforms
This library, developed by Google, uses the Python SDK for Apache Beam to round-trip variant data from VCF to BigQuery and back. It includes a VCF preprocessor that could be useful for us. The current maintainer is Saman Vaisipour (@samanvp). Currently the code is in Python 2 (yeesh), but they claim to be converting it to Python 3 within the next few months. They also claim to be targeting Avro as the intermediate file format before loading into BigQuery.
-
PyVCF
No commits since early 2017. Seems to have been primarily developed by Aaron Quinlan and Martijn Vermaat in 2012. It's unclear why James Casbon owns the repo, as I don't see any commits from him? PyVCF#314 points users to brentp/cyvcf2.
-
scikit-allel
@alimanfoo nicely documents his
Most of the code is in Cython at io_vcf_read.pyx. It appears to only rely on
-
(Post by @alimanfoo)
That's right, if
Btw there is an open PR that works towards a native Python tabix implementation. There's some good work there, but there are some outstanding questions and I haven't had the bandwidth to shepherd it to completion. One potential benefit of having that in Python would be the ability to combine it with fsspec, which could mean tabix reading of files from a variety of remote sources.
-
pygds
CoreArray Genomic Data Structure (GDS) files are used by the GENESIS project. See Stephanie Gogarten's presentation Introduction to GDS (2018) for more.
-
(Post by @timothymillar)
Pyranges
A library for fast querying of genomic intervals, based on a nested containment list data structure (package 'ncls'). It has readers for many interval formats, including bam, bed, gtf and gff.
Pysam
Can also be used for random access to arbitrary tabular formats that are block-gzipped and tabix-indexed, using the TabixFile class. TabixFile can take an optional parser, which is used to parse each line into a data structure. Pysam includes parsers for bed and gtf lines as well as a generic tuple parser.
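The "optional parser" idea can be sketched in plain Python. This is not Pysam's API or implementation: `fetch`, `parse_tuple`, `parse_bed` and `BedRecord` are hypothetical names, and the lookup here is a linear scan over lines, whereas a tabix-indexed file would use its index to jump to the relevant blocks. It only shows how a pluggable parser turns each matching line into a data structure.

```python
from typing import Callable, Iterable, Iterator, NamedTuple, Optional


class BedRecord(NamedTuple):
    contig: str
    start: int
    end: int
    name: Optional[str]


def parse_tuple(line: str) -> tuple:
    """Generic parser: split a tab-delimited line into a tuple of strings."""
    return tuple(line.rstrip("\n").split("\t"))


def parse_bed(line: str) -> BedRecord:
    """BED parser: typed access to the first few columns."""
    fields = line.rstrip("\n").split("\t")
    name = fields[3] if len(fields) > 3 else None
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name)


def fetch(
    lines: Iterable[str],
    contig: str,
    start: int,
    end: int,
    parser: Callable[[str], object] = parse_tuple,
) -> Iterator[object]:
    """Yield parsed records overlapping the half-open interval [start, end)."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        c, s, e = line.split("\t", 3)[:3]
        if c == contig and int(s) < end and int(e) > start:
            yield parser(line)


# Demo on a few in-memory BED-like lines.
lines = ["#header\n", "chr1\t0\t10\tgeneA\n", "chr1\t20\t30\n", "chr2\t5\t15\tgeneB\n"]
as_tuples = list(fetch(lines, "chr1", 5, 25))
as_records = list(fetch(lines, "chr1", 5, 25, parser=parse_bed))
```

Keeping the parser separate from the lookup means the same region query can serve different tabular formats.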
-
Let's use this topic to collect useful libraries for reading and writing data stored in files with formats commonly used by the statistical genetics community.
Import and Export rows