Consider using pandas as the in memory data store #89
Comments
Any sense of how the memory footprint would scale with pandas?
I really like this idea! However, Aaron raises a valid concern. I think pandas really needs everything in memory (I don't know of a way to use numpy.memmap as a backend, but I also think it wouldn't really help us here). This is of course not an option for very large files (e.g. 1KG files can easily be > 25G), and certainly if you just want to parse such files to store the data elsewhere, you don't gain anything by loading it all into one data structure first. To make things worse, I don't know of a good way (aside from the very inefficient pandas.concat) to construct a DataFrame iteratively. James' example reads everything into memory first, and I see no way around that (but I'm no pandas expert). See e.g. pandas-dev/pandas#2305. Perhaps another way of going about this is to make it easy to get your data into pandas, without using it as our backend.
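A minimal sketch of the usual workaround for the lack of an incremental DataFrame constructor: accumulate plain Python records while streaming over the file, then build the DataFrame once at the end. The `parse_line` helper and the idea of returning a dict per line are hypothetical, not part of the current parser.

```python
import pandas as pd

def frame_from_lines(lines, parse_line):
    """Build a DataFrame in one shot instead of repeated pandas.concat calls."""
    records = []                              # appending to a list is cheap
    for line in lines:
        records.append(parse_line(line))      # hypothetical: dict of column -> value
    return pd.DataFrame.from_records(records)
```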
This feels like a layer on top of raw parsing? The notebook shows some nice examples of the power and simplicity of the pandas framework, though. The lack of a fast iterative DataFrame is a characteristic shared with R. In Bioconductor, we use chunked iterators, but I personally find the paradigm a bit clunky at times.
Here is my current thinking. We don't support fetching less than a single line of VCF, and I don't think we need to. So perhaps when we parse a line we can return a numpy array where the (i, j) element is the data for sample i, FORMAT field j, which may itself be an array. This is probably the lowest possible memory use for representing that data. Existing method access to the data can be a view on the array. We would then offer an API that can stitch these blocks together into a pandas DataFrame (see the sketch below). Issues: 1) the array is actually ragged, in that a VCF entry may not define all the fields listed in the FORMAT column. This can be handled with masked arrays, but that is more complex and less natural.
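A rough sketch of the per-record block plus a stitching API. The `record` object here is hypothetical (not necessarily PyVCF's), assumed to carry `POS`, a `FORMAT` list of keys, and per-sample call data as a mapping; masked arrays would be the more principled way to handle raggedness, but `None`-filled object arrays are shown for simplicity. It also assumes all records share the same FORMAT keys.

```python
import numpy as np
import pandas as pd

def record_block(record):
    """(n_samples, n_format) object array; FORMAT values a sample lacks stay None."""
    block = np.full((len(record.samples), len(record.FORMAT)), None, dtype=object)
    for i, call in enumerate(record.samples):
        for j, key in enumerate(record.FORMAT):
            value = call.data.get(key)        # hypothetical per-call mapping
            if value is not None:
                block[i, j] = value
    return block

def stitch(records):
    """Stack per-record blocks into a (position, sample) x FORMAT DataFrame."""
    blocks = [record_block(r) for r in records]
    index = [(r.POS, s) for r in records for s in r.samples]
    return pd.DataFrame(np.vstack(blocks),
                        index=pd.MultiIndex.from_tuples(index),
                        columns=records[0].FORMAT)
```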
Right now we move everything from the VCF into pandas in a similar way to what @jamescasbon proposes above. Memory use goes up quite quickly; however, we didn't try to pick a dtype for the objects, leaving pandas to guess whatever was closest. I was wondering if there is any news here?
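Since letting pandas guess usually means 64-bit or object columns, explicitly downcasting can shrink the footprint noticeably. A small illustrative sketch; the column names are just typical VCF-derived fields, not the project's actual schema:

```python
import pandas as pd

records = [
    {"CHROM": "1", "POS": 10583, "QUAL": 25.0, "FILTER": "PASS"},
    {"CHROM": "1", "POS": 10611, "QUAL": 3.2, "FILTER": "q10"},
]
df = pd.DataFrame(records)
df = df.astype({
    "POS": "int32",        # positions fit comfortably in 32 bits
    "QUAL": "float32",
    "CHROM": "category",   # few distinct values -> large memory win
    "FILTER": "category",
})
print(df.memory_usage(deep=True))  # verify the footprint actually shrank
```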
If there is any continued interest in this, it is possible to create a "lazy" DataFrame by using chunking; a sketch of the idea applied to VCF files follows below.
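A minimal sketch of chunked ("lazy") construction: rather than materialising the whole file, yield one DataFrame per batch of records so callers only ever hold a single chunk in memory. This uses PyVCF's `Reader`; the chunk size and the selected columns are illustrative.

```python
import itertools
import pandas as pd
import vcf  # PyVCF

def vcf_chunks(path, chunksize=10000):
    reader = vcf.Reader(filename=path)
    while True:
        batch = list(itertools.islice(reader, chunksize))
        if not batch:
            break
        yield pd.DataFrame({
            "CHROM": [r.CHROM for r in batch],
            "POS":   [r.POS for r in batch],
            "QUAL":  [r.QUAL for r in batch],
        })

# Usage: stream over chunks, one at a time.
# for chunk in vcf_chunks("example.vcf"):
#     process(chunk)
```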
I like the VCF chunk idea. I think these are worth exploring.
+1 for pandas or dask. There are partial implementations of these in scikit-allel, for reference. Here's another approach that assumes the 1 or 2 samples of interest are known up front, then builds a pandas DataFrame from the limited fields of interest using PyVCF (a sketch follows below).
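A hedged sketch of that approach: pull only a handful of fields for the samples of interest with PyVCF and build a small DataFrame from them. The sample names and chosen columns are placeholders.

```python
import pandas as pd
import vcf  # PyVCF

def frame_for_samples(path, samples=("NA00001", "NA00002")):
    reader = vcf.Reader(filename=path)
    rows = []
    for record in reader:
        row = {
            "CHROM": record.CHROM,
            "POS": record.POS,
            "REF": record.REF,
            "ALT": ",".join(str(a) for a in record.ALT),
        }
        for sample in samples:
            call = record.genotype(sample)
            row[sample + "_GT"] = call.gt_type  # 0 = hom ref, 1 = het, 2 = hom alt
        rows.append(row)
    return pd.DataFrame(rows)
```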
Just passing by... |
Load and save pandas data with memmapping: https://gist.github.com/luispedro/7887214
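A rough sketch of the general idea behind memmapped storage (not the gist itself): numeric columns are written with `np.save` and re-opened with `mmap_mode='r'`, so the OS pages data in on demand instead of loading it all up front.

```python
import numpy as np
import pandas as pd

def save_columns(df, prefix):
    for name in df.columns:
        np.save(prefix + "." + name + ".npy", df[name].to_numpy())

def load_columns(prefix, names):
    # Each column comes back as a read-only memmap.  Note that wrapping the
    # arrays in a DataFrame may trigger a copy, so heavy numeric work is
    # best done on the memmapped arrays directly.
    return {name: np.load(prefix + "." + name + ".npy", mmap_mode="r")
            for name in names}
```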
Please see: http://nbviewer.ipython.org/4678244/
Advantages are the ability to do many fast numpy-based operations and to output to many different formats (HDF5, CSV, Excel); see the example below.
This requires rewriting the parser quite extensively, ideally to output numpy structures.
I'm very tempted.
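To illustrate the export point above: once variant data sits in a DataFrame, the formats mentioned are one-liners. The file names and columns are placeholders; `to_hdf` needs PyTables and `to_excel` needs an Excel writer installed.

```python
import pandas as pd

df = pd.DataFrame({"CHROM": ["1", "1"], "POS": [10583, 10611], "QUAL": [25.0, 3.2]})
df.to_csv("variants.csv", index=False)
df.to_hdf("variants.h5", key="variants", mode="w")
df.to_excel("variants.xlsx", index=False)
```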