Consider using pandas as the in memory data store #89

jamescasbon opened this issue Jan 30, 2013 · 10 comments

Comments

@jamescasbon
Owner

Please see: http://nbviewer.ipython.org/4678244/

Advantages include the ability to do many fast numpy-based operations and to output to many different formats (HDF5, CSV, Excel).
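For instance, once the calls are in a DataFrame, each export is essentially one line. A minimal sketch with made-up column names (the Excel and HDF5 writers need openpyxl and PyTables respectively):

```python
import pandas as pd

# Hypothetical call-level table: one row per (variant, sample) pair.
df = pd.DataFrame({
    "CHROM": ["20", "20"],
    "POS": [14370, 17330],
    "sample": ["NA00001", "NA00002"],
    "GT": ["0|0", "0|1"],
    "DP": [1, 8],
})

df.to_csv("calls.csv", index=False)           # plain text
df.to_excel("calls.xlsx", index=False)        # needs openpyxl
df.to_hdf("calls.h5", key="calls", mode="w")  # needs PyTables
```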

This requires rewriting the parser quite extensively, ideally so that it outputs numpy structures.

I'm very tempted.

@arq5x

arq5x commented Jan 30, 2013

Any sense of how the memory footprint would scale with pandas?

@martijnvermaat
Collaborator

I really like this idea! However, Aaron raises a valid concern.

I think pandas really needs everything in memory (I don't know of a way to use numpy.memmap as a backend, and I also think it wouldn't really help us here). This is of course not an option for very large files (e.g. 1KG files can easily be > 25 GB), and if you just want to parse such files to store the data elsewhere, you don't gain anything by loading it all into one data structure first.

To make things worse, I don't know of a good way (aside from the very inefficient pandas.concat) to construct a DataFrame iteratively. James' example reads everything into memory first, and I see no way around that (but I'm no pandas expert). See e.g. pandas-dev/pandas#2305.
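To illustrate, assuming we already had an iterator of per-record dicts (a hypothetical `records` iterable), the two options look roughly like this; the concat version copies all accumulated data on every step, and the list version still needs everything in memory before construction:

```python
import pandas as pd

def frame_by_concat(records):
    # Grow the frame one record at a time: every concat copies the
    # accumulated data, so the whole loop is roughly O(n^2).
    df = pd.DataFrame()
    for rec in records:
        df = pd.concat([df, pd.DataFrame([rec])], ignore_index=True)
    return df

def frame_by_list(records):
    # Build once from a list of dicts: much cheaper, but every record
    # is held in memory before the DataFrame exists.
    return pd.DataFrame(list(records))
```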

Perhaps another way of going about this is to make it easy to get your data into pandas, without using it as our backend.

@seandavi

This feels like a layer on top of raw parsing? The notebook shows some nice examples of the power and simplicity of the pandas framework, though.

The lack of a fast way to build a DataFrame iteratively is a characteristic shared with R. In Bioconductor, we use chunked iterators, but I personally find the paradigm a bit clunky at times.

@jamescasbon
Owner Author

Here is my current thinking. We don't support fetching less than a single line of VCF, and I don't think we need to. So perhaps when we parse a line we can return a numpy array where the (i, j) element is the data for sample i, format j, which may itself be an array. This is probably the lowest possible memory use for representing that data. Existing method access to the data can be a view on the array. We would then offer an API that can stitch these blocks together into a pandas DataFrame (see the sketch after the list of issues below).

Issues:

1. The array is actually ragged, in that a VCF entry need not define all the keys in the FORMAT field. This can be handled with masked arrays, but that is more complex and less natural.
2. What dtype for the GT entry?
3. Variable-length arrays in the FORMAT data. Maybe the object dtype can be used here.
4. You can't pass an array that has nested arrays straight into pandas. It would need to be reprocessed into several blocks.
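A rough sketch of what I mean, using the object dtype to dodge the ragged-array and dtype questions for now (the helper names are illustrative, not an existing API):

```python
import numpy as np
import pandas as pd

def record_block(record, formats):
    """Return a (samples x formats) object array for one parsed record."""
    block = np.empty((len(record.samples), len(formats)), dtype=object)
    for i, call in enumerate(record.samples):
        for j, fmt in enumerate(formats):
            # FORMAT keys a sample doesn't define simply stay None,
            # which is the cheap stand-in for a masked entry.
            block[i, j] = getattr(call.data, fmt, None)
    return block

def stitch(records, formats):
    """Stack per-record blocks into one DataFrame (one row per call)."""
    blocks = [record_block(r, formats) for r in records]
    return pd.DataFrame(np.vstack(blocks), columns=formats)
```

The object dtype keeps things simple but gives up the fast numeric operations that are the point of the exercise, so typed per-format columns would still be worth pursuing eventually.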

@mattions

Right now we move everything from the VCF into pandas in a similar way to what @jamescasbon proposes above. Memory usage goes up quite quickly; we didn't try to pick dtypes explicitly, though, leaving pandas to guess whichever was closest.

I was wondering if there is any kind of news here?

@averagehat

If there is any continued interest in this, it is possible to create a "lazy" dataframe by using chunking:
http://pandas.pydata.org/pandas-docs/dev/io.html#iterating-through-files-chunk-by-chunk

An example applied to VCF files:
http://nbviewer.ipython.org/github/erscott/pandasVCF/blob/master/ipynb/single_sample_ex.ipynb
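Roughly, for the fixed VCF columns the chunked approach looks like this (a sketch; splitting the FORMAT/sample columns is left out):

```python
import pandas as pd

VCF_COLS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def vcf_chunks(path, chunksize=100000):
    """Yield DataFrames of the fixed VCF columns, chunksize rows at a time."""
    reader = pd.read_csv(
        path,
        sep="\t",
        comment="#",          # skips the ## meta lines and the #CHROM header
        header=None,
        names=VCF_COLS,
        usecols=list(range(len(VCF_COLS))),
        chunksize=chunksize,
    )
    for chunk in reader:
        yield chunk

# e.g. count records without holding the whole file in memory:
# n = sum(len(chunk) for chunk in vcf_chunks("cohort.vcf"))
```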

@mattions

I like the VCF chunk idea. I think it's worth exploring.

@etal

etal commented Nov 18, 2016

+1 for pandas or dask. There are partial implementations of these in scikit-allel, for reference.

Here's another approach that assumes the one or two samples of interest are known up front, and then builds a pandas DataFrame from just the fields of interest using PyVCF.
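In outline it's something like this (a sketch using PyVCF; the sample and field names are placeholders):

```python
import vcf
import pandas as pd

def sample_frame(path, sample, fields=("GT", "DP", "GQ")):
    """Collect a few FORMAT fields for one sample into a DataFrame."""
    rows = []
    for record in vcf.Reader(filename=path):
        call = record.genotype(sample)
        row = {"CHROM": record.CHROM, "POS": record.POS,
               "REF": record.REF, "ALT": str(record.ALT)}
        for field in fields:
            row[field] = getattr(call.data, field, None)
        rows.append(row)
    return pd.DataFrame(rows)

# df = sample_frame("cohort.vcf.gz", "NA00001")
```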

@RGD2

RGD2 commented Dec 13, 2016

Just passing by...
looks like one can give pandas a memmapped array behind its handy functionality.
I'll just leave this here: http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html#/5
Goodbye and good luck!

@RGD2

RGD2 commented Dec 13, 2016

Load and save pandas data with memmapping: https://gist.github.com/luispedro/7887214
Comprehensive guide to 'out of memory analytics' with pandas using a memory mapped file-backed array: http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html#/5
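The gist boils down to something like this; note that pandas may still copy the data into its own blocks depending on dtype and version, so this mainly helps for homogeneous numeric data:

```python
import numpy as np
import pandas as pd

# A float array backed by a file on disk rather than RAM.
mm = np.memmap("depths.dat", dtype="float64", mode="w+", shape=(1000000, 3))
mm[:] = 0.0  # touched pages are written through to the file

# Wrap it; copy=False asks pandas not to duplicate the buffer, though
# many operations will still materialise in-memory copies.
df = pd.DataFrame(mm, columns=["DP", "GQ", "MQ"], copy=False)
print(df["DP"].mean())
```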
