Consider using pandas as the in memory data store #89
Comments
Any sense of how the memory footprint would scale with pandas?
I really like this idea! However, Aaron raises a valid concern. I think pandas really needs everything in memory (I don't know of a way to use numpy.memmap as a backend, but I also think it wouldn't really help us here). This is of course not an option for very large files (e.g. 1KG files can easily be > 25G), and certainly if you just want to parse such files to store the data elsewhere, you don't gain anything by loading it all into one data structure first. To make things worse, I don't know of a good way (aside from the very inefficient pandas.concat) to construct a DataFrame iteratively. James' example reads everything into memory first, and I see no way around that (but I'm no pandas expert). See e.g. pandas-dev/pandas#2305. Perhaps another way of going about this is to make it easy to get your data into pandas, without using it as our backend.
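A minimal sketch of the usual workaround for the lack of an incremental DataFrame constructor: accumulate plain Python records while streaming over the file, then build the DataFrame once at the end. The `parse_line` helper and the idea of returning a dict per line are hypothetical, not part of the current parser.

```python
import pandas as pd

def frame_from_lines(lines, parse_line):
    """Build a DataFrame in one shot instead of repeated pandas.concat calls."""
    records = []                              # appending to a list is cheap
    for line in lines:
        records.append(parse_line(line))      # hypothetical: dict of column -> value
    return pd.DataFrame.from_records(records)
```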
This feels like a layer on top of raw parsing? The notebook shows some nice examples of the power and simplicity of the pandas framework, though. The lack of a fast iterative DataFrame is a characteristic shared with R. In Bioconductor, we use chunked iterators, but I personally find the paradigm a bit clunky at times.
Here is my current thinking. We don't support fetching less than a single line of VCF, and I don't think we need to. So perhaps when we parse a line we can return a numpy array where the (i, j) element is the data for sample i, FORMAT field j, which may itself be an array. This is probably the lowest possible memory use for representing that data. Existing method access to the data can be a view on the array. We would then offer an API that can stitch these blocks together into a pandas DataFrame (see the sketch below). Issues: 1) the array is actually ragged, in that a VCF entry may not define all the fields listed in the FORMAT column. This can be handled with masked arrays, but that is more complex and less natural.
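A rough sketch of the per-record block plus a stitching API. The `record` object here is hypothetical (not necessarily PyVCF's), assumed to carry `POS`, a `FORMAT` list of keys, and per-sample call data as a mapping; masked arrays would be the more principled way to handle raggedness, but `None`-filled object arrays are shown for simplicity. It also assumes all records share the same FORMAT keys.

```python
import numpy as np
import pandas as pd

def record_block(record):
    """(n_samples, n_format) object array; FORMAT values a sample lacks stay None."""
    block = np.full((len(record.samples), len(record.FORMAT)), None, dtype=object)
    for i, call in enumerate(record.samples):
        for j, key in enumerate(record.FORMAT):
            value = call.data.get(key)        # hypothetical per-call mapping
            if value is not None:
                block[i, j] = value
    return block

def stitch(records):
    """Stack per-record blocks into a (position, sample) x FORMAT DataFrame."""
    blocks = [record_block(r) for r in records]
    index = [(r.POS, s) for r in records for s in r.samples]
    return pd.DataFrame(np.vstack(blocks),
                        index=pd.MultiIndex.from_tuples(index),
                        columns=records[0].FORMAT)
```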
Right now we move everything from the VCF into pandas in a similar way to what @jamescasbon proposes above. Memory use goes up quite quickly; however, we didn't try to pick a dtype for the objects, leaving pandas to guess whatever was closest. I was wondering if there is any news here?
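Since letting pandas guess usually means 64-bit or object columns, explicitly downcasting can shrink the footprint noticeably. A small illustrative sketch; the column names are just typical VCF-derived fields, not the project's actual schema:

```python
import pandas as pd

records = [
    {"CHROM": "1", "POS": 10583, "QUAL": 25.0, "FILTER": "PASS"},
    {"CHROM": "1", "POS": 10611, "QUAL": 3.2, "FILTER": "q10"},
]
df = pd.DataFrame(records)
df = df.astype({
    "POS": "int32",        # positions fit comfortably in 32 bits
    "QUAL": "float32",
    "CHROM": "category",   # few distinct values -> large memory win
    "FILTER": "category",
})
print(df.memory_usage(deep=True))  # verify the footprint actually shrank
```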
If there is any continued interest in this, it is possible to create a "lazy" DataFrame by using chunking; a sketch of the idea applied to VCF files follows below.
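A minimal sketch of chunked ("lazy") construction: rather than materialising the whole file, yield one DataFrame per batch of records so callers only ever hold a single chunk in memory. This uses PyVCF's `Reader`; the chunk size and the selected columns are illustrative.

```python
import itertools
import pandas as pd
import vcf  # PyVCF

def vcf_chunks(path, chunksize=10000):
    reader = vcf.Reader(filename=path)
    while True:
        batch = list(itertools.islice(reader, chunksize))
        if not batch:
            break
        yield pd.DataFrame({
            "CHROM": [r.CHROM for r in batch],
            "POS":   [r.POS for r in batch],
            "QUAL":  [r.QUAL for r in batch],
        })

# Usage: stream over chunks, one at a time.
# for chunk in vcf_chunks("example.vcf"):
#     process(chunk)
```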
I like the VCF chunk idea. I think these are worth exploring.
+1 for pandas or dask. There are partial implementations of these in scikit-allel, for reference. Here's another approach that assumes the 1 or 2 samples of interest are known up front, then builds a pandas DataFrame from the limited fields of interest using PyVCF (a sketch follows below).
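A hedged sketch of that approach: pull only a handful of fields for the samples of interest with PyVCF and build a small DataFrame from them. The sample names and chosen columns are placeholders.

```python
import pandas as pd
import vcf  # PyVCF

def frame_for_samples(path, samples=("NA00001", "NA00002")):
    reader = vcf.Reader(filename=path)
    rows = []
    for record in reader:
        row = {
            "CHROM": record.CHROM,
            "POS": record.POS,
            "REF": record.REF,
            "ALT": ",".join(str(a) for a in record.ALT),
        }
        for sample in samples:
            call = record.genotype(sample)
            row[sample + "_GT"] = call.gt_type  # 0 = hom ref, 1 = het, 2 = hom alt
        rows.append(row)
    return pd.DataFrame(rows)
```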
Just passing by... |
Load and save pandas data with memmapping: https://gist.github.com/luispedro/7887214
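A rough sketch of the general idea behind memmapped storage (not the gist itself): numeric columns are written with `np.save` and re-opened with `mmap_mode='r'`, so the OS pages data in on demand instead of loading it all up front.

```python
import numpy as np
import pandas as pd

def save_columns(df, prefix):
    for name in df.columns:
        np.save(prefix + "." + name + ".npy", df[name].to_numpy())

def load_columns(prefix, names):
    # Each column comes back as a read-only memmap.  Note that wrapping the
    # arrays in a DataFrame may trigger a copy, so heavy numeric work is
    # best done on the memmapped arrays directly.
    return {name: np.load(prefix + "." + name + ".npy", mmap_mode="r")
            for name in names}
```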
Please see: http://nbviewer.ipython.org/4678244/
Advantages are the ability to do many fast numpy-based operations and to output to many different formats (HDF5, CSV, Excel); see the example below.
This requires rewriting the parser quite extensively, ideally to output numpy structures.
I'm very tempted.
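To illustrate the export point above: once variant data sits in a DataFrame, the formats mentioned are one-liners. The file names and columns are placeholders; `to_hdf` needs PyTables and `to_excel` needs an Excel writer installed.

```python
import pandas as pd

df = pd.DataFrame({"CHROM": ["1", "1"], "POS": [10583, 10611], "QUAL": [25.0, 3.2]})
df.to_csv("variants.csv", index=False)
df.to_hdf("variants.h5", key="variants", mode="w")
df.to_excel("variants.xlsx", index=False)
```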