ENH: create out-of-core processing module #3202
This is the functionality that pandas currently lacks that is preventing me from ditching SAS. This is a monumental task but so worth the effort! I'm very happy to see it being discussed. Is there any possibility of collaborating with this project? http://continuum.io/blog/blaze |
The timeline for when something production-ready is going to ship there is unclear but worth keeping an eye on for collaboration possibilities. |
@Zelazny7 could you put up some example calculations? |
For example, finding the within-group sum for a dataframe could be broken up iteratively like so. This is all done in memory, but the same concept could be applied to a file stored on disk and read in piecemeal. The purpose of this enhancement, I hope, is to make the management of the IO and chunking transparent to the user:

```python
def gen(df, chunksize=50000):
    nrows = len(df)
    current = 0
    while current < nrows:
        yield df[current:current + chunksize]
        current += chunksize

res = pd.concat([chunk.groupby('YearMade').sum() for chunk in gen(df)])
res.groupby(res.index).sum().head()
```

```
          SalePrice
YearMade
1000      781142325
1919        2210750
1920         493250
1937          18000
1942          51000
```

This is equivalent to:

```python
df.groupby('YearMade').sum().head()
```

```
          SalePrice
YearMade
1000      781142325
1919        2210750
1920         493250
1937          18000
1942          51000
```

I tried using %timeit to compare:

```python
def test():
    res = pd.concat([chunk.groupby('YearMade').sum() for chunk in gen(df)])
    res.groupby(res.index).sum()
```

```
In [1]: %timeit test()
10 loops, best of 3: 31.2 ms per loop

In [2]: %timeit df.groupby('YearMade').sum()
10 loops, best of 3: 32.8 ms per loop
```

There is a broad class of calculations that can be performed on chunks of the data with the results collated together. The sum example shown above is one such case, and it can easily be extended to the mean, variance, and standard deviation. Medians and quantiles are a bit trickier, and certain kinds of regression would have to fall back on gradient-descent algorithms rather than closed-form solutions. Edited to add that scikit-learn has several algorithms that support incremental (out-of-core) fitting. BTW, I pulled the data from Kaggle: http://www.kaggle.com/c/bluebook-for-bulldozers |
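As a small illustration of the "extends to the mean" point (a sketch building on the `gen` generator and the bulldozers columns above, not part of the original comment): the within-group mean can be assembled from per-chunk sums and counts.

```python
import pandas as pd

# accumulate per-chunk group sums and counts, then combine at the end
partials = [chunk.groupby('YearMade')['SalePrice'].agg(['sum', 'count'])
            for chunk in gen(df)]
combined = pd.concat(partials).groupby(level=0).sum()

# the overall group mean is total sum / total count
group_means = combined['sum'] / combined['count']
```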
@Zelazny7 I think we could start off with an API something like this (pseudo-code-ish here):

- Create pipes
- Eval a pipe (and return a new pipe of this computation)
- Get results
- Cleanup (delete temporary files) -- how to make this clean up nicely?

The reason for the ... is that I think it could support methods of ..., so I think that we could write your computation like this, if df happens to be on disk. Alternatively, ... Could do some or all of the three. We usually use ...; any suggestions for names? "pipeline" is just sort of random.... |
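A rough sketch of what such a pipeline object might look like (the names `Pipeline`, `apply`, `results`, and `cleanup` are placeholders invented for illustration, not an agreed API):

```python
import pandas as pd

class Pipeline(object):
    """Hypothetical out-of-core pipeline: wraps a chunk iterator,
    applies a function per chunk, and collates the results."""

    def __init__(self, chunk_iter):
        self.chunk_iter = chunk_iter      # anything yielding DataFrames

    def apply(self, func):
        # lazily apply func to each chunk, returning a new pipeline
        return Pipeline(func(chunk) for chunk in self.chunk_iter)

    def results(self, combine=pd.concat):
        # materialize the computation and combine the chunk results
        return combine(list(self.chunk_iter))

    def cleanup(self):
        # placeholder: remove any temporary files created along the way
        pass

# usage: within-group sums computed chunk-by-chunk
# pipe = Pipeline(pd.read_csv('somefile.csv', chunksize=50000))
# res = pipe.apply(lambda c: c.groupby('YearMade').sum()).results()
# res = res.groupby(res.index).sum()
```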
This probably also should be done with a decorator, mainly to facilitate the open/close context of the file (see http://docs.celeryproject.org/en/latest/getting-started/introduction.html). Then this will work (I think). |
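One way to read the decorator idea (purely a guess at what was intended; the names here are made up, and the data is assumed to be stored in HDF table format so it can be iterated):

```python
import functools
import pandas as pd

def with_store(path, key, chunksize=50000):
    """Hypothetical decorator: open an HDFStore, feed the wrapped
    function an iterator of chunks, and always close the file."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            store = pd.HDFStore(path, mode='r')
            try:
                chunks = store.select(key, chunksize=chunksize)
                return func(chunks, *args, **kwargs)
            finally:
                store.close()
        return wrapper
    return decorator

# @with_store('data.h5', 'df')
# def group_sums(chunks):
#     parts = [c.groupby('YearMade').sum() for c in chunks]
#     res = pd.concat(parts)
#     return res.groupby(res.index).sum()
```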
Having a PyTables evaluator, we can provide simple computations through this directly; also numpy w/ multiprocessing.

Things to take note of:
- mclapply in R
- creating shared memory with multiprocessing, from here: http://mail.scipy.org/pipermail/scipy-user/2013-April/034415.html
- another good ref: ... |
Answered in this question: two methods for groupby. |
Functionality like the ff package from R would go a long way towards making pandas easier to use with large-ish data: http://cran.r-project.org/web/packages/ff/index.html HDFStore is great, but not being able to transparently select columns and, more importantly, append new columns makes it difficult to use efficiently. ff essentially stores a dataframe on disk. Users can bring portions of the dataframe into memory using the same syntax as an in-memory dataframe. This is why I call it transparent. Furthermore, it is very easy to "write" a new column by using the same syntax to assign a new column to an in-memory dataframe. This functionality alone - being able to access elements of an on-disk dataframe - would be a huge win for pandas and could be implemented more quickly than the sequential processes discussed above. I think this would be a good stepping stone on the way towards out-of-core computations. I think it's reasonable for users to pull subsets of their data into memory for manipulation. It becomes a pain when we have to manage the i/o operations. This would alleviate that pain point. |
You can now select columns easily. Adding columns efficiently is quite difficult: you would need to store the data column-wise, and then adding rows would be difficult. You have to pick (that said, we possibly could support columnar access via ...). You can easily add columns if you disaggregate your table, e.g. store a column per group. The functionality you are asking for here is basically a numpy memmap. |
In the predictive modeling world (think FICO and other credit scores), it's pretty much always a big table of heterogeneous data. I would love to use pandas like this:

```python
# read in a csv (or other text file) using the awesome pandas parser, but store on disk
df_on_disk = pd.read_csv('somefile.csv', dtypes=dtypes, mode='disk')

# select columns into memory
col_in_mem = df_on_disk['col_that_I_need']

# multiple columns
cols_in_mem = df_on_disk[['col1', 'col2', 'col8']]

# writing to disk using a column in memory
modified_col_in_mem = cool_function(col_in_mem)
df_on_disk['new_col'] = modified_col_in_mem
```

All of the cool indexing that pandas handles in-memory could be ignored for the on-disk dataframe. As long as it has the same number of rows, a new column would be slapped onto the end. |
I'm starting to understand the difficulty of quickly accessing and storing columns like this. To accomplish it using pd.read_csv, one loop for every column would have to be performed. Still, it would only have to be done once. The columns would have to be written to a mmap file in sequential order. Storing metadata about the columns and types would allow the mmap to be quickly accessed and return pandas objects. I think I'll toy around with creating something like this and see how it goes. |
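A toy version of that idea, for what it's worth (a sketch assuming numeric columns of equal length; the file layout and class name are invented, and a real implementation would also need to persist the dtype/length metadata to disk):

```python
import numpy as np
import pandas as pd

class ColumnStore(object):
    """Toy on-disk column store: one memmap file per column, plus an
    in-memory dict recording each column's dtype."""

    def __init__(self, directory, nrows):
        self.directory = directory
        self.nrows = nrows
        self.meta = {}                      # column name -> dtype

    def __setitem__(self, name, values):
        arr = np.asarray(values)
        mm = np.memmap('%s/%s.dat' % (self.directory, name),
                       dtype=arr.dtype, mode='w+', shape=(self.nrows,))
        mm[:] = arr                          # write the column to disk
        mm.flush()
        self.meta[name] = arr.dtype

    def __getitem__(self, name):
        mm = np.memmap('%s/%s.dat' % (self.directory, name),
                       dtype=self.meta[name], mode='r', shape=(self.nrows,))
        return pd.Series(mm, name=name)      # bring the column into memory
```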
If I remember how your original question on SO was, you just need to write the table (as is) to HDFStore, with whatever data columns you want (or multiple tables if it's very wide); then you just write columns to another group/groups as needed. There is some bookkeeping in order to track what is where, so that when you do a query you are pulling in the correct groups (and not selecting on certain ones), but this should be quite fast. And every once in a while you could rebuild the store into a new one (conceptually 'packing' it). It's kind of like a distributed file system: you write, and the system stores it where convenient. It has a map to tell you where everything is (kind of like a table of contents). The user of the system doesn't care exactly where the data is, just that they can do operations on it. Appends of rows are a bit more complicated because you have to split the row and append to multiple groups. Column appends are easy. Even deletes of columns can be handled (just remember that the column is deleted). You ideally want locality of data when you select, e.g. if you typically want a bunch of columns at the same time, then store them in a table together. I think we could put a wrapper around this and make it pretty transparent. Wrap the selection mechanism and you could effectively have an out-of-core dataframe. |
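A minimal sketch of that bookkeeping (borrowing `df`, `cool_function`, and the column names from the wishful example above; the key naming scheme and the `column_map` dict are invented for illustration):

```python
import pandas as pd

store = pd.HDFStore('wide.h5')
store.put('base', df, format='table')          # the original wide table
column_map = {'base': list(df.columns)}        # key -> columns it holds

# "append" a new column by writing it to its own group
new_col = cool_function(df['col_that_I_need'])
store.put('extra/new_col', new_col.to_frame('new_col'), format='table')
column_map['extra/new_col'] = ['new_col']

# on read, pull only the groups that contain the columns you need,
# then glue the pieces together side by side
wanted = ['col1', 'new_col']
pieces = [store.select(key, columns=[c for c in cols if c in wanted])
          for key, cols in column_map.items()
          if any(c in wanted for c in cols)]
result = pd.concat(pieces, axis=1)
store.close()
```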
@Zelazny7 I lost the question where we were talking about column-wise tables... you might find this link interesting: ... I actually think we could implement this directly (and not wait for PyTables to do it) via ... |
That does seem promising. Having an on-disk, column-oriented datastore for pandas would be a huge improvement for a large class of use cases. Even a simple interface for storing/appending/retrieving would be enough to start using pandas in earnest (again, for my use case). In the meantime, I've had to switch to R with the ff package. It is pretty much exactly what is described here. The ff package only handles the storage and retrieval of R column vectors. This is a nice addition on its own, as it allows the user to choose what is brought into memory. However, another project quickly created the ffbase package, which creates special forms of all the commonly used functions that work specifically on the on-disk ff objects. Perhaps a similar approach could be taken with pandas. I wish I were a good enough programmer to help develop, but I can definitely help test any ideas and provide user feedback. Thanks again for your hard work. |
@Zelazny7 I'm not sure a column-oriented database is a complete solution for the use case you mentioned - df.groupby('YearMade').sum().head() when df is very large. For example, you would still run into the issue if each column is very large.

A proposal for my use cases, which may also cover yours. Assume a csv with columns ['col1', 'col2', 'col3' ...]. To store this to file you would do ... This would store it as follows: ... This would then allow you to do the following: ...

This is a first-cut API (an illustrative sketch follows below). If you cache the 'chunk map' as metadata, then other interesting possibilities emerge. Making row/column insertion transparent might be tricky, but possible. Given the huge effort required in an overhaul like blaze, the effort required for this might be considerably smaller.

Caveat: my python/big data knowledge is small but growing :) So if I've missed something obvious, feel free not to mince words. |
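Purely to make the proposal concrete (the original code blocks were lost; none of these function names exist, and this is only an illustration of a row-chunked store with a cached chunk map, backed here by HDFStore):

```python
import pandas as pd

def to_chunked_store(csv_path, store_path, chunksize=100000):
    """Hypothetical: split a csv into row chunks, store each chunk under
    its own key, and record a chunk map so chunks can be found again."""
    store = pd.HDFStore(store_path)
    chunk_map = []
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
        key = 'chunk_%d' % i
        store.put(key, chunk, format='table')
        chunk_map.append({'key': key, 'nrows': len(chunk)})
    store.put('chunk_map', pd.DataFrame(chunk_map))   # cached metadata
    store.close()

def apply_chunked(store_path, func, columns=None):
    """Apply func to every chunk, optionally pulling only some columns."""
    store = pd.HDFStore(store_path, mode='r')
    chunk_map = store.select('chunk_map')
    results = [func(store.select(row['key'], columns=columns))
               for _, row in chunk_map.iterrows()]
    store.close()
    return pd.concat(results)
```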
You can already do this with the csv iterator (or even better with the HDF iterator); both are row-oriented. See the docs: http://pandas.pydata.org/pandas-docs/dev/io.html#iterating-through-files-chunk-by-chunk and http://pandas.pydata.org/pandas-docs/dev/io.html#iterator. @Zelazny7 is working specifically with column-oriented data, in that you need to add/delete columns, which row-oriented storage doesn't support as well, because a) you need to track the adds/deletes (and have a layer that allows you to do this transparently), and b) it brings in ALL the data in a row whenever you bring that row in; in a very wide table this can be somewhat inefficient. A partial solution is this: http://pandas.pydata.org/pandas-docs/dev/io.html#multiple-table-queries What is the problem you are trying to address? |
As already noted (in your case a), reading chunk by chunk does not solve adding/deleting columns. I was trying to point out that a column-oriented database by itself does not completely solve the problem of large data sets (as a column might be too large to load into memory). Hence, you need to chunk ROWS AS WELL AS COLUMNS. IMO, with large datasets the challenge is how to 'chunk' the data. If there were a generic way to do it, like proposed in blaze (my opinion comes from this - http://vimeo.com/53031980), that would be great. But allowing a user who is knowledgeable about the dataset to specify the dimensions along which to do this would be best. To explain, Scenario 1: a user like Zelazny7 might want to 'chunk' the dataset merely by row number, i.e. 10000 rows, etc. As far as case (b) goes, the API I suggested would clearly help, as it could be extended to allow the user to specify which of the columns she wants passed to the function. IMO, providing an API like the above, although not generic, would be easier to implement and hence have a larger impact quicker. |
What you are suggesting is quite straightforward with HDFStore today. Write your dataset as a table, and use an iterator (possibly with a query). You will then chunk by rows (or by whatever is in your query). I do this all the time by lining my data up in the appropriate way. Using your example, I keep a panel of: ... I also do this in a 4-d case, providing arbitrary axes support. See http://pandas.pydata.org/pandas-docs/dev/io.html#experimental. |
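For reference, a sketch of that pattern with the existing API (the file name and the `grp`/`val` columns are placeholders):

```python
import pandas as pd

# write the dataset once as a table so it can be queried and iterated
df.to_hdf('data.h5', 'df', format='table', data_columns=['grp'])

# chunk by rows ...
pieces = [chunk.groupby('grp')['val'].sum()
          for chunk in pd.read_hdf('data.h5', 'df', chunksize=500000)]
totals = pd.concat(pieces).groupby(level=0).sum()

# ... or chunk by whatever is in the query
subset = pd.read_hdf('data.h5', 'df', where='grp == "a"')
```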
To be honest, I haven't paid much attention to HDFStore thus far. I'll go through it and see if it fits the needs. Thanks |
@jreback - I briefly reviewed HDFStore, and it seems to me that it's applicable to out-of-core computing more than to multi-process or multi-host processing. For example, having centralized metadata is going to be challenging when you have multiple parallel writers. Is there a way to do this with HDFStore? Or maybe there is a way to split the dataset over multiple files? |
Parallel writes are not supported (the docs warn you away from this quite a bit), but you can easily read in parallel. It is quite straightforward to avoid parallel writes, however; if you need a combine step, then do that synchronously and continue on. I often parallelize (either multi-core or distributed) processes, collect the results, combine, then spin off more processes to do the next step. This is a fairly typical way of doing things, and HDFStore could easily support this method of computation. So I am not exactly sure what you are asking. |
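A sketch of that pattern (parallel reads, synchronous combine), assuming the data was written in table format so that row ranges can be read with start/stop; the path, key, and column names are placeholders:

```python
import multiprocessing
import pandas as pd

PATH, KEY, CHUNK = 'data.h5', 'df', 500000

def work(bounds):
    """Read one slice of rows (read-only) and reduce it."""
    start, stop = bounds
    chunk = pd.read_hdf(PATH, KEY, start=start, stop=stop)
    return chunk.groupby('grp')['val'].sum()

if __name__ == '__main__':
    with pd.HDFStore(PATH, mode='r') as store:
        nrows = store.get_storer(KEY).nrows
    bounds = [(i, min(i + CHUNK, nrows)) for i in range(0, nrows, CHUNK)]

    pool = multiprocessing.Pool(4)
    parts = pool.map(work, bounds)                       # parallel reads
    pool.close()
    pool.join()

    result = pd.concat(parts).groupby(level=0).sum()     # synchronous combine
```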
I don't follow. When you parallelize the task, how do you store the intermediate results and combine them? One way would be to store them as different files and then combine them on read. Is that the suggestion? |
Yes, that's exactly what you would do. I don't know exactly what problem you are attempting, but for instance: 1) split the data into chunks, 2) compute on each chunk, 3) combine the results. 1 and 3 are not parallelizable, but 2 could be distributed, multi-core, or single-core. This is essentially a map-reduce paradigm, and here's my 2c: solve your problem in a general way single-core first, then, and only if it's not efficient, distribute/go multi-core. Too often people are prematurely optimizing! |
OK, what you describe is close to the workflow I have. One of the differences is that I use csv files instead of HDFStore. The problems I have with this workflow are: ... I think it'll be a great addition to pandas to provide a transparent way to achieve this. In short, an out-of-the-box interface that allows this would be great. |
The intent of this issue was to provide discussion related to the features of this, to in fact make it easier. This is a non-trivial problem, as data computations of this nature can take several different forms. I think supporting csv would be nice, but it certainly won't be in the first go (nor would you really want it anyhow; for a big file this is very inefficient, as you need to be able to slice chunks out of it - the good thing is that this is easy to fix by splitting the file or putting it in HDF5). What do you mean by multiple backends? You really mean multiple methods of computation / distribution, correct? What would help is a pseudo-code-like specification (maybe with some sample data) to get a good idea of what the different workflows are. |
@jreback I also briefly reviewed this paper: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf. IMO, a lot of its complexity has to do with the fact that the data they deal with is sparse. If we assume the data we deal with is dense, Figure 1 proposes a 'columnar representation of nested data', which I find very interesting. I understand it might have some similarities to HDFStore, but the differences I see are: ...

As far as the API goes, let's assume a simple financial data problem. I want to calculate VWAP (volume-weighted average price, per stock per day). Usually there are ~8000 stocks, ~250 million rows/day, and 60+ days of data. Ideally, I still want to keep a similar interface:

```python
def vwap(sdf):
    return (sdf.price * sdf.qty).sum() / sdf.qty.sum()

res = df.groupby(['date', 'stock']).apply(vwap)  # plus optional arg(s) for distribution
```

So ... |
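To tie that back to the chunking discussion: VWAP collates nicely because both the numerator and the denominator are plain sums, so each chunk only needs to contribute partial sums. A sketch, reusing the column names above (the HDF file and key names are placeholders):

```python
import pandas as pd

def partial_vwap(chunk):
    # per-chunk contributions to the numerator and denominator
    chunk = chunk.copy()
    chunk['pq'] = chunk.price * chunk.qty
    return chunk.groupby(['date', 'stock'])[['pq', 'qty']].sum()

pieces = [partial_vwap(chunk)
          for chunk in pd.read_hdf('ticks.h5', 'trades', chunksize=1000000)]
totals = pd.concat(pieces).groupby(level=['date', 'stock']).sum()
vwap = totals.pq / totals.qty
```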
In my case, the approach presented here using HDF stores does not work when one needs to do computations on a whole column or a whole row from a huge store, since pandas will retrieve the row/column and store it in memory to do the processing. |
Retrieving only 1 value at a time is extremely inefficient. You can certainly slice a ... The trick is that only a small group of computations can actually be done this way (e.g. even ...). The point of this PR is to wrap some of this functionality up in a way that is more transparent to the user. |
Just wanted to enthusiastically +1 this. Might be helpful also to take a look at the following aspects of the blaze project: |
+1 to subscribe to GitHub mail alerts. |
closing as this is just |
Conceptually, create a pipeline processor that performs out-of-core computation. This is easily parallelizable (multi-core or across machines); in theory cython / ipython / joblib / hadoop could operate with this.

requirements
- the data set must support chunking, and the function must operate only on that chunk
- useful in cases of a large number of rows, or a problem that you want to parallelize

input
- a chunking iterator that reads from disk (could take chunksize parameters, or take a handle and just call the iterators as well)
  - read_csv
  - HDFStore

function
- take an iterated chunk and an axis, and return another pandas object (could be a reduction, transformation, whatever)

output
- an output mechanism to take the function application; must support appending
  - to_csv (with appending)
  - HDFStore (table)

Map-reduce is an example of this type of pipelining.

Interesting Library
- https://github.com/enthought/distarray
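A minimal end-to-end sketch of the spec above (read_csv chunking in, a per-chunk function, an appending HDFStore out); the file names and the grouping column are placeholders, not part of the proposal:

```python
import pandas as pd

def pipeline(csv_path, store_path, func, chunksize=100000):
    """Read chunks from disk, apply func to each chunk,
    and append the results to an HDFStore table."""
    with pd.HDFStore(store_path) as store:
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            store.append('out', func(chunk))

# e.g. per-chunk group sums; a final combine pass is still needed,
# exactly as in the map-reduce discussion above
# pipeline('big.csv', 'out.h5', lambda c: c.groupby('YearMade').sum())
```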