Loading of large pickled dataframes fails #2705
Comments
You should also try using HDFStore (requires PyTables); make sure you are on 0.10.1-dev.

Storing as a table is queryable (and preserves dtypes). A plain put preserves dtypes but is not queryable (it will write much faster).
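A minimal sketch of the two storage modes being described, assuming an 0.10-era-compatible HDFStore API (the file name and keys are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': pd.date_range('2013-01-01', periods=1000)})

store = pd.HDFStore('backup.h5')

# table format: queryable and preserves dtypes
# (append always writes a table; in 0.10 put(..., table=True) did the same)
store.append('df_table', df)

# plain put ("storer"/fixed format): preserves dtypes, not queryable, writes faster
store.put('df_fixed', df)

store.close()
```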
I was trying to use HDFStore, and my workhorse machine somehow didn't get it compiled in correctly. I can import PyTables, but pandas doesn't know about it... I'll dig in on that as my workaround though.
I was posting under the wrong account before (I am the person who started this issue), but I got HDFStore working and immediately ran into:

File "hdf5Extension.pyx", line 884, in tables.hdf5Extension.Array._createArray (tables/hdf5Extension.c:8498)

That was on 0.10.0. I tried compiling 0.10.1.dev, but I couldn't even import pandas, so I am still stuck.
heard building it yourself is a little tough on a Mac
I was able to save smaller dataframes to an HDF5 store with 0.10.0, so that works. Is there any reason to expect 0.10.1.dev will allow me to save larger ones? That problem installing 0.10.1.dev actually screwed up my entire numpy install, and I had to re-install from scratch, so I'd rather not try that again. I suppose I can just write out a CSV file myself with the dtypes in a header row and cast the columns myself.
It depends on what dtypes you are trying to save. Strings are broken in 0.10.0 (they work, just very slowly). It also depends on whether you need query capability (e.g. you want to save as a table), which I'd recommend because strings and other data types are stored more efficiently. Can you post a df.get_dtype_counts() on your frame?

Are you storing as a table?
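For reference, the call being asked for just summarizes the column dtypes; a minimal sketch with a small stand-in frame:

```python
import numpy as np
import pandas as pd

# a small stand-in frame; the real question is about the user's large frame
df = pd.DataFrame({'x': np.random.randn(3), 's': ['a', 'b', 'c']})

# count of columns per dtype; in 0.10-era pandas this was df.get_dtype_counts()
print(df.dtypes.value_counts())
```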
I'll add that call, though it takes hours to get to the dataframe I want to write out (hence my wanting to save it). I am storing by assigning into the store, which the docs say is a put. I expect to pull the entire dataframe out to operate on it; this is truly just a way to store intermediate processing. The dataframe definitely has strings in it.

In all likelihood I'll simply write my own serializer that will tell me which columns are datetimes, so I can specify those columns in pd.read_csv (for my data I have to specify the datetime columns explicitly in order to get them parsed). None of this would matter if I could get pandas to set the dtype on a datetime column correctly (it is always object for me, despite the fact that I successfully parse the dates). Perhaps I'll open another issue for the datetime columns to get some separate advice.

I just want to say, since I sound like I am complaining: I love pandas, it has been a huge help in getting as far as I have with this data so quickly.
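A minimal sketch of specifying datetime columns explicitly on read, as described above (the file and column names are hypothetical):

```python
import pandas as pd

# 'created_at' and 'updated_at' stand in for whatever the real datetime columns are
df = pd.read_csv('intermediate.csv', parse_dates=['created_at', 'updated_at'])
print(df.dtypes)   # the parsed columns should come back as datetime64[ns]
```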
object dtype is bad! Definitely try to convert. Just a general recommendation: definitely try to split up the frames and store them in separate HDF files.
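A minimal sketch of the conversion being suggested. The column names here are hypothetical; the one-shot convert_objects call discussed in this thread is from 0.10-era pandas (it was later removed), so the runnable lines below use the current per-column converters:

```python
import pandas as pd

# a tiny frame with object columns standing in for the real one
df = pd.DataFrame({'when': ['2013-01-01', None], 'amount': ['1.5', 'bad']}, dtype=object)

# coerce per column; unparseable values become NaT / NaN
df['when'] = pd.to_datetime(df['when'], errors='coerce')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
print(df.dtypes)

# the 0.10-era one-shot call discussed in this thread was:
# df = df.convert_objects()
```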
FYI, the reason many use HDF5 is just raw speed. E.g. a random 2.5GB file (1.5M rows, 50 columns or so, all floats) took about 30s, and that's actually a panel stored as a table, which is far slower to write than a raw df. In addition, PyTables supports compression and utilizes multiple cores to read (and I'm just a fan of PyTables!).
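For reference, compression can be enabled when the store is opened; a minimal sketch (the file name is illustrative, and blosc availability depends on the PyTables build):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 5), columns=list('abcde'))

# compression settings are passed through to PyTables when the store is opened
store = pd.HDFStore('compressed.h5', complevel=9, complib='blosc')
store.append('df', df)
print(store.select('df').shape)
store.close()
```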
I tried convert_objects with no success. I don't know if this is the problem, but the columns with datetimes have missing values, which I am filling with np.nan when I cannot parse the date (I've tried None too). This may make it look like a "mixed" type column, which it really isn't. I just couldn't figure out what to fill in for the missing values that would connote missing but allow for proper typing. Same situation, I think, for the string columns: the missing values are represented as np.nan, which is a float, so pandas believes they are mixed types. Do I have this right? Are there any suggestions for handling this?

PyTables looks amazing; unfortunately right now I am just trying to get something done, so I haven't had time to experiment, but I am glad I got it all installed so I can when I have more time.
Try this for creation (works in 0.10.0, I think); 0.10.1 has a little better handling:
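A minimal sketch of the creation pattern being described, assuming the intent is to use pd.NaT rather than np.nan for missing datetimes (the parser and date format here are illustrative):

```python
import pandas as pd
from datetime import datetime

def parse_date(s):
    # illustrative parser: return pd.NaT (not np.nan) when the date is missing or unparseable
    try:
        return datetime.strptime(s, '%Y-%m-%d')
    except (TypeError, ValueError):
        return pd.NaT

raw = ['2013-01-15', None, '2013-02-01']   # hypothetical raw values with a missing entry
dates = pd.Series([parse_date(s) for s in raw])
print(dates.dtype)   # datetime64[ns], with NaT where the value was missing
```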
np.nan assignment in a datetime series will automatically get you an object dtype, with no hope of getting back out.

Strings have much better support in Table objects (e.g. pass table=True to put, or use append), as they are represented as individual columns and nan conversion is done (see the docs at http://pandas.pydata.org/pandas-docs/dev/io.html#storing-mixed-types-in-a-table, which also has an example of a NaT). Some of this might work in 0.10.0, but it's best to use 0.10.1-dev; several bugs specifically related to nan handling were fixed in 0.10.1-dev.

np.nan is correct for strings; it's just that PyTables (in 0.10.0) doesn't deal well with this.
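A minimal sketch along the lines of the linked docs section, showing a mixed frame with missing strings and missing datetimes going through a table store (the file and key names are illustrative):

```python
import numpy as np
import pandas as pd

df_mixed = pd.DataFrame({'A': np.random.randn(3),
                         'B': ['foo', np.nan, 'bar'],                               # missing string as np.nan
                         'C': pd.to_datetime(['2013-01-01', None, '2013-01-03'])})  # missing datetime becomes NaT

store = pd.HDFStore('mixed.h5')
store.append('df_mixed', df_mixed)   # table format; nan/NaT conversion is handled on write
print(store.select('df_mixed').dtypes)
store.close()
```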
And I just found this: #2595. So I see this is a known issue... I'll try to install 0.10.1 again sometime today and cross my fingers that numpy remains intact :)
if you want to hack it out
I tried pickling a very large dataframe (20GB or so). The write to disk succeeded, but when I try to read it back, it fails with:

ValueError: buffer size does not match array size
Now I did a bit of research and found the following:
http://stackoverflow.com/questions/12060932/unable-to-load-a-previously-dumped-pickle-file-of-large-size-in-python
http://bugs.python.org/issue13555
I am thinking this is a numpy/python issue, but it causes me pretty big pain when I want to back up a dataframe that took a long time to join together and I want all the dtypes stored (namely, which columns are datetimes). Perhaps a solution would be a CSV file that keeps the dtypes somewhere (otherwise I'll have to figure out which columns are serialized dates). Any workarounds would be appreciated.
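A minimal sketch of the CSV-plus-dtypes workaround floated above (the helper and file names are hypothetical):

```python
import json
import pandas as pd

# write the frame as CSV plus a sidecar file recording each column's dtype,
# then rebuild the datetime columns when reading it back
def save_with_dtypes(df, csv_path, dtypes_path):
    df.to_csv(csv_path, index=False)
    with open(dtypes_path, 'w') as f:
        json.dump({col: str(dtype) for col, dtype in df.dtypes.items()}, f)

def load_with_dtypes(csv_path, dtypes_path):
    with open(dtypes_path) as f:
        dtypes = json.load(f)
    date_cols = [col for col, dtype in dtypes.items() if dtype.startswith('datetime')]
    return pd.read_csv(csv_path, parse_dates=date_cols)
```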