
Loading of large pickled dataframes fails #2705

Closed
daggre-gmu opened this issue Jan 17, 2013 · 15 comments
Labels: Bug, IO Data (IO issues that don't fit into a more specific label)

Comments

@daggre-gmu

I tried pickling a very large dataframe (20GB or so). Writing it to disk succeeded, but when I try to read it back, it fails with: ValueError: buffer size does not match array size

Now I did a bit of research and found the following:

http://stackoverflow.com/questions/12060932/unable-to-load-a-previously-dumped-pickle-file-of-large-size-in-python

http://bugs.python.org/issue13555

I think this is a numpy/Python issue, but it causes me pretty big pain when I want to back up a dataframe that took a long time to join together and I want all the dtypes preserved (namely, which columns are datetimes). Perhaps a solution would be a CSV file that keeps the dtypes somewhere (otherwise I'll have to figure out which columns are serialized dates). Any workarounds would be appreciated.
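For anyone looking for that kind of workaround today, here is a minimal sketch of a CSV-plus-dtypes approach (the file names and the tiny stand-in frame are illustrative, and it assumes a reasonably current pandas):

import json
import pandas as pd

# Stand-in for the large joined frame described above.
df = pd.DataFrame({'id': [1, 2],
                   'when': pd.to_datetime(['2013-01-17', '2013-01-18'])})

# Write the data plus a sidecar file recording each column's dtype.
df.to_csv('frame.csv', index=False)
with open('frame.dtypes.json', 'w') as f:
    json.dump({c: str(t) for c, t in df.dtypes.items()}, f)

# On reload, use the sidecar to decide which columns to parse as dates.
with open('frame.dtypes.json') as f:
    dtypes = json.load(f)
date_cols = [c for c, t in dtypes.items() if t.startswith('datetime')]
df2 = pd.read_csv('frame.csv', parse_dates=date_cols)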

@wesm
Member

wesm commented Jan 17, 2013

You should also try using HDFStore (requires PyTables).

@jreback
Contributor

jreback commented Jan 17, 2013

Make sure you are on 0.10.1-dev.

store = pd.HDFStore('my_large_frame.h5', 'w')

This is queryable (and preserves dtypes):

store.append('df', df)

This preserves dtypes but is not queryable (it will write much faster):

store.put('df', df)
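A minimal end-to-end sketch of the two options (assuming PyTables is installed; the file name, keys, and stand-in frame are illustrative):

import numpy as np
import pandas as pd

# Small stand-in for the large frame discussed in this issue.
df = pd.DataFrame({'a': np.arange(5),
                   'b': pd.date_range('2013-01-01', periods=5)})

store = pd.HDFStore('my_large_frame.h5', 'w')
store.append('df_table', df)   # table format: queryable, preserves dtypes, slower to write
store.put('df_fixed', df)      # fixed format: preserves dtypes, not queryable, much faster to write
store.close()

df2 = pd.read_hdf('my_large_frame.h5', 'df_table')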

@daggre-gmu
Author

I was trying to use HDFStore, but my workhorse machine somehow didn't get it compiled in correctly. I can import PyTables, but pandas doesn't know about it... I'll dig into that as my workaround, though.

@jostheim

I was posting under the wrong account before (I am the person who started this issue), but I got HDFStore working and immediately ran into:

File "hdf5Extension.pyx", line 884, in tables.hdf5Extension.Array._createArray (tables/hdf5Extension.c:8498)
tables.exceptions.HDF5ExtError: Problems creating the Array.

That was on 0.10.0. I tried compiling 0.10.1.dev, but I couldn't even import pandas:

import pandas
numpy.dtype has the wrong size, try recompiling
Traceback (most recent call last):
File "", line 1, in
File "/Library/Python/2.7/site-packages/pandas-0.10.1.dev_6e2b6ea-py2.7-macosx-10.8-intel.egg/pandas/init.py", line 6, in
from . import hashtable, tslib, lib
File "numpy.pxd", line 156, in init pandas.hashtable (pandas/hashtable.c:20380)
ValueError: numpy.dtype has the wrong size, try recompiling

So I am still stuck.

@jreback
Contributor

jreback commented Jan 18, 2013

I've heard that building it yourself is a little tough on a Mac.
Did you try this: http://ericjang.tumblr.com/post/25096909713/annoying-pytables-build-on-mac-osx-10-7
From your error it looks like pandas and PyTables were built against different numpy versions.

@jostheim

I was able to save smaller dataframes to an HDF5 store with 0.10.0, so that works. Is there any reason to expect 0.10.1.dev will let me save larger ones?

That problem installing 0.10.1.dev actually screwed up my entire numpy install; I had to re-install from scratch, so I'd rather not try that again.

I suppose I can just write out a CSV file myself with the dtypes in a header row and cast the columns myself.

@jreback
Contributor

jreback commented Jan 18, 2013

It depends on what dtypes you are trying to save. Strings are broken in 0.10.0 (they work, just very slowly). It also depends on whether you need query capability (i.e. you want to save as a table), which I'd recommend because strings and other data types are stored more efficiently.

Can you post the output of df.get_dtype_counts() on your frame?
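For anyone reading this now: df.get_dtype_counts() has since been removed from pandas. A quick sketch of the equivalent dtype breakdown with a current API (the frame is a stand-in):

import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0],
                   's': ['a', 'b'],
                   't': pd.to_datetime(['2013-01-17', '2013-01-18'])})

# Equivalent of the old df.get_dtype_counts(): counts of columns per dtype.
print(df.dtypes.value_counts())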

@jreback
Contributor

jreback commented Jan 18, 2013

Are you storing with put or append? Are you anticipating reading the entire frame into memory for operations later on?

@jostheim

I'll add that call, though it takes hours to get to the dataframe I want to write out (hence me wanting to save it).

I am storing as a put currently:

store['blah'] = df

which the docs say is a put.

I expect to pull the entire dataframe out to operate on it; this is truly just a way to store intermediate processing. The dataframe definitely has strings in it.

In all likelihood I'll simply write my own serializer that will tell me which columns are datetimes so I can specify those columns in pd.read_csv (for my data I have to specify the datetime columns explicitly in order to get them parsed). None of this would matter if I could get pandas to set the dtype on a datetime column correctly (it is always object for me, despite the fact that I successfully parse the dates). Perhaps I'll open another issue for the datetime columns to get some separate advice.

Since I sound like I am complaining, I just want to say that I love pandas; it has been a huge help in getting as far as I have with this data so quickly.

@jreback
Contributor

jreback commented Jan 18, 2013

Object dtype is bad! Definitely try to convert.

Just a general recommendation: definitely try to split up frames and store them in separate HDF files; it's easier to debug and inspect things. Especially when I am doing something new, I make lots of intermediate data steps (where I save data); you can always combine later.

Try df.convert_objects() to convert your dates.
Take a look at this question as well: http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190
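A minimal sketch of that conversion: df.convert_objects() is the 0.10-era call and has since been removed; pd.to_datetime with errors='coerce' is the current way to force an object column of date strings with missing values into datetime64[ns] (the column name is illustrative):

import numpy as np
import pandas as pd

# Object-dtyped date column with missing values, as described above.
df = pd.DataFrame({'when': ['2013-01-17', None, '2013-01-18', np.nan]})
print(df.dtypes)  # when: object

# Unparseable/missing entries become NaT and the column becomes datetime64[ns].
df['when'] = pd.to_datetime(df['when'], errors='coerce')
print(df.dtypes)  # when: datetime64[ns]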

@jreback
Contributor

jreback commented Jan 18, 2013

FYI, the reason many use HDF5 is just raw speed. E.g. a random 2.5GB file (1.5M rows, 50 columns or so, all floats) took about 30s, and it's actually a panel stored as a table, which is much slower than a raw df. In addition, PyTables supports compression and uses multiple cores to read (and I'm just a fan of PyTables!).
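A small sketch of the compression side of that (complevel/complib are standard HDFStore options; 'blosc' requires the corresponding PyTables support, and the frame is a random stand-in):

import numpy as np
import pandas as pd

# Stand-in for a large all-float frame.
df = pd.DataFrame(np.random.randn(100000, 50),
                  columns=['c%d' % i for i in range(50)])

# Write a compressed table; blosc is fast, zlib is the most widely available.
with pd.HDFStore('compressed.h5', 'w', complevel=9, complib='blosc') as store:
    store.append('df', df)

df2 = pd.read_hdf('compressed.h5', 'df')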

@jostheim

I tried convert_objects with no success. I don't know if this is the problem, but the columns with datetimes have missing values, which I am filling with np.nan when I cannot parse the date (I've tried None too). This may make it look like a "mixed"-type column, which it really isn't. I just couldn't figure out what to fill in for the missing values that would connote missing but still allow proper typing. Same situation, I think, for the string columns: the missing values are represented as np.nan, which is a float, so pandas believes they are mixed types.

Do I have this right? Are there any suggestions for handling this?

PyTables looks amazing. Unfortunately, right now I am just trying to get something done, so I haven't had time to experiment, but I am glad I got it all installed so I can when I have more time.

@jreback
Contributor

jreback commented Jan 18, 2013

Try this for creation (works in 0.10.0, I think); 0.10.1 has a little better handling:

from datetime import datetime
import pandas as pd

# obviously you can also use Timestamp objects
s = pd.Series([datetime(2001, 1, 2, 0, 0), pd.tslib.iNaT], dtype='M8[ns]')

Assigning np.nan into a datetime series will automatically get you an object dtype with no hope of getting out (though in 0.10.1, if you have a datetime64[ns] series and assign np.nan, it will work).

Strings have much better support in Table objects (e.g. pass table=True to put, or use append), as they are represented as individual columns and NaN conversion is done; see the docs at http://pandas.pydata.org/pandas-docs/dev/io.html#storing-mixed-types-in-a-table, which also include an example with a NaT. Some of this might work in 0.10.0, but it's best to use 0.10.1-dev; several bugs specifically related to NaN handling were fixed there.

np.nan is correct for strings; it's just that PyTables (in 0.10.0) doesn't deal well with this.
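A sketch of how this looks end to end in a current pandas, where pd.NaT (rather than pd.tslib.iNaT, which no longer exists) marks a missing datetime and np.nan still marks a missing string; the frame and file name are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'when': pd.to_datetime(['2013-01-18', None]),  # None becomes NaT; dtype stays datetime64[ns]
    'name': ['foo', np.nan],                       # np.nan marks the missing string
    'x': [1.5, 2.5],
})

# Store as a table (the modern spelling of table=True / append), preserving dtypes.
with pd.HDFStore('mixed.h5', 'w') as store:
    store.append('df', df)

print(pd.read_hdf('mixed.h5', 'df').dtypes)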

@jostheim

And I just found this: #2595

So I see this is an issue... I'll try to install 0.10.1 again sometime today and cross my fingers that numpy remains :)

@jreback
Contributor

jreback commented Jan 18, 2013

If you want to hack it out: there are no Cython changes in this, so you could just update the Python code in order to test it (it's not that much).
