DataFrame.to_pickle() fails for .zip format on MacOS and pandas 0.20.3 #17778

Closed
normanius opened this issue Oct 4, 2017 · 8 comments · May be fixed by reef-technologies/pandas#2
Labels
Bug good first issue IO Data IO issues that don't fit into a more specific label
Comments

@normanius

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4), index=['A', 'B', 'C'])
df.to_pickle('out.zip')
#pd.read_pickle('out.zip')

Problem description

The exception below is raised. I do have write permission in the working directory. The same code worked with pandas 0.19.0.

No problems observed for bz2 and gzip compression (xz I haven't tested).

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1378, in to_pickle
    df.to_pickle('tmp.zip')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/pickle.py", line 27, in to_pickle
    is_text=False)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/common.py", line 352, in _get_handle
    zip_file = zipfile.ZipFile(path_or_buf)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 756, in __init__
    self.fp = open(file, modeDict[mode])
IOError: [Errno 2] No such file or directory: 'out.zip'

Expected Output

A zip file that one can re-read with pandas.read_pickle().

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: 0.26
numpy: 1.14.0.dev0+029863e
scipy: 0.18.1
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@normanius normanius changed the title DataFrame.to_pickle() fails for .zip format on MacOS and pandas 20.3 DataFrame.to_pickle() fails for .zip format on MacOS and pandas 0.20.3 Oct 4, 2017
@normanius
Author

normanius commented Oct 4, 2017

The problem is located in _get_handle() of module pandas.io.common:

# ZIP Compression
elif compression == 'zip':
    import zipfile
    zip_file = zipfile.ZipFile(path_or_buf)
    zip_names = zip_file.namelist()
    if len(zip_names) == 1:
        f = zip_file.open(zip_names.pop())
    elif len(zip_names) == 0:
        raise ValueError('Zero files found in ZIP file {}'
                         .format(path_or_buf))
    else:
        raise ValueError('Multiple files found in ZIP file.'
                         ' Only one file per ZIP: {}'
                         .format(zip_names))

With this code, the zip file is opened only for reading, never for writing: zipfile.ZipFile is called without a mode and therefore defaults to 'r'. The mode argument should be used here.
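A minimal sketch of the behaviour described above, using only the standard library (the temporary path is an assumption made for illustration and is not taken from pandas): opening a non-existent archive with the default mode fails, while mode 'w' creates it.

```python
import os
import tempfile
import zipfile

# Illustrative path in a fresh temporary directory (not from the issue).
path = os.path.join(tempfile.mkdtemp(), "out.zip")

# Default mode is 'r', so a missing file raises IOError/OSError,
# which matches the traceback in the report.
try:
    zipfile.ZipFile(path)
    read_failed = False
except (IOError, OSError):
    read_failed = True

# With an explicit write mode the archive is created instead.
with zipfile.ZipFile(path, "w"):
    pass
created = os.path.exists(path)
```

This only shows why the read-only default breaks to_pickle(); it does not by itself make writing work.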

@chris-b1
Contributor

chris-b1 commented Oct 4, 2017

Yep, problem does seem to be not passing the correct mode, PR to fix welcome!

@chris-b1 chris-b1 added Bug IO Data IO issues that don't fit into a more specific label Difficulty Novice labels Oct 4, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Oct 4, 2017
@masongallo
Contributor

It looks like the code for zip was written only for reading? Why not use gzip to write a single zip file?

sec147 added a commit to sec147/pandas that referenced this issue Oct 5, 2017
BUG: use mode when opening ZipFile. pandas-dev#17778
@s4chin

s4chin commented Oct 12, 2017

Can I try this? I'm looking for a first issue as an entry point.

@chris-b1
Contributor

Yes, go ahead!

@s4chin

s4chin commented Oct 13, 2017

mode is 'wb' when writing to the zip file. zipfile.ZipFile only accepts 'a', 'r', 'w' as modes, hence 'wb' needs to be converted to 'w'.
After doing this, it gives me

File "pandas/io/common.py", line 369, in _get_handle
    .format(path_or_buf))
ValueError: Zero files found in ZIP file out.zip

So I just took out the if ... elif ... else part and did f = zipfile.ZipFile(path_or_buf, 'w'), which results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 1611, in to_pickle
    protocol=protocol)
  File "pandas/io/pickle.py", line 45, in to_pickle
    pkl.dump(obj, f, protocol=protocol)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 1123, in write
    st = os.stat(filename)
TypeError: must be encoded string without NULL bytes, not str

Any pointers on how to move ahead? As @masongallo said, the code looks like it was meant only for reading.
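The two failures above can be reproduced outside pandas with a rough sketch. The helper name zip_mode is hypothetical (it is not a pandas function) and only handles simple modes such as 'rb', 'wb' and 'ab'. A freshly created archive has no members, which is why the namelist() check raises "Zero files found". The later TypeError appears to come from pkl.dump calling f.write(data): ZipFile.write() expects a filename to add, not the payload bytes.

```python
import os
import tempfile
import zipfile


def zip_mode(file_mode):
    """Hypothetical helper: drop the binary/text flag, e.g. 'wb' -> 'w'."""
    return file_mode.replace("b", "").replace("t", "")


# Illustrative path in a fresh temporary directory (not from the issue).
path = os.path.join(tempfile.mkdtemp(), "out.zip")

zf = zipfile.ZipFile(path, zip_mode("wb"))
names = zf.namelist()  # a new archive has no members yet
zf.close()
```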

@normanius
Author

When I looked at it, I didn't find a straightforward way of doing it. The problem is that io.common._get_handle() needs to create an object with a file-like interface (read, write, open) to which you can later write strings/bytes. zipfile.ZipFile is more a container for files than a container for strings, so I'm not sure it can be used like a normal file handle.

Maybe one can construct something around ZipFile.writestr(), which takes bytes instead of a file to write into the zip file. This won't give you a file handle by itself, but maybe you can build one using functools or StringIO. For this, though, one needs to understand where the file handle is used.
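One possible shape for such a wrapper, as a sketch only: buffer everything in memory and hand the bytes to ZipFile.writestr() when the handle is closed. The class name ZipWriteBuffer, the member name out.pkl and the temporary path are all hypothetical, the code is written for Python 3, and it is not claimed to be how pandas does or should implement this.

```python
import io
import os
import pickle
import tempfile
import zipfile


class ZipWriteBuffer(io.BytesIO):
    """Hypothetical file-like object: collects bytes in memory and stores
    them as a single archive member when closed."""

    def __init__(self, path, arcname):
        super().__init__()
        self._path = path
        self._arcname = arcname

    def close(self):
        # Guard against a second close() so the archive is written once.
        if not self.closed:
            with zipfile.ZipFile(self._path, "w", zipfile.ZIP_DEFLATED) as zf:
                zf.writestr(self._arcname, self.getvalue())
        super().close()


# Illustrative round trip with a plain dict instead of a DataFrame.
path = os.path.join(tempfile.mkdtemp(), "out.zip")
payload = {"A": [1, 2, 3]}

with ZipWriteBuffer(path, "out.pkl") as f:
    pickle.dump(payload, f)

with zipfile.ZipFile(path) as zf:
    names = zf.namelist()
    restored = pickle.loads(zf.read(names[0]))
```

The trade-off is that the whole pickle is held in memory before it reaches the archive, which may matter for large frames.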

Alternatively follow up on @masongallo comment regarding gzip?

ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 3, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 6, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 6, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 7, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 7, 2017
@minggli
Contributor

minggli commented Mar 17, 2018

Hi @jreback ,

Will try to fix this issue if it hasn't been fixed since last conversation. Reverting.

Thanks,

Ming

7 participants