DataFrame.to_pickle() fails for .zip format on MacOS and pandas 0.20.3 #17778

normanius · 2017-10-04T12:04:09Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4), index=['A', 'B', 'C'])
df.to_pickle('out.zip')
#pd.read_pickle('out.zip')

Problem description

The exception below is raised, even though I have write permission in the working directory. The same code worked with pandas 0.19.0.

No problems observed with bz2 and gzip compression (I haven't tested xz).

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1378, in to_pickle
    df.to_pickle('tmp.zip')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/pickle.py", line 27, in to_pickle
    is_text=False)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/common.py", line 352, in _get_handle
    zip_file = zipfile.ZipFile(path_or_buf)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 756, in __init__
    self.fp = open(file, modeDict[mode])
IOError: [Errno 2] No such file or directory: 'out.zip'

Expected Output

A zip file that one can re-read with pandas.read_pickle().

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: 0.26
numpy: 1.14.0.dev0+029863e
scipy: 0.18.1
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


normanius · 2017-10-04T12:12:41Z

The problem is located in _get_handle() of module pandas.io.common:

# ZIP Compression
elif compression == 'zip':
    import zipfile
    zip_file = zipfile.ZipFile(path_or_buf)
    zip_names = zip_file.namelist()
    if len(zip_names) == 1:
        f = zip_file.open(zip_names.pop())
    elif len(zip_names) == 0:
        raise ValueError('Zero files found in ZIP file {}'
                         .format(path_or_buf))
    else:
        raise ValueError('Multiple files found in ZIP file.'
                         ' Only one file per ZIP: {}'
                         .format(zip_names))

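With this code, the zip file is only ever opened for reading, never for writing: zipfile.ZipFile(path_or_buf) is called without a mode, so it falls back to the default read mode. The mode argument should certainly be passed through somewhere.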
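A minimal standard-library sketch of that behaviour, assuming Python 3 (the traceback above is from Python 2.7, where the same failure surfaces as IOError). The temporary directory and the variable names are only for illustration:

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "out.zip")

    # Without an explicit mode, ZipFile opens the path for reading,
    # so a file that does not exist yet cannot be opened.
    try:
        zipfile.ZipFile(missing)
        read_error = None
    except OSError as exc:
        read_error = exc

    # With mode="w" the archive is created instead.
    with zipfile.ZipFile(missing, mode="w"):
        pass
    created = os.path.exists(missing)
```

This only reproduces the failure to open the file; it says nothing yet about how the pickled bytes would get into the archive.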

chris-b1 · 2017-10-04T13:35:12Z

Yep, the problem does seem to be that the correct mode isn't passed. A PR to fix this is welcome!

masongallo · 2017-10-04T20:07:51Z

It looks like the code for zip was written only for reading? Why not use gzip to write a single zip file?
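For comparison, a small sketch of why gzip is easier to write to, assuming Python 3 and using only the standard library: gzip.open() returns an ordinary writable file object, so pickle can dump straight into it. Note that the result is a gzip stream, not a ZIP archive, so this would not produce a file that ZIP tools can open. The payload and file name are made up for the example:

```python
import gzip
import os
import pickle
import tempfile

payload = {"A": [1.0, 2.0], "B": [3.0, 4.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.pkl.gz")

    # gzip.open() hands back a file-like object with a write() method.
    with gzip.open(path, "wb") as handle:
        pickle.dump(payload, handle)

    # Reading works the same way, through a file-like object.
    with gzip.open(path, "rb") as handle:
        restored = pickle.load(handle)
```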

s4chin · 2017-10-12T13:38:02Z

Can I try this? I'm looking for a first issue as an entry point.

chris-b1 · 2017-10-12T13:47:20Z

Yes, go ahead!

s4chin · 2017-10-13T08:50:13Z

mode is 'wb' when writing to the zip file. zipfile.ZipFile only accepts 'a', 'r', 'w' as modes, hence 'wb' needs to be converted to 'w'.
After doing this, it gives me

File "pandas/io/common.py", line 369, in _get_handle
    .format(path_or_buf))
ValueError: Zero files found in ZIP file out.zip

So I removed the if ... elif ... else block and used f = zipfile.ZipFile(path_or_buf, 'w') instead, which results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 1611, in to_pickle
    protocol=protocol)
  File "pandas/io/pickle.py", line 45, in to_pickle
    pkl.dump(obj, f, protocol=protocol)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 1123, in write
    st = os.stat(filename)
TypeError: must be encoded string without NULL bytes, not str

Any pointers on how to move ahead? As @masongallo said, the code looks like it was meant only for reading.
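The two observations above can be reproduced in isolation. The following is a rough sketch assuming Python 3; zip_mode is a hypothetical helper name, not something from pandas. It shows that a freshly created archive has no members (which is why the "exactly one file" check raises), and that bytes go in through writestr(), whereas write() expects the name of a file on disk:

```python
import os
import tempfile
import zipfile


def zip_mode(mode):
    """Hypothetical helper: turn a file mode such as 'wb' into one ZipFile accepts."""
    return mode.replace("b", "")


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "out.zip")

    with zipfile.ZipFile(archive_path, zip_mode("wb")) as archive:
        # A new archive is empty, so a check that expects one member fails here.
        member_count = len(archive.namelist())

        # writestr() stores raw bytes under a member name;
        # write() would instead look for a file with that name on disk.
        archive.writestr("payload.bin", b"some pickled bytes")
        names_after = archive.namelist()
```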

normanius · 2017-10-13T18:21:12Z

When I looked at it, I didn't find a straightforward way of doing it. The problem is that io.common._get_handle() needs to create an object with a file-like interface (read, write, open) to which you can later write strings/bytes. zipfile.ZipFile is more a container for files than a container for strings, so I'm not sure it can be used like a normal file handle.

Maybe one can construct something around ZipFile.writestr(), which takes bytes instead of files to write into the zip file. This won't give you a file handle by itself, but maybe you can build one using functools or StringIO. For this, though, one needs to understand where the file handle is used.
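One way this idea could look, as a sketch only and assuming Python 3: a small wrapper that collects writes in an in-memory buffer and stores them as a single archive member with writestr() when it is closed. BufferedZipWriter, the member name, and the payload are all hypothetical; this is not necessarily how the fix that was eventually merged works.

```python
import io
import os
import pickle
import tempfile
import zipfile


class BufferedZipWriter:
    """Hypothetical helper: buffer writes in memory, then store them
    as one archive member when the handle is closed."""

    def __init__(self, path, member_name):
        self._path = path
        self._member_name = member_name
        self._buffer = io.BytesIO()

    def write(self, data):
        return self._buffer.write(data)

    def close(self):
        if self._buffer.closed:
            return
        # writestr() accepts the buffered bytes directly.
        with zipfile.ZipFile(self._path, "w", zipfile.ZIP_DEFLATED) as archive:
            archive.writestr(self._member_name, self._buffer.getvalue())
        self._buffer.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


payload = {"A": [0.1, 0.2], "B": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "out.zip")

    # pickle only needs an object with a write() method.
    with BufferedZipWriter(archive_path, "out.pkl") as handle:
        pickle.dump(payload, handle)

    with zipfile.ZipFile(archive_path) as archive:
        member_names = archive.namelist()
        restored = pickle.loads(archive.read(member_names[0]))
```

The trade-off is that the whole payload sits in memory until close(), which is acceptable for a sketch but may matter for large frames.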

Alternatively, follow up on @masongallo's comment regarding gzip?

minggli · 2018-03-17T12:48:18Z

Hi @jreback ,

Will try to fix this issue if it hasn't been fixed since last conversation. Reverting.

Thanks,

Ming

normanius changed the title ~~DataFrame.to_pickle() fails for .zip format on MacOS and pandas 20.3~~ DataFrame.to_pickle() fails for .zip format on MacOS and pandas 0.20.3 Oct 4, 2017

chris-b1 added the Bug, IO Data, and Difficulty Novice labels Oct 4, 2017

chris-b1 added this to the Next Major Release milestone Oct 4, 2017

sec147 added a commit to sec147/pandas that referenced this issue Oct 5, 2017

Update common.py

c39db5e

BUG: use mode when opening ZipFile. pandas-dev#17778

TomAugspurger added the good first issue label Oct 11, 2017

ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 3, 2017

Merge branch 'master' into pandas-dev#17778

b8b7a66

ghost mentioned this issue Nov 3, 2017

BUG: #17778 - DataFrame.to_pickle() fails for .zip format on MacOS and pandas 0.20.3 reef-technologies/pandas#2

Open

4 tasks

ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 6, 2017

Merge branch 'master' into pandas-dev#17778

f70c68e

ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 6, 2017

Merge branch 'master' into pandas-dev#17778

5896f93

ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 7, 2017

Merge branch 'master' into pandas-dev#17778

cfb27c4

ghost pushed a commit to reef-technologies/pandas that referenced this issue Nov 7, 2017

Merge branch 'master' into pandas-dev#17778

8c3d612

jreback added good first issue and removed good first issue Difficulty Novice labels Dec 15, 2017

minggli mentioned this issue Mar 17, 2018

EHN: allow zip compression in to_pickle, to_json, to_csv #20394

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.23.0 Mar 20, 2018

jreback closed this as completed in #20394 Mar 22, 2018

minggli mentioned this issue May 20, 2018

BUG: set keyword argument so zipfile actually compresses #21144

Merged

4 tasks

minggli mentioned this issue Jun 14, 2018

BUG/REG: file-handle object handled incorrectly in to_csv #21478

Merged

4 tasks

TomAugspurger mentioned this issue Jun 28, 2018

Permission Denied when writing to HDFS dask/knit#132

Open
