EHN: allow zip compression in to_pickle, to_json, to_csv #20394
@@ -425,6 +428,24 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
    return f, handles


class BytesZipFile(ZipFile, BytesIO):
I personally like this location. I would keep it here.
Codecov Report
@@ Coverage Diff @@
## master #20394 +/- ##
==========================================
+ Coverage 91.77% 91.79% +0.01%
==========================================
Files 152 152
Lines 49205 49233 +28
==========================================
+ Hits 45159 45191 +32
+ Misses 4046 4042 -4
Continue to review full report at Codecov.
ensure_clean() somehow fails to randomize on one of the 2.7 configurations with xdist, which is strange. pickle also seems to work differently in Python 2 and there does not seem to be a way around it, so zip compression is only made available for Python >= 3.x.
Branch updated: 1b5fc77 to 4ac9488
zip compression for pickle should now work in Python 2 as well as Python 3, and so does zip compression for json. However, csv zip compression works only in Python 3, not Python 2.
pandas/tests/frame/test_to_csv.py
Outdated
        df = DataFrame([[0.123456, 0.234567, 0.567567],
                        [12.32112, 123123.2, 321321.2]],
                       index=['A', 'B'], columns=['X', 'Y', 'Z'])

        if PY2 and compression == 'zip':
            pytest.xfail(reason='zip compression for csv not suppported in'
this should be a skip
Thanks for the comment. test_to_csv should no longer need a skip or xfail.
pandas/tests/frame/test_to_csv.py
Outdated
            assert_frame_equal(df, read_csv(fh, index_col=0))

    @pytest.mark.xfail(reason='zip compression is now supported for csv.')
why are you xfailing this?
This is an old test case that asserts a BadZipFile exception is raised, written when zip compression was not supported. It now fails because that exception is no longer raised. The test case is redundant and was removed in 04886e9.
        result = fh.read().decode('utf8')
    assert_frame_equal(df, pd.read_json(result))


@pytest.mark.xfail(reason='zip compression is now supported for json.')
why are you xfailing this?
same as above.
pandas/tests/series/test_io.py
Outdated
        s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
                   name='X')

        if PY2 and compression == 'zip':
            pytest.xfail(reason='zip compression for csv not suppported in'
skip
this skip or xfail is no longer needed to handle zip compression (write) in Python 2.
                          index_col=0, squeeze=True)
            assert_series_equal(s, rs)

            # explicitly ensure file was compressed
            with tm.decompress_file(filename, compression_no_zip) as fh:
are there any uses of the compression_no_zip fixture left?
I don't think so. The compression_no_zip fixture existed solely to exclude zip compression from tests, because writing zip compression had not been implemented.
@@ -425,6 +428,18 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
    return f, handles


class BytesZipFile(ZipFile, BytesIO):
Can you add a little more to this class's docstring, e.g. why it's needed?
Added. We currently can't write zip compressed pickle, json, or csv, only read them. The standard library ZipFile isn't designed to produce a writable file handle, hence the custom class.
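To make the reply above concrete, here is a minimal standard-library sketch of the idea, not the code in this PR. `WritableZipFile` and `pickle_roundtrip` are hypothetical names, and the sketch assumes the whole payload arrives in a single `write` call (which is why it uses `pickle.dumps` rather than `pickle.dump`).

```python
import os
import pickle
import tempfile
import zipfile


class WritableZipFile(zipfile.ZipFile):
    """Hypothetical stand-in for the PR's BytesZipFile (illustration only)."""

    def write(self, data):
        # ZipFile.write() normally expects the name of a file on disk.
        # writestr() accepts raw bytes, so route the payload there and
        # store it as one archive member named after the zip file itself.
        self.writestr(os.path.basename(self.filename), data)


def pickle_roundtrip(obj, path):
    """Pickle obj into a zip archive at path, then load it back."""
    with WritableZipFile(path, "w", zipfile.ZIP_DEFLATED) as fh:
        # Assumption: one write() call per archive, so serialize up front.
        fh.write(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    with zipfile.ZipFile(path) as zf:
        return pickle.loads(zf.read(zf.namelist()[0]))


with tempfile.TemporaryDirectory() as tmp:
    restored = pickle_roundtrip({"A": [1, 2, 3]},
                                os.path.join(tmp, "obj.pkl.zip"))
    assert restored == {"A": [1, 2, 3]}
```

Calling `write` more than once on this sketch would add a second member with the same name, so a caller that streams data in chunks would need a different design.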
pandas/io/formats/csvs.py
Outdated
@@ -150,6 +150,16 @@ def save(self):

            self._save()

            # GH 17778 handles compression for byte strings.
why are you handling this here and not in the finally?
moved to finally.
pandas/io/pickle.py
Outdated
@@ -62,7 +62,6 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
    2    2    7
    3    3    8
    4    4    9
can you add back the blank lines you removed
added back.
zip compression should now work for both reading and writing pickle, json, and csv in Python 2/3. @jreback all changes implemented. Further comments welcome.
@jreback @jorisvandenbossche any comments on this PR?
pandas/core/frame.py
Outdated
-            allowed values are 'gzip', 'bz2', 'xz',
-            only used when the first argument is a filename
+            allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
+            first argument is a filename.
Let's fix this parameter description a bit:
a string representing the compression to use in the output file.
Allowed values are 'gzip', 'bz2', 'zip', 'xz'. This input is only used
when the first argument is a filename.
done.
pandas/core/series.py
Outdated
-            allowed values are 'gzip', 'bz2', 'xz', only used when the first
-            argument is a filename
+            allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
+            first argument is a filename
Let's fix the docstring here as I suggested for frame.py.
done.
thanks @minggli keep em coming!
git diff upstream/master -u -- "*.py" | flake8 --diff
Currently, to_pickle doesn't support compression='zip', whereas read_pickle is able to unpickle a zip compressed file. The same applies to the to_json and to_csv methods under generic, frame, series.

The standard library ZipFile class's default write method requires a filename, but pickle.dump(obj, f, protocol) requires f to be a file-like object (i.e. a file handle) that supports writing bytes. Create a new BytesZipFile class that allows pickle.dump to write bytes into a zip file handle.

Now zip compressed objects (pickle, json, csv) are roundtripable with read_pickle, read_json, read_csv.

Need suggestions as to where to put the BytesZipFile class, which overrides the write method with writestr. Other comments welcome.
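For the text formats (csv, json), the same writestr-based idea can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the pandas implementation: `write_csv_zip` and `read_csv_zip` are hypothetical helpers, the text is encoded as UTF-8, and the archive is assumed to hold exactly one member.

```python
import csv
import io
import os
import tempfile
import zipfile


def write_csv_zip(rows, path):
    """Write rows as a single CSV member of a zip archive (hypothetical)."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    # Text must become bytes before it goes into the archive; writestr()
    # takes the member name and the payload, so no file on disk is needed.
    payload = text.getvalue().encode("utf-8")
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(os.path.basename(path), payload)


def read_csv_zip(path):
    """Read the single CSV member back into a list of rows (hypothetical)."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        if len(names) != 1:
            # Assumption for this sketch: one member per archive.
            raise ValueError("expected exactly one file in the zip archive")
        text = zf.read(names[0]).decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline="")))


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "frame.csv.zip")
    write_csv_zip([["", "X", "Y"], ["A", "1", "2"]], target)
    assert read_csv_zip(target) == [["", "X", "Y"], ["A", "1", "2"]]
```

Buffering the full text before compressing keeps the archive to one member, at the cost of holding the whole serialized output in memory.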