EHN: allow zip compression in `to_pickle`, `to_json`, `to_csv` #20394

minggli · 2018-03-17T19:16:51Z

closes DataFrame.to_pickle() fails for .zip format on MacOS and pandas 0.20.3 #17778
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Currently, to_pickle doesn't support compression='zip', whereas read_pickle is able to unpickle a zip-compressed file. The same applies to the to_json and to_csv methods under generic, frame, and series.

The standard library ZipFile class's default write method requires a filename, but pickle.dump(obj, f, protocol) requires f to be a file-like object (i.e. a file handle) that supports writing bytes. This PR creates a new BytesZipFile class that allows pickle.dump to write bytes into a zip file handle.

Now zip-compressed objects (pickle, json, csv) can be round-tripped with read_pickle, read_json, read_csv.

Need suggestions as to where to put the BytesZipFile class, which overrides the write method with writestr. Other comments welcome.
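For readers following along, here is a minimal standard-library sketch of the write-versus-writestr mismatch described above. It is not the code from this PR: the class name `WritableZipFile`, the `member_name` parameter, and the two helper functions are hypothetical, and the sketch assumes the pickled payload arrives in a single `write` call (true for small objects with CPython's default pickler).

```python
import io
import pickle
import zipfile


class WritableZipFile(zipfile.ZipFile):
    """Hypothetical stand-in for BytesZipFile: a ZipFile whose write() takes bytes."""

    def __init__(self, file, mode="w", member_name="payload", **kwargs):
        # Callers may pass binary modes such as 'wb'; ZipFile only accepts
        # 'r', 'w', 'x' or 'a', so the 'b' is dropped.
        super().__init__(file, mode.replace("b", ""), **kwargs)
        self._member_name = member_name

    def write(self, data):
        # ZipFile.write() expects the path of a file on disk.
        # writestr() stores in-memory bytes, which is what pickle.dump emits.
        # Every call adds a new archive member, hence the single-write assumption.
        self.writestr(self._member_name, data)


def dump_to_zip(obj, target):
    """Pickle obj into a one-member zip archive written to target."""
    with WritableZipFile(target, "wb", compression=zipfile.ZIP_DEFLATED) as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_from_zip(source):
    """Read the single member of the archive back and unpickle it."""
    with zipfile.ZipFile(source) as archive:
        (name,) = archive.namelist()
        return pickle.loads(archive.read(name))
```

Because `target` may be a path or an in-memory buffer, the round trip can be checked without touching the file system by passing an `io.BytesIO` to both helpers.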

gfyoung · 2018-03-18T10:07:08Z

pandas/io/common.py

@@ -425,6 +428,24 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
    return f, handles


+class BytesZipFile(ZipFile, BytesIO):


I personally like this location. I would keep it here.

codecov · 2018-03-18T12:19:37Z

Codecov Report

Merging #20394 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20394      +/-   ##
==========================================
+ Coverage   91.77%   91.79%   +0.01%     
==========================================
  Files         152      152              
  Lines       49205    49233      +28     
==========================================
+ Hits        45159    45191      +32     
+ Misses       4046     4042       -4

Flag	Coverage Δ
#multiple	`90.17% <100%> (+0.01%)`	⬆️
#single	`41.84% <19.23%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/generic.py	`95.85% <ø> (ø)`	⬆️
pandas/core/frame.py	`97.18% <ø> (ø)`	⬆️
pandas/util/testing.py	`84.52% <ø> (+0.57%)`	⬆️
pandas/core/series.py	`93.84% <ø> (ø)`	⬆️
pandas/io/formats/csvs.py	`98.13% <100%> (+0.08%)`	⬆️
pandas/io/common.py	`70.04% <100%> (+1.26%)`	⬆️
pandas/core/window.py	`96.26% <0%> (-0.01%)`	⬇️
pandas/core/panel.py	`97.29% <0%> (ø)`	⬆️
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7273ea0...ebd8e6f.

minggli · 2018-03-18T15:18:32Z

ensure_clean() somehow fails to randomize on one of the 2.7 configurations with xdist. strange.

pickle seems to work differently in Python 2 and there does not seem to be a way around it, so zip compression is made available only for Python >= 3.x.

minggli · 2018-03-19T18:29:04Z

zip compression for pickle should now work in Python 2 as well as Python 3, and so does zip compression for json.

However, csv zip compression works only in Python 3, not Python 2.

jreback · 2018-03-20T00:03:57Z

pandas/tests/frame/test_to_csv.py


        df = DataFrame([[0.123456, 0.234567, 0.567567],
                        [12.32112, 123123.2, 321321.2]],
                       index=['A', 'B'], columns=['X', 'Y', 'Z'])

+        if PY2 and compression == 'zip':
+            pytest.xfail(reason='zip compression for csv not suppported in'


this should be a skip

thanks for the comment. now it should not have to skip or xfail test_to_csv.

jreback · 2018-03-20T00:04:09Z

pandas/tests/frame/test_to_csv.py

                assert_frame_equal(df, read_csv(fh, index_col=0))

+    @pytest.mark.xfail(reason='zip compression is now supported for csv.')


why are you xfailing this?

this is an old test case that asserts a BadZipFile exception is raised when zip compression was not supported. it will now fail because that exception is no longer raised. this test case is now redundant and was removed in 04886e9
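To illustrate why such an assertion stops holding once zip output is really written, here is a small standard-library sketch. It is not the removed test itself: the helpers `raises_bad_zip` and `zipped` are hypothetical, and the sketch assumes the old behaviour amounted to producing a payload that was not a valid zip archive.

```python
import io
import zipfile


def raises_bad_zip(payload):
    """Return True if payload (bytes) cannot be opened as a zip archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(payload)):
            return False
    except zipfile.BadZipFile:
        return True


def zipped(data, member_name="data.csv"):
    """Return data (bytes) wrapped in a one-member zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, data)
    return buffer.getvalue()
```

Plain CSV bytes make `raises_bad_zip` return True, while the same bytes passed through `zipped` do not, so a test that expects the exception can only pass while the writer fails to produce a real archive.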

jreback · 2018-03-20T00:04:20Z

pandas/tests/io/json/test_compression.py

            result = fh.read().decode('utf8')
        assert_frame_equal(df, pd.read_json(result))


+@pytest.mark.xfail(reason='zip compression is now supported for json.')


why are you xfailing this?

same as above.

jreback · 2018-03-20T00:04:32Z

pandas/tests/series/test_io.py


        s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
                   name='X')

+        if PY2 and compression == 'zip':
+            pytest.xfail(reason='zip compression for csv not suppported in'


this skip or xfail is no longer needed to handle zip compression (write) in Python 2.

jreback · 2018-03-20T00:04:54Z

pandas/tests/series/test_io.py

                             index_col=0, squeeze=True)
            assert_series_equal(s, rs)

            # explicitly ensure file was compressed
-            with tm.decompress_file(filename, compression_no_zip) as fh:


are there any uses of the compression_no_zip fixture left?

I don't think so, the compression_no_zip fixture is solely for excluding zip compression in tests because writing zip compression had not been implemented.
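The "explicitly ensure file was compressed" check in the hunk above relies on a helper that opens a file with a given compression. A rough standard-library sketch of such a helper follows. The name `open_compressed` and its exact behaviour are assumptions for illustration, not the implementation of `tm.decompress_file`.

```python
import bz2
import contextlib
import gzip
import lzma
import zipfile


@contextlib.contextmanager
def open_compressed(path, compression):
    """Yield a binary read handle for path, decompressing on the fly."""
    archive = None
    if compression is None:
        handle = open(path, "rb")
    elif compression == "gzip":
        handle = gzip.open(path, "rb")
    elif compression == "bz2":
        handle = bz2.open(path, "rb")
    elif compression == "xz":
        handle = lzma.open(path, "rb")
    elif compression == "zip":
        archive = zipfile.ZipFile(path)
        names = archive.namelist()
        if len(names) != 1:
            archive.close()
            raise ValueError("expected exactly one file in the zip archive")
        handle = archive.open(names[0])
    else:
        raise ValueError("unrecognized compression: {!r}".format(compression))
    try:
        yield handle
    finally:
        # Close the member handle first, then the enclosing archive if any.
        handle.close()
        if archive is not None:
            archive.close()
```

With zip handled alongside the other formats, a test can loop over every compression value and read the written file back through one code path, which is what makes a separate no-zip fixture unnecessary.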

jreback · 2018-03-20T00:05:17Z

pandas/io/common.py

@@ -425,6 +428,18 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
    return f, handles


+class BytesZipFile(ZipFile, BytesIO):


can you add a little bit more to this class docstring, e.g. why it's needed.

added. we currently don't have the ability to write zip compressed pickle, json, csv, only to read them. the standard library ZipFile isn't exactly designed to produce a writable file handle, hence the custom class.

jreback · 2018-03-20T10:21:53Z

pandas/io/formats/csvs.py

@@ -150,6 +150,16 @@ def save(self):

            self._save()

+            # GH 17778 handles compression for byte strings.


why are you handling this here and not in the finally?

moved to finally.
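As a rough illustration of handling the compression step in a finally block, the sketch below writes CSV text into an in-memory buffer and only then stores the encoded bytes in a zip archive. This is an assumed shape, not the code in pandas/io/formats/csvs.py: the function `save_rows_zipped` and its parameters are hypothetical.

```python
import csv
import io
import zipfile


def save_rows_zipped(rows, path, member_name="data.csv", encoding="utf-8"):
    """Write rows as CSV text, then store the encoded bytes in a zip archive."""
    buffer = io.StringIO()
    try:
        # The csv module emits text, so it writes into a text buffer first.
        csv.writer(buffer, lineterminator="\n").writerows(rows)
    finally:
        # Cleanup path: whatever text was produced is encoded to bytes and
        # stored in the archive, and the buffer is released even if the
        # writer raised part-way through.
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
            archive.writestr(member_name, buffer.getvalue().encode(encoding))
        buffer.close()
```

Keeping the byte-level step in the cleanup path means there is a single place where handles are closed, at the cost of possibly writing a partial archive when the writer fails.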

jreback · 2018-03-20T10:22:35Z

pandas/io/pickle.py

@@ -62,7 +62,6 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
    2    2    7
    3    3    8
    4    4    9
-


can you add back the blank lines you removed

added back.

minggli · 2018-03-20T12:57:06Z

zip compression should now work for both reading and writing pickle, json, and csv in Python 2/3.

@jreback all changes implemented. further comments welcome.

minggli · 2018-03-21T22:35:35Z

@jreback @jorisvandenbossche any comments on this PR?

gfyoung · 2018-03-21T22:47:21Z

pandas/core/frame.py

-            allowed values are 'gzip', 'bz2', 'xz',
-            only used when the first argument is a filename
+            allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
+            first argument is a filename.


Let's fix this parameter description a bit:

a string representing the compression to use in the output file. Allowed values are 'gzip', 'bz2', 'zip', 'xz'. This input is only used when the first argument is a filename.

gfyoung · 2018-03-21T22:50:14Z

pandas/core/series.py

-            allowed values are 'gzip', 'bz2', 'xz', only used when the first
-            argument is a filename
+            allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
+            first argument is a filename


Let's fix the docstring here as I suggested for frame.py.

minggli · 2018-03-22T20:38:06Z

@gfyoung @jreback any comments?

jreback · 2018-03-22T23:12:18Z

thanks @minggli keep em coming!

minggli added 3 commits March 17, 2018 19:01

initial commit

ccfd240

add zip to compression

fd7362c

add zip to compression in to_pickle

c570091

minggli changed the title ~~EHN: allow to_pickle to produce zip compressed pickle~~ EHN: allow zip compression in to_pickle Mar 17, 2018

minggli added 3 commits March 17, 2018 23:36

inherit io.BufferedIOBase

ec712b9

xfail test_compress_zip_value_error

bf271ce

add zip in compression parameter description

113db83

minggli changed the title ~~EHN: allow zip compression in to_pickle~~ EHN: allow zip compression in to_pickle, to_json, to_csv Mar 18, 2018

minggli added 5 commits March 18, 2018 00:48

xfail test_to_csv_compression_value_error

9b9e5d1

include zip in all tests

dedb853

move BytesZipFile out of _get_handle

dfa9913

inherit BytesIO

67b9727

restore import pattern

ecdf5a2

gfyoung added the Bug and IO Data labels Mar 18, 2018

gfyoung reviewed Mar 18, 2018

attributes already implemented in Base class

b9fab3c

minggli added 5 commits March 18, 2018 12:45

add zip in compression parameter description

5c5c161

prevent writing duplicates

d072ca8

prevent writing duplicates

cecb0ac

add whatsnew entry in Other Enhancement

ed189c4

revert prevent duplicate

4ac9488

minggli force-pushed the bugfix/to_pickle branch from 1b5fc77 to 4ac9488 Compare March 18, 2018 16:25

minggli added 5 commits March 18, 2018 22:05

xfail zip compression csv pickle in python 2.x

694c6b5

xfail zip compression csv pickle in python 2.x

80992a3

writing zip compression not supported in Python 2

3288691

compression parameter descriptions

272c6e7

compression parameter descriptions

d35b6af

minggli added 2 commits March 19, 2018 17:25

revert conftest

dd958ac

tests xfail on csv zip compression in Python 2

2956103

jreback requested changes Mar 20, 2018

minggli added 5 commits March 20, 2018 09:34

handle csv compression seperately

63890ec

revert xfail on tests csv

e4966be

decommission compression_no_zip

437d716

remove value error test cases now that zip compression is supported for csv and json

04886e9

update whatsnew

099993c

jreback requested changes Mar 20, 2018

minggli added 3 commits March 20, 2018 10:56

docstring for BytesZipFile

6aa1493

add back blank lines

129a55a

move csv compression seperately

4531c78

minggli closed this Mar 20, 2018

minggli reopened this Mar 20, 2018

gfyoung reviewed Mar 21, 2018

parameter description

ebd8e6f

jreback added this to the 0.23.0 milestone Mar 22, 2018

jreback approved these changes Mar 22, 2018

jreback merged commit 76534d5 into pandas-dev:master Mar 22, 2018

minggli deleted the bugfix/to_pickle branch March 22, 2018 23:58

javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018

EHN: allow zip compression in to_pickle, to_json, to_csv (pandas-dev#20394)

189dd8e

dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018

EHN: allow zip compression in to_pickle, to_json, to_csv (pandas-dev#20394)

8de5c79

kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018

EHN: allow zip compression in to_pickle, to_json, to_csv (pandas-dev#20394)

29b0bab

minggli mentioned this pull request Jul 26, 2018

DOC: consistent docstring for compression kwarg #22066

Merged

3 tasks
