Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EHN: allow zip compression in to_pickle, to_json, to_csv #20394

Merged
merged 39 commits into from
Mar 22, 2018

Conversation

minggli
Copy link
Contributor

@minggli minggli commented Mar 17, 2018

Currently, to_pickle doesn't support compression='zip' whereas read_pickle is able to unpickle zip compressed file. The same applies to to_json, to_csv methods under generic, frame, series.

Standard library ZipFile class default write method requires a filename but pickle.dump(obj, f, protocol) requires f being a file-like object (i.e. file handle) which offers writing bytes. Create a new BytesZipFile class that allows pickle.dump to write bytes into zip file handle.

Now zip compressed objects (pickle, json, csv) are roundtripable with read_pickle, read_json, read_csv.

Need suggestion as to where to put BytesZipFile class which overrides write method with writestr. Other comments welcome.

@minggli minggli changed the title EHN: allow to_pickle to produce zip compressed pickle EHN: allow zip compression in to_pickle Mar 17, 2018
@minggli minggli changed the title EHN: allow zip compression in to_pickle EHN: allow zip compression in to_pickle, to_json, to_csv Mar 18, 2018
@gfyoung gfyoung added Bug IO Data IO issues that don't fit into a more specific label labels Mar 18, 2018
@@ -425,6 +428,24 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
return f, handles


class BytesZipFile(ZipFile, BytesIO):
Copy link
Member

@gfyoung gfyoung Mar 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally like this location. I would keep it here.

@codecov
Copy link

codecov bot commented Mar 18, 2018

Codecov Report

Merging #20394 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20394      +/-   ##
==========================================
+ Coverage   91.77%   91.79%   +0.01%     
==========================================
  Files         152      152              
  Lines       49205    49233      +28     
==========================================
+ Hits        45159    45191      +32     
+ Misses       4046     4042       -4
Flag Coverage Δ
#multiple 90.17% <100%> (+0.01%) ⬆️
#single 41.84% <19.23%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.85% <ø> (ø) ⬆️
pandas/core/frame.py 97.18% <ø> (ø) ⬆️
pandas/util/testing.py 84.52% <ø> (+0.57%) ⬆️
pandas/core/series.py 93.84% <ø> (ø) ⬆️
pandas/io/formats/csvs.py 98.13% <100%> (+0.08%) ⬆️
pandas/io/common.py 70.04% <100%> (+1.26%) ⬆️
pandas/core/window.py 96.26% <0%> (-0.01%) ⬇️
pandas/core/panel.py 97.29% <0%> (ø) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7273ea0...ebd8e6f. Read the comment docs.

@minggli
Copy link
Contributor Author

minggli commented Mar 18, 2018

ensure_clean() somehow fails to randomize on one of the 2.7 configuration with xdist. strange.

when pickle seems to work different in Python 2 and not seem to be way around it. making zip compression only available for Python>=3.x

@minggli
Copy link
Contributor Author

minggli commented Mar 19, 2018

zip compression for pickle now should work in Python 2 now as well as Python 3, so does zip compression for json.

However, csv zip compression works only in Python 3, not Python 2.


df = DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
index=['A', 'B'], columns=['X', 'Y', 'Z'])

if PY2 and compression == 'zip':
pytest.xfail(reason='zip compression for csv not suppported in'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a skip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the comment. now it should not have to skip or xfail test_to_csv.

assert_frame_equal(df, read_csv(fh, index_col=0))

@pytest.mark.xfail(reason='zip compression is now supported for csv.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you xfailing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an old test case that assert raising a BadZipFile exception when zip compression was not supported. so it will now fail the test because it doesn't no longer raise that exception. this test case is now redundant and removed in 04886e9

result = fh.read().decode('utf8')
assert_frame_equal(df, pd.read_json(result))


@pytest.mark.xfail(reason='zip compression is now supported for json.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you xfailing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.


s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
name='X')

if PY2 and compression == 'zip':
pytest.xfail(reason='zip compression for csv not suppported in'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this skip or xfail is no longer needed to handle zip compression (write) in Python 2.

index_col=0, squeeze=True)
assert_series_equal(s, rs)

# explicitly ensure file was compressed
with tm.decompress_file(filename, compression_no_zip) as fh:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any uses of the compression_no_zip fixture left?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, the compression_no_zip fixture is solely for excluding zip compression in tests because writing zip compression had not been implemented.

@@ -425,6 +428,18 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
return f, handles


class BytesZipFile(ZipFile, BytesIO):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a little bit more to this class doc-strings. e.g. why its needed.

Copy link
Contributor Author

@minggli minggli Mar 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added. we currently don't have ability to write zip compressed pickle, json, csv, only read them. standard library ZipFile isn't designed exactly to produce a writable file handle, hence the custom class.

@@ -150,6 +150,16 @@ def save(self):

self._save()

# GH 17778 handles compression for byte strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you handling this here and not in the finally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to finally.

@@ -62,7 +62,6 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
2 2 7
3 3 8
4 4 9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add back the blank lines you removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added back.

@minggli minggli closed this Mar 20, 2018
@minggli minggli reopened this Mar 20, 2018
@minggli
Copy link
Contributor Author

minggli commented Mar 20, 2018

zip compression now should work both read/write for pickle, json, csv in Python 2/3.

@jreback all changes implemented. further comments welcome.

@minggli
Copy link
Contributor Author

minggli commented Mar 21, 2018

@jreback @jorisvandenbossche any comments on this PR?

allowed values are 'gzip', 'bz2', 'xz',
only used when the first argument is a filename
allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
first argument is a filename.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix this parameter description a bit:

a string representing the compression to use in the output file.
Allow values are 'gzip', 'bz2', 'zip', 'xz'.  This input is only used
when the first argument is a filename.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

allowed values are 'gzip', 'bz2', 'xz', only used when the first
argument is a filename
allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
first argument is a filename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's the fix the docstring here as I suggested for frame.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@minggli
Copy link
Contributor Author

minggli commented Mar 22, 2018

@gfyoung @jreback any comments?

@jreback jreback added this to the 0.23.0 milestone Mar 22, 2018
@jreback jreback merged commit 76534d5 into pandas-dev:master Mar 22, 2018
@jreback
Copy link
Contributor

jreback commented Mar 22, 2018

thanks @minggli keep em coming!

@minggli minggli deleted the bugfix/to_pickle branch March 22, 2018 23:58
javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018
dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018
kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants