EHN: allow zip compression in to_pickle, to_json, to_csv #20394

Merged: 39 commits, merged on Mar 22, 2018
Changes from 30 commits

Commits (39)
ccfd240
initial commit
minggli Mar 17, 2018
fd7362c
add zip to compression
minggli Mar 17, 2018
c570091
add zip to compression in to_pickle
minggli Mar 17, 2018
ec712b9
inherit io.BufferedIOBase
minggli Mar 17, 2018
bf271ce
xfail test_compress_zip_value_error
minggli Mar 17, 2018
113db83
add zip in compression parameter description
minggli Mar 18, 2018
9b9e5d1
xfail test_to_csv_compression_value_error
minggli Mar 18, 2018
dedb853
include zip in all tests
minggli Mar 18, 2018
dfa9913
move BytesZipFile out of _get_handle
minggli Mar 18, 2018
67b9727
inherit BytesIO
minggli Mar 18, 2018
ecdf5a2
restore import pattern
minggli Mar 18, 2018
b9fab3c
attributes already implemented in Base class
minggli Mar 18, 2018
5c5c161
add zip in compression parameter description
minggli Mar 18, 2018
d072ca8
prevent writing duplicates
minggli Mar 18, 2018
cecb0ac
prevent writing duplicates
minggli Mar 18, 2018
ed189c4
add whatsnew entry in Other Enhancement
minggli Mar 18, 2018
4ac9488
revert prevent duplicate
minggli Mar 18, 2018
694c6b5
xfail zip compression csv pickle in python 2.x
minggli Mar 18, 2018
80992a3
xfail zip compression csv pickle in python 2.x
minggli Mar 18, 2018
3288691
writing zip compression not supported in Python 2
minggli Mar 18, 2018
272c6e7
compression parameter descriptions
minggli Mar 18, 2018
d35b6af
compression parameter descriptions
minggli Mar 18, 2018
c6034b4
skip zip in Python 2
minggli Mar 18, 2018
71d9979
revert tests xfail
minggli Mar 18, 2018
4c87e0f
update whatsnew
minggli Mar 18, 2018
fd44980
fix compat import
minggli Mar 18, 2018
ab7a7b7
enable zip compression for Python 2 by avoid pickle.dump
minggli Mar 19, 2018
cfd0715
remove descriptinos zip only supported by Python3
minggli Mar 19, 2018
dd958ac
revert conftest
minggli Mar 19, 2018
2956103
tests xfail on csv zip compression in Python 2
minggli Mar 19, 2018
63890ec
handle csv compression seperately
minggli Mar 20, 2018
e4966be
revert xfail on tests csv
minggli Mar 20, 2018
437d716
decommission compression_no_zip
minggli Mar 20, 2018
04886e9
remove value error test cases now that zip compression is supported f…
minggli Mar 20, 2018
099993c
update whatsnew
minggli Mar 20, 2018
6aa1493
docstring for BytesZipFile
minggli Mar 20, 2018
129a55a
add back blank lines
minggli Mar 20, 2018
4531c78
move csv compression seperately
minggli Mar 20, 2018
ebd8e6f
parameter description
minggli Mar 22, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -344,6 +344,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports it, rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to control whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- zip compression is supported via ``compression='zip'`` for Python >= 3 in :func:`DataFrame.to_pickle`, :func:`Series.to_pickle`, :func:`DataFrame.to_csv`, :func:`Series.to_csv`, :func:`DataFrame.to_json`, :func:`Series.to_json`. (:issue:`17778`)

.. _whatsnew_0230.api_breaking:

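For reference, the capability described in this whatsnew entry can be exercised roughly like this (a sketch on a recent pandas; the file name and data are illustrative):

```python
import os
import tempfile
import zipfile

import pandas as pd

df = pd.DataFrame({'X': [0.123456, 12.32112], 'Y': [0.234567, 123123.2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.csv.zip')
    # explicit compression='zip'; 'infer' would also pick it from the extension
    df.to_csv(path, compression='zip')
    assert zipfile.is_zipfile(path)  # the output really is a ZIP archive
    roundtripped = pd.read_csv(path, compression='zip', index_col=0)
```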
4 changes: 2 additions & 2 deletions pandas/core/frame.py
@@ -1655,8 +1655,8 @@ def to_csv(self, path_or_buf=None, sep=",", na_rep='', float_format=None,
defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
compression : string, optional
a string representing the compression to use in the output file,
allowed values are 'gzip', 'bz2', 'xz',
only used when the first argument is a filename
allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
first argument is a filename.
Member:

Let's fix this parameter description a bit:

a string representing the compression to use in the output file.
Allowed values are 'gzip', 'bz2', 'zip', 'xz'.  This input is only used
when the first argument is a filename.

Contributor Author:

done.

line_terminator : string, default ``'\n'``
The newline character or character sequence to use in the output
file
8 changes: 4 additions & 4 deletions pandas/core/generic.py
@@ -1814,9 +1814,9 @@ def to_json(self, path_or_buf=None, orient=None, date_format=None,

.. versionadded:: 0.19.0

compression : {None, 'gzip', 'bz2', 'xz'}
compression : {None, 'gzip', 'bz2', 'zip', 'xz'}
A string representing the compression to use in the output file,
only used when the first argument is a filename
only used when the first argument is a filename.

.. versionadded:: 0.21.0

@@ -2085,7 +2085,8 @@ def to_pickle(self, path, compression='infer',
----------
path : str
File path where the pickled object will be stored.
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, \
default 'infer'
A string representing the compression to use in the output file. By
default, infers from the file extension in specified path.

@@ -2129,7 +2130,6 @@ def to_pickle(self, path, compression='infer',
2 2 7
3 3 8
4 4 9

>>> import os
>>> os.remove("./dummy.pkl")
"""
4 changes: 2 additions & 2 deletions pandas/core/series.py
@@ -3633,8 +3633,8 @@ def to_csv(self, path=None, index=True, sep=",", na_rep='',
non-ascii, for python versions prior to 3
compression : string, optional
a string representing the compression to use in the output file,
allowed values are 'gzip', 'bz2', 'xz', only used when the first
argument is a filename
allowed values are 'gzip', 'bz2', 'zip', 'xz', only used when the
first argument is a filename
Member:

Let's fix the docstring here as I suggested for frame.py.

Contributor Author:

done.

date_format: string, default None
Format string for datetime objects.
decimal: string, default '.'
39 changes: 27 additions & 12 deletions pandas/io/common.py
@@ -5,6 +5,7 @@
import codecs
import mmap
from contextlib import contextmanager, closing
from zipfile import ZipFile

from pandas.compat import StringIO, BytesIO, string_types, text_type
from pandas import compat
@@ -363,18 +364,20 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,

# ZIP Compression
elif compression == 'zip':
import zipfile
zip_file = zipfile.ZipFile(path_or_buf)
zip_names = zip_file.namelist()
if len(zip_names) == 1:
f = zip_file.open(zip_names.pop())
elif len(zip_names) == 0:
raise ValueError('Zero files found in ZIP file {}'
.format(path_or_buf))
else:
raise ValueError('Multiple files found in ZIP file.'
' Only one file per ZIP: {}'
.format(zip_names))
zf = BytesZipFile(path_or_buf, mode)
if zf.mode == 'w':
f = zf
elif zf.mode == 'r':
zip_names = zf.namelist()
if len(zip_names) == 1:
f = zf.open(zip_names.pop())
elif len(zip_names) == 0:
raise ValueError('Zero files found in ZIP file {}'
.format(path_or_buf))
else:
raise ValueError('Multiple files found in ZIP file.'
' Only one file per ZIP: {}'
.format(zip_names))

# XZ Compression
elif compression == 'xz':
@@ -425,6 +428,18 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
return f, handles


class BytesZipFile(ZipFile, BytesIO):
Member (@gfyoung, Mar 18, 2018):

I personally like this location. I would keep it here.

Contributor:

can you add a little bit more to this class's docstring, e.g. why it's needed.

Contributor Author (@minggli, Mar 20, 2018):

added. we currently can only read zip-compressed pickle, json and csv, not write them. The standard library ZipFile isn't designed to produce a writable file handle, hence the custom class.

"""override write method with writestr to accept bytes."""
# GH 17778
def __init__(self, file, mode='r', **kwargs):
if mode in ['wb', 'rb']:
mode = mode.replace('b', '')
super(BytesZipFile, self).__init__(file, mode, **kwargs)

def write(self, data):
super(BytesZipFile, self).writestr(self.filename, data)


class MMapWrapper(BaseIterator):
"""
Wrapper for the Python's mmap class so that it can be properly read in
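A stdlib-only sketch of the idea behind `BytesZipFile` above. It is simplified: it inherits only `ZipFile` and uses a fixed, illustrative member name, whereas the PR also inherits `BytesIO` and names the member after `self.filename`:

```python
import io
import pickle
import zipfile


class BytesZipFile(zipfile.ZipFile):
    """ZipFile has no bytes-accepting write(), so route write() to writestr()."""

    def __init__(self, file, mode='r', **kwargs):
        if mode in ('wb', 'rb'):       # normalize binary modes for ZipFile
            mode = mode.replace('b', '')
        super().__init__(file, mode, **kwargs)

    def write(self, data):
        # NOTE: this shadows ZipFile.write (which copies a file from disk);
        # 'payload.pkl' is an illustrative member name, not the PR's choice.
        self.writestr('payload.pkl', data)


buf = io.BytesIO()
with BytesZipFile(buf, mode='wb') as zf:
    zf.write(pickle.dumps({'a': 1}))   # one write call -> one archive member

with zipfile.ZipFile(buf) as zf:       # read back with a plain ZipFile
    members = zf.namelist()
    restored = pickle.loads(zf.read('payload.pkl'))
```

This mirrors the read path in `_get_handle`, which insists on exactly one member per archive.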
8 changes: 3 additions & 5 deletions pandas/io/pickle.py
@@ -18,7 +18,7 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
Any python object.
path : str
File path where the pickled object will be stored.
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
A string representing the compression to use in the output file. By
default, infers from the file extension in specified path.

@@ -62,7 +62,6 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
2 2 7
3 3 8
4 4 9

Contributor:

can you add back the blank lines you removed

Contributor Author:

added back.

>>> import os
>>> os.remove("./dummy.pkl")
"""
@@ -74,7 +73,7 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
if protocol < 0:
protocol = pkl.HIGHEST_PROTOCOL
try:
pkl.dump(obj, f, protocol=protocol)
f.write(pkl.dumps(obj, protocol=protocol))
finally:
for _f in fh:
_f.close()
@@ -93,7 +92,7 @@ def read_pickle(path, compression='infer'):
----------
path : str
File path where the pickled object will be loaded.
compression : {'infer', 'gzip', 'bz2', 'xz', 'zip', None}, default 'infer'
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use
gzip, bz2, xz or zip if path ends in '.gz', '.bz2', '.xz',
or '.zip' respectively, and no decompression otherwise.
@@ -133,7 +132,6 @@ def read_pickle(path, compression='infer'):
2 2 7
3 3 8
4 4 9

>>> import os
>>> os.remove("./dummy.pkl")
"""
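The `pkl.dump(obj, f)` to `f.write(pkl.dumps(obj))` change in this file matters because a write-through-`writestr` handle turns every `write()` call into a new archive member, and `pickle.dump` may issue several small writes. Serializing to one bytes object first guarantees a single member. A stdlib sketch of the hazard (member name and data are illustrative):

```python
import io
import pickle
import zipfile

# Naive path: two writestr calls under the same name -> two archive members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zf:
    zf.writestr('data.pkl', b'chunk-1')
    zf.writestr('data.pkl', b'chunk-2')   # emits a "Duplicate name" warning
with zipfile.ZipFile(buf) as zf:
    duplicated = zf.namelist()            # both members are listed

# Safe path: serialize once, write once -> exactly one member.
buf = io.BytesIO()
payload = pickle.dumps({'a': 1})
with zipfile.ZipFile(buf, mode='w') as zf:
    zf.writestr('data.pkl', payload)
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    restored = pickle.loads(zf.read('data.pkl'))
```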
17 changes: 11 additions & 6 deletions pandas/tests/frame/test_to_csv.py
@@ -8,7 +8,7 @@
from numpy import nan
import numpy as np

from pandas.compat import (lmap, range, lrange, StringIO, u)
from pandas.compat import (lmap, range, lrange, StringIO, u, PY2)
import pandas.core.common as com
from pandas.errors import ParserError
from pandas import (DataFrame, Index, Series, MultiIndex, Timestamp,
@@ -919,30 +919,35 @@ def test_to_csv_path_is_none(self):
recons = pd.read_csv(StringIO(csv_str), index_col=0)
assert_frame_equal(self.frame, recons)

def test_to_csv_compression(self, compression_no_zip):
def test_to_csv_compression(self, compression):

df = DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
index=['A', 'B'], columns=['X', 'Y', 'Z'])

if PY2 and compression == 'zip':
pytest.xfail(reason='zip compression for csv not supported in '
Contributor:

this should be a skip

Contributor Author:

thanks for the comment. now it should not have to skip or xfail test_to_csv.

'Python 2')

with ensure_clean() as filename:

df.to_csv(filename, compression=compression_no_zip)
df.to_csv(filename, compression=compression)

# test the round trip - to_csv -> read_csv
rs = read_csv(filename, compression=compression_no_zip,
rs = read_csv(filename, compression=compression,
index_col=0)
assert_frame_equal(df, rs)

# explicitly make sure file is compressed
with tm.decompress_file(filename, compression_no_zip) as fh:
with tm.decompress_file(filename, compression) as fh:
text = fh.read().decode('utf8')
for col in df.columns:
assert col in text

with tm.decompress_file(filename, compression_no_zip) as fh:
with tm.decompress_file(filename, compression) as fh:
assert_frame_equal(df, read_csv(fh, index_col=0))

@pytest.mark.xfail(reason='zip compression is now supported for csv.')
Contributor:

why are you xfailing this?

Contributor Author:

this is an old test case that asserted a BadZipFile exception was raised when zip compression was not supported. It now fails because that exception is no longer raised; the test case is redundant and was removed in 04886e9

def test_to_csv_compression_value_error(self):
# GH7615
# use the compression kw in to_csv
27 changes: 14 additions & 13 deletions pandas/tests/io/json/test_compression.py
@@ -5,22 +5,23 @@
from pandas.util.testing import assert_frame_equal, assert_raises_regex


def test_compression_roundtrip(compression_no_zip):
def test_compression_roundtrip(compression):
df = pd.DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
index=['A', 'B'], columns=['X', 'Y', 'Z'])

with tm.ensure_clean() as path:
df.to_json(path, compression=compression_no_zip)
df.to_json(path, compression=compression)
assert_frame_equal(df, pd.read_json(path,
compression=compression_no_zip))
compression=compression))

# explicitly ensure file was compressed.
with tm.decompress_file(path, compression_no_zip) as fh:
with tm.decompress_file(path, compression) as fh:
result = fh.read().decode('utf8')
assert_frame_equal(df, pd.read_json(result))


@pytest.mark.xfail(reason='zip compression is now supported for json.')
Contributor:

why are you xfailing this?

Contributor Author:

same as above.

def test_compress_zip_value_error():
df = pd.DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
@@ -41,7 +42,7 @@ def test_read_zipped_json():
assert_frame_equal(uncompressed_df, compressed_df)


def test_with_s3_url(compression_no_zip):
def test_with_s3_url(compression):
boto3 = pytest.importorskip('boto3')
pytest.importorskip('s3fs')
moto = pytest.importorskip('moto')
@@ -52,35 +53,35 @@ def test_with_s3_url(compression_no_zip):
bucket = conn.create_bucket(Bucket="pandas-test")

with tm.ensure_clean() as path:
df.to_json(path, compression=compression_no_zip)
df.to_json(path, compression=compression)
with open(path, 'rb') as f:
bucket.put_object(Key='test-1', Body=f)

roundtripped_df = pd.read_json('s3://pandas-test/test-1',
compression=compression_no_zip)
compression=compression)
assert_frame_equal(df, roundtripped_df)


def test_lines_with_compression(compression_no_zip):
def test_lines_with_compression(compression):

with tm.ensure_clean() as path:
df = pd.read_json('{"a": [1, 2, 3], "b": [4, 5, 6]}')
df.to_json(path, orient='records', lines=True,
compression=compression_no_zip)
compression=compression)
roundtripped_df = pd.read_json(path, lines=True,
compression=compression_no_zip)
compression=compression)
assert_frame_equal(df, roundtripped_df)


def test_chunksize_with_compression(compression_no_zip):
def test_chunksize_with_compression(compression):

with tm.ensure_clean() as path:
df = pd.read_json('{"a": ["foo", "bar", "baz"], "b": [4, 5, 6]}')
df.to_json(path, orient='records', lines=True,
compression=compression_no_zip)
compression=compression)

res = pd.read_json(path, lines=True, chunksize=1,
compression=compression_no_zip)
compression=compression)
roundtripped_df = pd.concat(res)
assert_frame_equal(df, roundtripped_df)

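The json round trips these tests exercise look roughly like this (a sketch on a recent pandas; path and data are illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.json.zip')
    df.to_json(path, compression='zip')                    # write zipped json
    roundtripped = pd.read_json(path, compression='zip')   # read it back
```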
6 changes: 3 additions & 3 deletions pandas/tests/io/test_pickle.py
@@ -352,7 +352,7 @@ def compress_file(self, src_path, dest_path, compression):
f.write(fh.read())
f.close()

def test_write_explicit(self, compression_no_zip, get_random_path):
def test_write_explicit(self, compression, get_random_path):
base = get_random_path
path1 = base + ".compressed"
path2 = base + ".raw"
@@ -361,10 +361,10 @@ def test_write_explicit(self, compression_no_zip, get_random_path):
df = tm.makeDataFrame()

# write to compressed file
df.to_pickle(p1, compression=compression_no_zip)
df.to_pickle(p1, compression=compression)

# decompress
with tm.decompress_file(p1, compression=compression_no_zip) as f:
with tm.decompress_file(p1, compression=compression) as f:
with open(p2, "wb") as fh:
fh.write(f.read())

16 changes: 10 additions & 6 deletions pandas/tests/series/test_io.py
@@ -10,7 +10,7 @@

from pandas import Series, DataFrame

from pandas.compat import StringIO, u
from pandas.compat import StringIO, u, PY2
from pandas.util.testing import (assert_series_equal, assert_almost_equal,
assert_frame_equal, ensure_clean)
import pandas.util.testing as tm
@@ -138,26 +138,30 @@ def test_to_csv_path_is_none(self):
csv_str = s.to_csv(path=None)
assert isinstance(csv_str, str)

def test_to_csv_compression(self, compression_no_zip):
def test_to_csv_compression(self, compression):

s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
name='X')

if PY2 and compression == 'zip':
pytest.xfail(reason='zip compression for csv not supported in '
Contributor:

skip

Contributor Author:

this skip or xfail is no longer needed to handle zip compression (write) in Python 2.

'Python 2')

with ensure_clean() as filename:

s.to_csv(filename, compression=compression_no_zip, header=True)
s.to_csv(filename, compression=compression, header=True)

# test the round trip - to_csv -> read_csv
rs = pd.read_csv(filename, compression=compression_no_zip,
rs = pd.read_csv(filename, compression=compression,
index_col=0, squeeze=True)
assert_series_equal(s, rs)

# explicitly ensure file was compressed
with tm.decompress_file(filename, compression_no_zip) as fh:
Contributor:

are there any uses of the compression_no_zip fixture left?

Contributor Author:

I don't think so, the compression_no_zip fixture is solely for excluding zip compression in tests because writing zip compression had not been implemented.

with tm.decompress_file(filename, compression) as fh:
text = fh.read().decode('utf8')
assert s.name in text

with tm.decompress_file(filename, compression_no_zip) as fh:
with tm.decompress_file(filename, compression) as fh:
assert_series_equal(s, pd.read_csv(fh,
index_col=0,
squeeze=True))
2 changes: 1 addition & 1 deletion pandas/util/testing.py
@@ -172,7 +172,7 @@ def decompress_file(path, compression):
path : str
The path where the file is read from

compression : {'gzip', 'bz2', 'xz', None}
compression : {'gzip', 'bz2', 'zip', 'xz', None}
Name of the decompression to use

Returns
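A hypothetical stdlib sketch of what a `decompress_file` helper like the one documented above might look like with a `'zip'` branch added. This is not the pandas implementation; names and error text are illustrative, and the one-member-per-archive check mirrors the rule enforced in `_get_handle`:

```python
import bz2
import gzip
import os
import tempfile
import zipfile
from contextlib import contextmanager


@contextmanager
def decompress_file(path, compression):
    """Open `path`, decompressing with the named scheme, and yield the handle."""
    if compression == 'gzip':
        f = gzip.open(path, 'rb')
    elif compression == 'bz2':
        f = bz2.BZ2File(path, 'rb')
    elif compression == 'zip':
        zf = zipfile.ZipFile(path)
        names = zf.namelist()
        if len(names) != 1:
            raise ValueError('ZIP archive must contain exactly one file')
        f = zf.open(names[0])
    else:
        f = open(path, 'rb')
    try:
        yield f
    finally:
        f.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'one.zip')
    with zipfile.ZipFile(path, 'w') as zf:
        zf.writestr('payload.txt', b'hello')
    with decompress_file(path, 'zip') as fh:
        payload = fh.read()
```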