Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default to_* methods to compression='infer' #22011

Merged
merged 40 commits into from
Aug 1, 2018
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
8689167
Default to_csv & to_json to compression='infer'
dhimmel Jul 21, 2018
3ccfb00
to_json compression=infer in pandas/core/generic.py
dhimmel Jul 21, 2018
648bf4d
Simplify CSVFormatter.save
dhimmel Jul 21, 2018
be724fa
Exploratory commit of what CSVFormatter.save should look like
dhimmel Jul 21, 2018
9fe27c9
fixup! Simplify CSVFormatter.save
dhimmel Jul 23, 2018
65f0689
"Revert changes not related to compression default
dhimmel Jul 23, 2018
868e671
TST: test to_csv infers compression by default
dhimmel Jul 23, 2018
c3b76ee
Debugging print statements
dhimmel Jul 23, 2018
cebc0d9
Debugging: use logging rather than print
dhimmel Jul 23, 2018
8411eb2
_infer_compression in CSVFormatter
dhimmel Jul 26, 2018
c098c8f
CSVFormatter: process encoding in init for consistency
dhimmel Jul 26, 2018
2f6601d
TST + DOC: test_compression_warning docstring
dhimmel Jul 26, 2018
eb7f9b5
fixup! CSVFormatter: process encoding in init for consistency
dhimmel Jul 26, 2018
d4a5c90
Tests passing: remove debugging
dhimmel Jul 26, 2018
abd19e3
Parametrized test for compression='infer' is default
dhimmel Jul 26, 2018
2f670fe
Default compression='infer' in series.to_csv
dhimmel Jul 26, 2018
aa9ce13
What's New Entry for v0.24.0
dhimmel Jul 26, 2018
a6aabad
Remove unused tmpdir fixture argument
dhimmel Jul 26, 2018
8a0c97e
Update to_json docstring
dhimmel Jul 26, 2018
6be808d
Change test docstrings to comments
dhimmel Jul 26, 2018
63e6591
Consolidate testing to a single parametrized test
dhimmel Jul 26, 2018
fadb943
Split test_compression_defaults_to_infer into Series & DataFrame tests
dhimmel Jul 26, 2018
0edffc7
Parametrize write_kwargs
dhimmel Jul 26, 2018
97f5de5
Fix kwargs in test_series_compression_defaults_to_infer
dhimmel Jul 26, 2018
83bc0a8
Attempt to fix CSV series roundtrip
dhimmel Jul 26, 2018
874a4bf
Fix test failure
dhimmel Jul 26, 2018
14c3945
Python 2 flake8 error
dhimmel Jul 26, 2018
9a4dc41
Reduce / remove comments
dhimmel Jul 27, 2018
25bdb4c
Merge master: fix zip-docs conflict
dhimmel Jul 29, 2018
1ba8f3a
DOC: versionchanged & tweaks
dhimmel Jul 30, 2018
24e051e
Update doc/source/io.rst as needed
dhimmel Jul 30, 2018
387d1d2
Move tests from tests/test_common.py to tests/io/test_common.py
dhimmel Jul 30, 2018
12f14e2
Organize / simplify pandas/tests/test_common.py imports
dhimmel Jul 30, 2018
6db23d9
Ignore flake error needed for test
dhimmel Jul 30, 2018
e3a0f56
fixup! Organize / simplify pandas/tests/test_common.py imports
dhimmel Jul 30, 2018
af8c137
change import: cmn to icom
dhimmel Jul 31, 2018
f8829a6
Blank lines after versionchanged
dhimmel Jul 31, 2018
918c0f8
Move compression tests to new file tests/io/test_compression.py
dhimmel Jul 31, 2018
eadf68e
blank lines before .. versionchanged
dhimmel Jul 31, 2018
cf5b62e
Remove comments and space after GH
dhimmel Aug 1, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -1713,7 +1713,7 @@ def to_panel(self):

def to_csv(self, path_or_buf=None, sep=",", na_rep='', float_format=None,
columns=None, header=True, index=True, index_label=None,
mode='w', encoding=None, compression=None, quoting=None,
mode='w', encoding=None, compression='infer', quoting=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add versionchanged in each of the modified doc-strings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1ba8f3a

quotechar='"', line_terminator='\n', chunksize=None,
tupleize_cols=None, date_format=None, doublequote=True,
escapechar=None, decimal='.'):
Expand Down Expand Up @@ -1748,7 +1748,7 @@ def to_csv(self, path_or_buf=None, sep=",", na_rep='', float_format=None,
encoding : string, optional
A string representing the encoding to use in the output file,
defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default None
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default infer
If 'infer' and `path_or_buf` is path-like, then detect compression
from the following extensions: '.gz', '.bz2' or '.xz'
(otherwise no compression).
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1899,7 +1899,7 @@ def _repr_latex_(self):

def to_json(self, path_or_buf=None, orient=None, date_format=None,
double_precision=10, force_ascii=True, date_unit='ms',
default_handler=None, lines=False, compression=None,
default_handler=None, lines=False, compression='infer',
index=True):
"""
Convert the object to a JSON string.
Expand Down
8 changes: 7 additions & 1 deletion pandas/io/formats/csvs.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ class CSVFormatter(object):
def __init__(self, obj, path_or_buf=None, sep=",", na_rep='',
float_format=None, cols=None, header=True, index=True,
index_label=None, mode='w', nanRep=None, encoding=None,
compression=None, quoting=None, line_terminator='\n',
compression='infer', quoting=None, line_terminator='\n',
chunksize=None, tupleize_cols=False, quotechar='"',
date_format=None, doublequote=True, escapechar=None,
decimal='.'):
Expand Down Expand Up @@ -134,7 +134,13 @@ def save(self):
encoding = self.encoding

# GH 21227 internal compression is not used when file-like passed.
import logging
logging.warning('debug_3: {}'.format(self.compression))
logging.warning('debug_4: {}'.format(self.path_or_buf))
logging.warning(
'debug_5: {}'.format(hasattr(self.path_or_buf, 'write')))
if self.compression and hasattr(self.path_or_buf, 'write'):
logging.warning('debug_6: in loop, should RuntimeWarn')
msg = ("compression has no effect when passing file-like "
"object as input.")
warnings.warn(msg, RuntimeWarning, stacklevel=2)
Expand Down
2 changes: 1 addition & 1 deletion pandas/io/json/json.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
# interface to/from
def to_json(path_or_buf, obj, orient=None, date_format='epoch',
double_precision=10, force_ascii=True, date_unit='ms',
default_handler=None, lines=False, compression=None,
default_handler=None, lines=False, compression='infer',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where to update the to_json docs... didn't see a docstring in this function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert the object to a JSON string.

index=True):

if not index and orient not in ['split', 'table']:
Expand Down
13 changes: 13 additions & 0 deletions pandas/tests/io/formats/test_to_csv.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# -*- coding: utf-8 -*-

import gzip
import sys

import pytest
Expand Down Expand Up @@ -351,3 +352,15 @@ def test_to_csv_compression(self, compression_only,
result = pd.read_csv(path, index_col=0,
compression=read_compression)
tm.assert_frame_equal(result, df)

def test_compression_defaults_to_infer(tmpdir):
"""
Test that to_csv defaults to inferring compression from paths.
https://github.com/pandas-dev/pandas/pull/22011
"""
df = DataFrame({"A": [1]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I suppose this "works" it is definitely focused on gzip and doesn't make any assertions about how the behavior works across a combination of infer, None and an explicit compression.

We have a fixture called compression in conftest.py that covers the various compression options - is there a way to parametrize that as part of this test and make more comprehensive assertions about the potential combinations of behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The possible values that compression can take for to_csv are already tested. The purpose of this test is simply to test that compression is defaulting to 'infer'. I kept the test simple so that it ONLY tests that the default is infer and won't break or malfunction should other aspects of the API change. The test will break if the default for compression is put back to None (and hopefully not for other reasons).

I think it may make sense to make a similar test for to_json, but don't see how testing compression='infer' for all possible compression extensions is within scope of this PR as this PR just changes the default and is not creating the infer option. Is the worry that there is currently inadequate testing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was that this test only makes sure that gzip compression works by default with infer, but how does it guarantee that other compression types play nice with infer? I'll admit it's a subtle distinction but nuances like that do pop up and it doesn't seem like it should be that much more work to leverage the existing compression fixture for this test to increase coverage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be that much more work to leverage the existing compression fixture for this test to increase coverage

In abd19e3, I modified an existing parametrized test to look for compression by default for paths where inference should occur. This actually caught an issue (we hadn't switched default for Series.to_csv -- fixed in 2f670fe).

Should I delete the gzip test? Note the parametrized test doesn't test that the right compression is occurring, just that a compression is occurring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of going about it in this fashion can you not do a round trip with to_csv and read_csv using infer with a file extension for the former and the parametrized value as an argument for compression in the latter? So in pseudo-code:

with tm.ensure_clean('compressed.csv.{}'.format(compression_only)) as path:
    df.to_csv(path)
    result = pd.read_csv(path, compression=compression_only)

tm.assert_frame_equal(result, df)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of the roundtrip approach here, since it never actually tests that the file is compressed on disk. Given that the to_* and read_* methods rely on much of the same compression infrastructure, I think it's possible to modify the code such that all compression gets disabled and the roudtrip works perfectly. Now hopefully there are enough other tests to catch such a situation.

with tm.ensure_clean('compressed.csv.gz') as path:
df.to_csv(path, index=False)
with gzip.open(path, 'rt') as read_file:
lines = read_file.read().splitlines()
assert lines == ['A', '1']
3 changes: 3 additions & 0 deletions pandas/tests/test_common.py
Original file line number Diff line number Diff line change
Expand Up @@ -257,8 +257,11 @@ def test_compression_warning(compression_only):
[12.32112, 123123.2, 321321.2]],
columns=['X', 'Y', 'Z'])
with tm.ensure_clean() as filename:
import logging
logging.warning('debug_1: {}'.format(compression_only))
f, _handles = _get_handle(filename, 'w', compression=compression_only)
with tm.assert_produces_warning(RuntimeWarning,
check_stacklevel=False):
with f:
logging.warning('debug_2: {}'.format(compression_only))
df.to_csv(f, compression=compression_only)