ENH: Add 'infer' option to compression in _get_handle() #17900
Conversation
is there a related issue?
This is related to #15008, which is a much larger project.
@Dobatymo The general idea looks good to me, but can you add some new tests to confirm the behaviour? (for those read/write functions that now gained the ability to infer the compression type)
Codecov Report
@@ Coverage Diff @@
## master #17900 +/- ##
==========================================
- Coverage 91.23% 91.21% -0.03%
==========================================
Files 163 163
Lines 50105 50105
==========================================
- Hits 45715 45704 -11
- Misses 4390 4401 +11
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #17900 +/- ##
==========================================
- Coverage 91.95% 91.94% -0.01%
==========================================
Files 160 160
Lines 49858 49860 +2
==========================================
Hits 45845 45845
- Misses 4013 4015 +2
Continue to review full report at Codecov.
I didn't find any tests for compression in ...
def test_to_csv_compression(self):
    import gzip
    import bz2
I think we can just import at the top; you're not polluting the namespace, since it isn't a `from ... import`. Also, reference the issue that you mention in the discussion under the function definition.
with bz2.BZ2File(path) as f:
    assert f.read() == exp
def test_to_csv_compression_lzma(self):
Reference the issue that you mention in the discussion under the function definition.
@@ -223,3 +224,46 @@ def test_to_csv_multi_index(self):

        exp = "foo\nbar\n1\n"
        assert df.to_csv(index=False) == exp

    def test_to_csv_compression(self):
follow the style of the read_csv tests for this, IOW you need to parameterize these, skipping as needed
can you rebase and update?
Sorry for the late reply. I rebased, moved the imports, and added a reference to #15008. I checked the tests for ...

EDIT: Ok, I parameterized the tests, but I kept the lzma test separate as I didn't know how to do the skipping with the parameterization included.
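For reference, one way to combine parametrization with conditional skipping (a rough sketch, not the exact code in this PR; it assumes the usual `import pandas.util._test_decorators as td` alias and pytest's `tmp_path` fixture):

```python
import pytest

import pandas as pd
import pandas.util._test_decorators as td


# Each compression type becomes one parametrized case; the xz case carries a
# skip marker so it is skipped automatically when the lzma module is missing.
@pytest.mark.parametrize("compression", [
    "gzip",
    "bz2",
    pytest.param("xz", marks=td.skip_if_no_lzma),
])
def test_to_csv_compression(compression, tmp_path):
    df = pd.DataFrame({"A": [1, 2, 3]})
    path = str(tmp_path / "out.csv")
    df.to_csv(path, compression=compression)
    result = pd.read_csv(path, index_col=0, compression=compression)
    pd.testing.assert_frame_equal(result, df)
```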
Can you add a note in the v0.22.0.txt whatsnew file in the docs?
Further, Travis is failing due to this linting error:
pandas/io/pickle.py:7:1: F401 'pandas.io.common._infer_compression' imported but unused
pandas/core/frame.py (Outdated)
a string representing the compression to use in the output file,
allowed values are 'gzip', 'bz2', 'xz',
only used when the first argument is a filename
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default None
'zip' is missing here?
(unless it is wrongly included in the docstring of get_handle)
I already removed 'zip' from the docstring.
What do you mean exactly? Where did you remove it? I don't know whether zip is working for to_csv, hence the question. I just see that the line below mentions 'zip', but the line above does not. And the get_handle docstring also mentions 'zip'.
Quickly tested, and 'zip' is not supported for writing, only reading. So I think it should be removed from the line below.
Oh, sorry for the misunderstanding. I only removed the first mention of 'zip' from that docstring, not the second one; I missed that. Like you said, zip is only supported for reading, not writing.
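For illustration, a sketch of the asymmetry being discussed, using a pandas version of roughly this era (where `pandas.util.testing` is still available); the file names are made up:

```python
import zipfile

import pandas as pd
import pandas.util.testing as tm

df = pd.DataFrame({"A": [1, 2, 3]})

# Reading a zipped CSV works: read_csv infers 'zip' from the .zip extension,
# as long as the archive contains exactly one file.
with tm.ensure_clean("test.zip") as path:
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("test.csv", df.to_csv())
    roundtrip = pd.read_csv(path, index_col=0)  # compression inferred as 'zip'
    tm.assert_frame_equal(roundtrip, df)

# Writing, e.g. df.to_csv("test.zip", compression="zip"), is not supported at
# this point, which is why 'zip' should not be listed in the to_csv docstring.
```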
# see GH issue 15008

df = DataFrame([1])
exp = "".join((",0", os.linesep, "0,1", os.linesep)).encode("ascii")
why are we encoding as ascii?
I need the result to be bytes, not str, in a way that works on both Python 2 and 3. os.linesep is a byte string in Python 2 and a unicode string in Python 3.
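A small illustration of that point (the `exp` line is the one from the diff above):

```python
import os

# os.linesep is bytes on Python 2 but unicode (str) on Python 3.  Joining the
# pieces as text first and then encoding to ASCII gives bytes on both, which
# is what the comparison against the raw decompressed file contents needs.
exp = "".join((",0", os.linesep, "0,1", os.linesep)).encode("ascii")

print(type(exp))  # <type 'str'> (bytes) on Python 2, <class 'bytes'> on Python 3
print(exp)        # b',0\n0,1\n' on a POSIX system
```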
doc/source/whatsnew/v0.22.0.txt (Outdated)
@@ -44,7 +44,7 @@ Other API Changes
- All-NaN levels in ``MultiIndex`` are now assigned float rather than object dtype, coherently with flat indexes (:issue:`17929`).
- :class:`Timestamp` will no longer silently ignore unused or invalid `tz` or `tzinfo` arguments (:issue:`17690`)
- :class:`CacheableOffset` and :class:`WeekDay` are no longer available in the `tseries.offsets` module (:issue:`17830`)
-
- all methods which use `_get_handle()` internally, now support "infer" as compression argument (eg. `to_csv()`)
Can you rewrite this a bit from a user's standpoint (who does not know about `_get_handle`)? So if it is only `to_csv`, just mention that `to_csv` gained the 'infer' option for the `compression` keyword. (You can also put it in the 'Other enhancements' section instead of API changes.)
doc/source/whatsnew/v0.22.0.txt (Outdated)
@@ -22,7 +22,7 @@ New features
Other Enhancements
^^^^^^^^^^^^^^^^^^

-
- `to_csv()` and `to_json()` now support 'infer' as `compression` argument (was already supported by `to_pickle()` before)
Say "as a compression= kwarg". You don't need to mention pickle; add the issue number. Use :func:`~DataFrame.to_csv` and similarly for to_json.
can you add tests for to_json?
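A rough sketch of what such a test could look like (names and the frame used are illustrative, not necessarily what ended up in the PR):

```python
import pandas as pd
import pandas.util.testing as tm


def test_to_json_compression_roundtrip():
    df = pd.DataFrame({"A": [1, 2, 3]})
    with tm.ensure_clean("test.json.gz") as path:
        # compression is inferred as gzip from the .gz suffix
        df.to_json(path, compression="infer")
        tm.assert_frame_equal(pd.read_json(path, compression="infer"), df)
```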
df = DataFrame([1])
exp = "".join((",0", os.linesep, "0,1", os.linesep)).encode("ascii")

with tm.ensure_clean("test.xz") as path:
Add an additional comparison that reads back these files (you can simply use infer), then assert_frame_equal to the original.
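In other words, something along these lines (a sketch; it uses a string column name so the read-back comparison is exact, unlike the `DataFrame([1])` example above):

```python
import gzip
import os

import pandas as pd
import pandas.util.testing as tm

df = pd.DataFrame({"A": [1]})
exp = "".join((",A", os.linesep, "0,1", os.linesep)).encode("ascii")

with tm.ensure_clean("test.gz") as path:
    df.to_csv(path, compression="gzip")
    # existing check on the raw decompressed payload ...
    with gzip.GzipFile(path) as f:
        assert f.read() == exp
    # ... plus the requested read-back: infer the compression from the
    # extension and compare against the original frame.
    tm.assert_frame_equal(
        pd.read_csv(path, index_col=0, compression="infer"), df)
```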
This is related as well: #17262
can you rebase?
Rebased, squashed commits, and fixed the lzma compat issues that arose from the rebase.
Can you rebase once again? Having some CI issues.
@Dobatymo can you rebase?
@Dobatymo tip: if you rebase and push, we are not notified of that. So if you do that, it is best to put a short comment that you did it (ideally after the CI has passed), so we can merge if possible and don't forget about it until it needs to be rebased again.
@jreback I fixed the merge conflicts. Did you have any further comments?
let me have a look
pls rebase
assert f.read() == exp

tm.assert_frame_equal(pd.read_csv(path, index_col=0,
                                  compression="infer"), df)
rather than hard coding this pls make a new compression fixture
Yep, done.
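For reference, a pared-down sketch of what such a fixture looks like (the real one lives in pandas' top-level conftest.py and may differ in its exact parameter list):

```python
import pytest


@pytest.fixture(params=[None, "gzip", "bz2", "xz"])
def compression(request):
    """Yield each supported value of the compression argument in turn."""
    return request.param


# Any test that accepts the fixture by name is run once per parameter value:
def test_to_csv_compression(compression):
    ...
```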
tm.assert_frame_equal(pd.read_json(path, compression="infer"), df)


@td.skip_if_no_lzma
compression is now a top level fixture so this needs to be updated
Yep, done.
@Dobatymo can you rebase?
@@ -83,6 +83,7 @@ Other Enhancements
- :func:`read_html` copies cell data across ``colspan``s and ``rowspan``s, and it treats all-``th`` table rows as headers if ``header`` kwarg is not given and there is no ``thead`` (:issue:`17054`)
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
- :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
- :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
csv, json, pickle as well?
and this is for both reading and writing?
- Not sure about pickle, as it wasn't tested in the original diff. Can check.
- Nope, just for writing. Reading already has support (why we didn't support it for both reading and writing in the first place befuddles me a little...)
Ha, to_pickle already has support for infer per the docs. Now I'm really confused how we didn't have this for to_csv and to_json...
hmm, isn't infer de facto equivalent to None (i.e. we always infer)? Why do we need an extra option here?
@jreback:
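For context, a short sketch of the practical difference on the write side (the output file name is made up):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# compression=None: the file is written uncompressed, regardless of its name.
df.to_csv("out.csv.gz", compression=None)

# compression='infer': the .gz suffix is detected and gzip is actually used,
# matching what read_csv already does on the read side by default.
df.to_csv("out.csv.gz", compression="infer")
```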
@@ -67,9 +67,8 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
    >>> os.remove("./dummy.pkl")
    """
    path = _stringify_path(path)
    inferred_compression = _infer_compression(path, compression)
@gfyoung I think the answer is here: this was special-cased for pickle.
Nice, inferring compression when writing dataframes will be really useful. Unfortunately, much of the convenience of ...

Update: I've opened an issue at #22004.
xref #15008
xref #17262
Added an 'infer' option to compression in _get_handle(). This way, for example, to_csv() is made more similar to pandas.read_csv(). The default value remains None to keep backward compatibility.

git diff upstream/master -u -- "*.py" | flake8 --diff

This is my first pull request; if there is something wrong, I'd be happy to change it.

I am not sure about the xfailed entries; I ran pytest --skip-slow --skip-network pandas
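A minimal usage sketch of the behaviour this adds (assuming a pandas version that includes this change; file names are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# The compression type is inferred from the filename extension, so these are
# written gzip- and bz2-compressed respectively, mirroring read_csv/read_json.
df.to_csv("out.csv.gz", compression="infer")
df.to_json("out.json.bz2", compression="infer")

# Reading back also infers the compression from the extension (this was
# already the default behaviour of the readers).
roundtrip = pd.read_csv("out.csv.gz", index_col=0)
```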