Deprecated usecols with out of bounds indices in read_csv #41130
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -797,6 +797,7 @@ I/O
- Bug in :func:`read_excel` raising ``AttributeError`` with ``MultiIndex`` header followed by two empty rows and no index, and bug affecting :func:`read_excel`, :func:`read_csv`, :func:`read_table`, :func:`read_fwf`, and :func:`read_clipboard` where one blank row after a ``MultiIndex`` header with no index would be dropped (:issue:`40442`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in :func:`read_csv` raising uncontrolled ``ValueError`` when ``usecols`` index is out of bounds, now raising ``ParserError`` (:issue:`25623`)
I'm not sure what "uncontrolled" means here.
Not raised on purpose by us, but raised because we are accessing a non-existent list index.
What does this mean?
This is a regression on master compared to the 1.2.x series, so we should probably fix it and then deprecate, to avoid changing behavior in 1.3. "Unfortunately" means that if this had not worked on 1.2.x, we could immediately start raising a ParserError without worrying about backwards compatibility.
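The difference between an "uncontrolled" error and a deliberate one can be sketched in plain Python. Everything below is a hypothetical stand-in (the helper names and the ParserError class are not pandas internals), and it assumes non-negative integer positions. The point is only that indexing past the end of a row raises whatever the list access happens to raise, whereas validating first lets the parser choose the exception type and message.

```python
class ParserError(Exception):
    """Hypothetical stand-in for a parser-specific exception."""


def select_unchecked(row, usecols):
    # No validation: an out-of-bounds position surfaces as a bare
    # IndexError raised by the list access itself.
    return [row[i] for i in usecols]


def select_checked(row, usecols):
    # Validate first so the error type and message are chosen on purpose.
    missing = [i for i in usecols if i >= len(row)]
    if missing:
        raise ParserError(
            f"usecols indices {missing} are out of bounds for a row with "
            f"{len(row)} columns"
        )
    return [row[i] for i in usecols]
```

With a two-column row and usecols of [0, 2], the first helper fails with IndexError while the second fails with the explicit ParserError.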
@pytest.mark.parametrize("header", [0, None])
@pytest.mark.parametrize("names", [None, ["a", "b"], ["a", "b", "c"]])
def test_usecols_indices_out_of_bounds(python_parser_only, names, header):
Can this be tested with the CParser too?
See #41129
pandas/io/parsers/python_parser.py
Outdated
columns = [names]
num_original_columns = ncols

return columns, num_original_columns, unnamed_cols

-def _handle_usecols(self, columns, usecols_key):
+def _handle_usecols(self, columns, usecols_key, num_original_columns):
Brief docstring on this new parameter to explain how it differs from columns (and why we couldn't just use columns.length in the logic).
Can you type the args here?
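Both requests in this thread (a docstring for the new parameter and typed arguments) could be met along the following lines. This is a sketch and not the code that was merged: the function is a free-standing, hypothetical simplification, the types are assumptions, and the docstring wording is inferred from the diff context above, where columns is built from names while num_original_columns is taken from ncols.

```python
from typing import Hashable, List, Sequence


def handle_usecols_typed(
    columns: List[List[Hashable]],
    usecols: Sequence[int],
    num_original_columns: int,
) -> List[List[Hashable]]:
    """
    Keep only the requested column positions in each header row.

    Parameters
    ----------
    columns : list of header rows. These may come from user-supplied
        names, so their length need not match the width of the data.
    usecols : positions of the columns to keep.
    num_original_columns : number of columns actually found in the file.
        Used for the bounds check instead of the length of ``columns``.
    """
    # Assumption: positions are non-negative integers.
    keep = [i for i in usecols if i < num_original_columns]
    # Guard against header rows that are shorter than the data.
    return [[row[i] for i in keep if i < len(row)] for row in columns]
```

For example, with names ["a", "b", "c"] but only two columns in the file, position 2 is dropped even though the header row has three entries.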
IMO okay with making this change immediately (without deprecation) because
@gfyoung I was not sure, because this works on 1.2.x without raising an error (it simply ignores the out-of-range indices), but I would be fine with doing this immediately.
Oh, maybe I misunderstood. So if this was working on 1.2.x, then we should deprecate first.
Oh, I see! The "without raising an error" part puts me in agreement with @jreback, then. I would also advocate for deprecation.
Yeah, same on my side. I will mark this as a draft until I have fixed the error on master. Then we can replace the ParserError with a FutureWarning.
…623_python # Conflicts: # doc/source/whatsnew/v1.3.0.rst
# Conflicts: # doc/source/whatsnew/v1.3.0.rst
Now that #41244 has been merged, we can deprecate for both engines.
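The deprecation path agreed on here (keep ignoring out-of-bounds indices as 1.2.x did, but warn that this will become an error) might look roughly like this. The helper name and the warning text are hypothetical, and the sketch assumes non-negative integer positions; it illustrates the pattern rather than the pandas implementation.

```python
import warnings


def select_usecols_with_warning(columns, usecols):
    """Select columns by position, warning on out-of-bounds indices.

    Hypothetical helper mirroring the "ignore, but warn" behavior
    discussed in this thread.
    """
    num_columns = len(columns)
    out_of_bounds = [i for i in usecols if i >= num_columns]
    if out_of_bounds:
        warnings.warn(
            f"usecols indices {out_of_bounds} are out of bounds; "
            "this will raise a ParserError in a future version",
            FutureWarning,
            stacklevel=2,
        )
    # Keep the 1.2.x behavior for now: drop invalid indices, do not raise.
    return [columns[i] for i in usecols if i < num_columns]
```

Callers that pass only valid positions see no warning, so existing code keeps working until the deprecation is enforced.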
small request
Can you type the args here?
Done. Should we use from __future__ import annotations in a follow-up?
Yes, for sure, as that's being done elsewhere in the codebase. Thanks @phofl
This currently raises on master but unfortunately works on 1.2.4, so we can either raise ParserError with 1.2.4, or fix and deprecate, then remove in 2.0. Related to #41129.
I think fixing and deprecating would be more sensible, but I only realised that this works on 1.2.4 after finishing this, so I wanted to put it up for discussion at least :)
cc @gfyoung