-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: read_csv not converting to float for python engine with decimal sep, usecols and parse_dates #38334
Conversation
…sep, usecols and parse_dates
pandas/io/parsers.py
Outdated
@@ -2354,12 +2354,16 @@ def _set_no_thousands_columns(self): | |||
# Create a set of column ids that are not to be stripped of thousands | |||
# operators. | |||
noconvert_columns = set() | |||
if self._col_indices is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort _col_indices when its created. I would try actually just setting it to the list(range(len(self.columns))) otherwise when its set (rather than doing it here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it up and sorted when set immediately
ci/checks is failing |
Fixed the mypy issues. If we do it this way, we are more type safe than before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest removing mypy-related comments.
pandas/io/parsers.py
Outdated
# pandas\io\parsers.py:3159: error: Unsupported right | ||
# operand type for in ("Optional[Any]") [operator] | ||
or i - len(self.index_col) # type: ignore[operator] | ||
in self._col_indices | ||
or i - len(self.index_col) in self._col_indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you do not ignore mypy checking anymore, then the comments above are irrelevant.
Maybe xref #37715?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, that is a good point.
pandas/io/parsers.py
Outdated
@@ -3203,7 +3203,7 @@ def _rows_to_cols(self, content): | |||
# operand type for in ("Optional[Any]") [operator] | |||
a | |||
for i, a in enumerate(zipped_content) | |||
if i in self._col_indices # type: ignore[operator] | |||
if i in self._col_indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here (comment above about mypy error).
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -740,6 +740,7 @@ I/O | |||
- Bug in :meth:`DataFrame.to_hdf` was not dropping missing rows with ``dropna=True`` (:issue:`35719`) | |||
- Bug in :func:`read_html` was raising a ``TypeError`` when supplying a ``pathlib.Path`` argument to the ``io`` parameter (:issue:`37705`) | |||
- :meth:`DataFrame.to_excel`, :meth:`Series.to_excel`, :meth:`DataFrame.to_markdown`, and :meth:`Series.to_markdown` now support writing to fsspec URLs such as S3 and Google Cloud Storage (:issue:`33987`) | |||
- Bug in :meth:`read_csv` returning object dtype when ``delimiter=","`` with ``usecols`` and ``parse_dates`` specified for ``engine="python"`` (:issue:`35873`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be removed off of 1.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm weird, must have missed that- Thx
pandas/io/parsers.py
Outdated
@@ -2336,6 +2335,9 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds): | |||
if self.index_names is None: | |||
self.index_names = index_names | |||
|
|||
if not hasattr(self, "_col_indices"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we always define this way on L2310 or L2297?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not have acces to self.columns in L2297. Sometimes we set this in _infer_columns
in L2302, so we have to check if it was already set to not override. Could move it up a bit but would have to check if it exists nevertheless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you instead define _col_list = None then as the default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would work of course, but we would get the mypy problems in again. Could do that nevertheless if this is preferable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i think its important that this is always defined. you can use Optional[List[int]]
, then you assert its not None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[List[int]] unfortunately raises the mypy error too. assert is not None does not help, so added the ignores back in
@@ -314,3 +314,19 @@ def test_malformed_skipfooter(python_parser_only): | |||
msg = "Expected 3 fields in line 4, saw 5" | |||
with pytest.raises(ParserError, match=msg): | |||
parser.read_csv(StringIO(data), header=1, comment="#", skipfooter=1) | |||
|
|||
|
|||
def test_delimiter_with_usecols_and_parse_dates(python_parser_only): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this not work in c-parser? why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for c too, not quite sure why I tested only for python. Moved it
this looked good last time, can you merge master |
Done |
pandas/io/parsers.py
Outdated
@@ -2336,6 +2335,9 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds): | |||
if self.index_names is None: | |||
self.index_names = index_names | |||
|
|||
if not hasattr(self, "_col_indices"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you instead define _col_list = None then as the default
pandas/io/parsers.py
Outdated
@@ -2336,6 +2335,9 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds): | |||
if self.index_names is None: | |||
self.index_names = index_names | |||
|
|||
if not hasattr(self, "_col_indices"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i think its important that this is always defined. you can use Optional[List[int]]
, then you assert its not None
� Conflicts: � pandas/tests/io/parser/test_dtypes.py
pandas/io/parsers.py
Outdated
@@ -2358,7 +2361,11 @@ def _set(x): | |||
if is_integer(x): | |||
noconvert_columns.add(x) | |||
else: | |||
noconvert_columns.add(self.columns.index(x)) | |||
# pandas\io\parsers.py:2366: error: Unsupported right | |||
# operand type for in ("Optional[List[int]") [index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to assert that the result values is not None (assign it to a new variable first).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much. That helped me a lot.
pandas/io/parsers.py
Outdated
@@ -3186,16 +3192,16 @@ def _rows_to_cols(self, content): | |||
for i, a in enumerate(zipped_content) | |||
if ( | |||
i < len(self.index_col) | |||
# pandas\io\parsers.py:3159: error: Unsupported right | |||
# operand type for in ("Optional[Any]") [operator] | |||
# pandas\io\parsers.py:3198: error: Unsupported right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
seems to have gotten commited? |
These folders were missing init files, so I have added them. Shall I remove them? |
nope this is fine, didnt' realize they weren't there |
…sep, usecols and parse_dates (pandas-dev#38334)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
self.columns is alread resticted to usecols, while data is not restricted yet, so we have use the initial column indices