Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: read_csv not converting to float for python engine with decimal sep, usecols and parse_dates #38334

Merged
merged 21 commits into from
Jan 3, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Dec 6, 2020

self.columns is alread resticted to usecols, while data is not restricted yet, so we have use the initial column indices

@phofl phofl added the IO CSV read_csv, to_csv label Dec 6, 2020
@mroeschke mroeschke added this to the 1.2 milestone Dec 6, 2020
@@ -2354,12 +2354,16 @@ def _set_no_thousands_columns(self):
# Create a set of column ids that are not to be stripped of thousands
# operators.
noconvert_columns = set()
if self._col_indices is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort _col_indices when its created. I would try actually just setting it to the list(range(len(self.columns))) otherwise when its set (rather than doing it here).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it up and sorted when set immediately

@jreback jreback removed this from the 1.2 milestone Dec 7, 2020
@jreback
Copy link
Contributor

jreback commented Dec 8, 2020

ci/checks is failing

@phofl
Copy link
Member Author

phofl commented Dec 8, 2020

Fixed the mypy issues. If we do it this way, we are more type safe than before

Copy link
Member

@ivanovmg ivanovmg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest removing mypy-related comments.

Comment on lines 3195 to 3197
# pandas\io\parsers.py:3159: error: Unsupported right
# operand type for in ("Optional[Any]") [operator]
or i - len(self.index_col) # type: ignore[operator]
in self._col_indices
or i - len(self.index_col) in self._col_indices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you do not ignore mypy checking anymore, then the comments above are irrelevant.
Maybe xref #37715?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, that is a good point.

@@ -3203,7 +3203,7 @@ def _rows_to_cols(self, content):
# operand type for in ("Optional[Any]") [operator]
a
for i, a in enumerate(zipped_content)
if i in self._col_indices # type: ignore[operator]
if i in self._col_indices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here (comment above about mypy error).

@@ -740,6 +740,7 @@ I/O
- Bug in :meth:`DataFrame.to_hdf` was not dropping missing rows with ``dropna=True`` (:issue:`35719`)
- Bug in :func:`read_html` was raising a ``TypeError`` when supplying a ``pathlib.Path`` argument to the ``io`` parameter (:issue:`37705`)
- :meth:`DataFrame.to_excel`, :meth:`Series.to_excel`, :meth:`DataFrame.to_markdown`, and :meth:`Series.to_markdown` now support writing to fsspec URLs such as S3 and Google Cloud Storage (:issue:`33987`)
- Bug in :meth:`read_csv` returning object dtype when ``delimiter=","`` with ``usecols`` and ``parse_dates`` specified for ``engine="python"`` (:issue:`35873`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be removed off of 1.2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm weird, must have missed that- Thx

@jreback jreback added this to the 1.3 milestone Dec 14, 2020
@jreback jreback added the Bug label Dec 14, 2020
@@ -2336,6 +2335,9 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
if self.index_names is None:
self.index_names = index_names

if not hasattr(self, "_col_indices"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we always define this way on L2310 or L2297?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not have acces to self.columns in L2297. Sometimes we set this in _infer_columns in L2302, so we have to check if it was already set to not override. Could move it up a bit but would have to check if it exists nevertheless

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you instead define _col_list = None then as the default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work of course, but we would get the mypy problems in again. Could do that nevertheless if this is preferable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i think its important that this is always defined. you can use Optional[List[int]], then you assert its not None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional[List[int]] unfortunately raises the mypy error too. assert is not None does not help, so added the ignores back in

@@ -314,3 +314,19 @@ def test_malformed_skipfooter(python_parser_only):
msg = "Expected 3 fields in line 4, saw 5"
with pytest.raises(ParserError, match=msg):
parser.read_csv(StringIO(data), header=1, comment="#", skipfooter=1)


def test_delimiter_with_usecols_and_parse_dates(python_parser_only):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this not work in c-parser? why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for c too, not quite sure why I tested only for python. Moved it

@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

this looked good last time, can you merge master

@phofl
Copy link
Member Author

phofl commented Dec 29, 2020

Done

@@ -2336,6 +2335,9 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
if self.index_names is None:
self.index_names = index_names

if not hasattr(self, "_col_indices"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you instead define _col_list = None then as the default

@@ -2336,6 +2335,9 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
if self.index_names is None:
self.index_names = index_names

if not hasattr(self, "_col_indices"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i think its important that this is always defined. you can use Optional[List[int]], then you assert its not None

@@ -2358,7 +2361,11 @@ def _set(x):
if is_integer(x):
noconvert_columns.add(x)
else:
noconvert_columns.add(self.columns.index(x))
# pandas\io\parsers.py:2366: error: Unsupported right
# operand type for in ("Optional[List[int]") [index]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to assert that the result values is not None (assign it to a new variable first).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much. That helped me a lot.

@@ -3186,16 +3192,16 @@ def _rows_to_cols(self, content):
for i, a in enumerate(zipped_content)
if (
i < len(self.index_col)
# pandas\io\parsers.py:3159: error: Unsupported right
# operand type for in ("Optional[Any]") [operator]
# pandas\io\parsers.py:3198: error: Unsupported right
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

pandas/tests/io/parser/usecols/init.py

seems to have gotten commited?

@phofl
Copy link
Member Author

phofl commented Jan 3, 2021

These folders were missing init files, so I have added them. Shall I remove them?

@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

These folders were missing init files, so I have added them. Shall I remove them?

nope this is fine, didnt' realize they weren't there

@jreback jreback merged commit fb47c75 into pandas-dev:master Jan 3, 2021
@phofl phofl deleted the 35873 branch January 3, 2021 23:28
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: PythonParser does not use decimal separator when usecols and parse_date are specified
4 participants