PERF: Improve performance for arrow engine and dtype_backend=pyarrow for datetime conversion #52548
@@ -555,4 +555,19 @@ def time_read_csv_index_col(self):
        read_csv(self.StringIO_input, index_col="a")


class ReadCSVDatePyarrowEngine(StringIORewind):
Can you fuse this benchmark with ReadCSVParseDates above? (You might have to disable time_multiple_dates for the pyarrow engine.)
I'd prefer keeping them separate for now since I am expecting to add more arrow versions that aren't really compatible with the others.
I'm not sure we should backport this to 2.0.x. Parse dates support is still one of the more broken things for the pyarrow parser, so a lot of the date handling stuff could be subject to change. (I am planning to unstale #50056 once my other pyarrow PR is merged).
We almost certainly should backport this. dtype_backend is new, so I am not really concerned with compatibility issues right now (this is only hit if dtype_backend is set). Reasons to backport:
pandas/io/parsers/base_parser.py
Outdated
import pyarrow as pa

dtype = data_dict[colspec].dtype
if isinstance(dtype, pd.ArrowDtype) and pa.types.is_timestamp(
Might make sense to keep date types too, i.e. pa.types.is_date
Added
pandas/io/parsers/base_parser.py
Outdated
@@ -60,6 +60,7 @@
)
from pandas.core.dtypes.missing import isna

import pandas as pd
Nit: Can we just import ArrowDtype below instead?
Done
Thanks for clarifying. This is fine then, since it fixes the issue of us ending up with a numpy dtype (when we should get the Arrow dtype). One last thing, can you add a whatsnew note since the performance difference is pretty big (as you noted)?
I'll phrase it more like a bug fix, but will mention performance as well.
…ow engine and dtype_backend=pyarrow for datetime conversion) (#52592)
Backport PR #52548: PERF: Improve performance for arrow engine and dtype_backend=pyarrow for datetime conversion
Co-authored-by: Patrick Hoefler
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Should we exclude any other arrow types from the conversion?