
PERF: Improve performance for arrow engine and dtype_backend=pyarrow for datetime conversion #52548

Merged (7 commits, Apr 11, 2023)

Conversation

@phofl (Member) commented Apr 8, 2023

Should we exclude any other Arrow types from the conversion?

@phofl added the Performance (Memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Apr 8, 2023
@@ -555,4 +555,19 @@ def time_read_csv_index_col(self):
        read_csv(self.StringIO_input, index_col="a")


class ReadCSVDatePyarrowEngine(StringIORewind):
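The diff viewport above only shows the class line. A minimal sketch of what such an ASV benchmark might look like (the row count, sample dates, and method name are assumptions, not the PR's exact code; StringIORewind is the helper base class defined in the benchmark module):

from io import StringIO

from pandas import read_csv


class ReadCSVDatePyarrowEngine(StringIORewind):
    # Hypothetical benchmark body: time parsing a single date column with the
    # pyarrow engine and pyarrow-backed dtypes.
    def setup(self):
        count_elem = 100_000
        data = "a\n" + "2019-12-31\n" * count_elem
        self.StringIO_input = StringIO(data)

    def time_read_csv_parse_dates(self):
        read_csv(
            self.StringIO_input,
            parse_dates=["a"],
            engine="pyarrow",
            dtype_backend="pyarrow",
        )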
Review comment (Member):
Can you fuse this benchmark with ReadCSVParseDates above?
(You might have to disable time_multiple_dates for the pyarrow engine)

phofl (Member Author) replied:

I'd prefer keeping them separate for now, since I am expecting to add more Arrow versions that aren't really compatible with the others.

@lithomas1 (Member) commented:

I'm not sure we should backport this to 2.0.x.

parse_dates support is still one of the more broken parts of the pyarrow parser, so a lot of the date-handling code could be subject to change.

(I am planning to unstale #50056 once my other pyarrow PR is merged).

@phofl (Member Author) commented Apr 9, 2023

We almost certainly should backport this. dtype_backend is new, so I am not really concerned about compatibility issues right now (this code path is only hit if dtype_backend is set).

Reasons to backport:

  • if parse_dates is set and Arrow has already inferred the format, we still convert back to NumPy (we shouldn't do this)
  • Performance! I have an example from medium where main runs in 7 seconds and this version runs in 0.025 seconds. We should avoid calling to_datetime on ArrowExtensionArrays as far as possible, and since this also makes the result more consistent (you get an Arrow dtype instead of a NumPy dtype), I don't see a reason why we shouldn't backport. (A reproduction sketch follows this list.)
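For reference, a minimal reproduction sketch of the code path being discussed (the file contents and column name are made up; the exact Arrow dtype depends on what pyarrow infers):

from io import StringIO

import pandas as pd

data = "a\n2019-12-31\n2020-01-01\n"

df = pd.read_csv(
    StringIO(data),
    parse_dates=["a"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)

# Before this change, column "a" could come back as a NumPy datetime64 dtype
# even though pyarrow had already parsed it; with the change it keeps an
# Arrow-backed timestamp dtype.
print(df.dtypes)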

@phofl phofl added this to the 2.0.1 milestone Apr 9, 2023
@lithomas1 lithomas1 added the Arrow pyarrow functionality label Apr 10, 2023
import pyarrow as pa

dtype = data_dict[colspec].dtype
if isinstance(dtype, pd.ArrowDtype) and pa.types.is_timestamp(
Review comment (Member):

Might make sense to keep date types too, i.e. pa.types.is_date.

phofl (Member Author) replied:

Added
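A sketch of the resulting check after this suggestion, assuming the conversion is simply skipped when the condition holds (the helper name is hypothetical; the real diff inlines the condition):

import pyarrow as pa

import pandas as pd


def is_already_arrow_datetime(dtype) -> bool:
    # Hypothetical helper mirroring the check in the diff above: skip
    # to_datetime when the column already holds an Arrow-backed timestamp
    # or (per the suggestion) date type.
    return isinstance(dtype, pd.ArrowDtype) and (
        pa.types.is_timestamp(dtype.pyarrow_dtype)
        or pa.types.is_date(dtype.pyarrow_dtype)
    )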

@@ -60,6 +60,7 @@
)
from pandas.core.dtypes.missing import isna

import pandas as pd
Review comment (Member):

Nit: Can we just import ArrowDtype below instead?

phofl (Member Author) replied:

Done

@lithomas1 (Member) commented:

> Reasons to backport:
>
>   • if parse_dates is set and Arrow has already inferred the format, we still convert back to NumPy (we shouldn't do this)
>   • Performance! I have an example from medium where main runs in 7 seconds and this version runs in 0.025 seconds. We should avoid calling to_datetime on ArrowExtensionArrays as far as possible, and since this also makes the result more consistent (you get an Arrow dtype instead of a NumPy dtype), I don't see a reason why we shouldn't backport.

Thanks for clarifying. This is fine then, since it fixes the issue of ending up with a NumPy dtype when we should get the Arrow dtype.

One last thing: can you add a whatsnew note, since the performance difference is pretty big (as you noted)?

@phofl (Member Author) commented Apr 10, 2023

I'll phrase it more like a bug fix, but will mention the performance improvement as well.

Labels: Arrow (pyarrow functionality), IO CSV (read_csv, to_csv), Performance (Memory or execution speed performance)

Linked issue (may be closed by merging this pull request):
PERF: read_csv should check if column is already a datetime column before initiating the conversion
3 participants