
PERF: Improve performance for arrow engine and dtype_backend=pyarrow for datetime conversion #52548

Merged (7 commits, Apr 11, 2023)

Conversation

@phofl (Member) commented Apr 8, 2023

Should we exclude any other Arrow types from the conversion?

@phofl added the Performance (Memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Apr 8, 2023
@@ -555,4 +555,19 @@ def time_read_csv_index_col(self):
        read_csv(self.StringIO_input, index_col="a")


class ReadCSVDatePyarrowEngine(StringIORewind):
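The diff viewport above only shows the class line. A minimal sketch of what such an ASV benchmark might look like (the row count, sample dates, and method name are assumptions, not the PR's exact code; StringIORewind is the helper base class defined in the benchmark module):

from io import StringIO

from pandas import read_csv


class ReadCSVDatePyarrowEngine(StringIORewind):
    # Hypothetical benchmark body: time parsing a single date column with the
    # pyarrow engine and pyarrow-backed dtypes.
    def setup(self):
        count_elem = 100_000
        data = "a\n" + "2019-12-31\n" * count_elem
        self.StringIO_input = StringIO(data)

    def time_read_csv_parse_dates(self):
        read_csv(
            self.StringIO_input,
            parse_dates=["a"],
            engine="pyarrow",
            dtype_backend="pyarrow",
        )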
Review comment (Member):
Can you fuse this benchmark with ReadCSVParseDates above?
(You might have to disable time_multiple_dates for the pyarrow engine)

phofl (Member Author) replied:

I'd prefer keeping them separate for now, since I am expecting to add more Arrow versions that aren't really compatible with the others.

@lithomas1 (Member) commented:

I'm not sure we should backport this to 2.0.x.

parse_dates support is still one of the more broken parts of the pyarrow parser, so a lot of the date-handling code could be subject to change.

(I am planning to unstale #50056 once my other pyarrow PR is merged).

@phofl (Member Author) commented Apr 9, 2023

We almost certainly should backport this. dtype_backend is new, so I am not really concerned about compatibility issues right now (this code path is only hit if dtype_backend is set).

Reasons to backport:

  • if parse_dates is set and Arrow has already inferred the format, we still convert back to NumPy (we shouldn't do this)
  • Performance! I have an example from medium where main runs in 7 seconds and this version runs in 0.025 seconds. We should avoid calling to_datetime on ArrowExtensionArrays as far as possible, and since this also makes the result more consistent (you get an Arrow dtype instead of a NumPy dtype), I don't see a reason why we shouldn't backport. (A reproduction sketch follows this list.)
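For reference, a minimal reproduction sketch of the code path being discussed (the file contents and column name are made up; the exact Arrow dtype depends on what pyarrow infers):

from io import StringIO

import pandas as pd

data = "a\n2019-12-31\n2020-01-01\n"

df = pd.read_csv(
    StringIO(data),
    parse_dates=["a"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)

# Before this change, column "a" could come back as a NumPy datetime64 dtype
# even though pyarrow had already parsed it; with the change it keeps an
# Arrow-backed timestamp dtype.
print(df.dtypes)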

@phofl phofl added this to the 2.0.1 milestone Apr 9, 2023
@lithomas1 lithomas1 added the Arrow pyarrow functionality label Apr 10, 2023
import pyarrow as pa

dtype = data_dict[colspec].dtype
if isinstance(dtype, pd.ArrowDtype) and pa.types.is_timestamp(
Review comment (Member):

Might make sense to keep date types too, i.e. pa.types.is_date.

phofl (Member Author) replied:

Added
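A sketch of the resulting check after this suggestion, assuming the conversion is simply skipped when the condition holds (the helper name is hypothetical; the real diff inlines the condition):

import pyarrow as pa

import pandas as pd


def is_already_arrow_datetime(dtype) -> bool:
    # Hypothetical helper mirroring the check in the diff above: skip
    # to_datetime when the column already holds an Arrow-backed timestamp
    # or (per the suggestion) date type.
    return isinstance(dtype, pd.ArrowDtype) and (
        pa.types.is_timestamp(dtype.pyarrow_dtype)
        or pa.types.is_date(dtype.pyarrow_dtype)
    )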

@@ -60,6 +60,7 @@
)
from pandas.core.dtypes.missing import isna

import pandas as pd
Review comment (Member):

Nit: Can we just import ArrowDtype below instead?

phofl (Member Author) replied:

Done

@lithomas1 (Member) commented:

> Reasons to backport:
>
>   • if parse_dates is set and Arrow has already inferred the format, we still convert back to NumPy (we shouldn't do this)
>   • Performance! I have an example from medium where main runs in 7 seconds and this version runs in 0.025 seconds. We should avoid calling to_datetime on ArrowExtensionArrays as far as possible, and since this also makes the result more consistent (you get an Arrow dtype instead of a NumPy dtype), I don't see a reason why we shouldn't backport.

Thanks for clarifying. This is fine then, since it fixes the issue of ending up with a NumPy dtype when we should get the Arrow dtype.

One last thing: can you add a whatsnew note, since the performance difference is pretty big (as you noted)?

@phofl (Member Author) commented Apr 10, 2023

I'll phrase it more like a bug fix, but will mention the performance improvement as well.

Labels: Arrow (pyarrow functionality), IO CSV (read_csv, to_csv), Performance (Memory or execution speed performance)

Linked issue (may be closed by merging this pull request):
PERF: read_csv should check if column is already a datetime column before initiating the conversion
3 participants