PERF: Series.str.get_dummies for ArrowDtype(pa.string()) #53655

Merged (3 commits) on Jun 14, 2023

lukemanley (Member) commented Jun 13, 2023

import pandas as pd
import pyarrow as pa

data = ["a|b|c", "a|b", "a|c", "b|c", "a", "b", "c"] * 1000
ser = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))

%timeit ser.str.get_dummies() 

# 549 ms ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     -> main
# 8.67 ms ± 415 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  -> PR
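
For readers unfamiliar with what is being timed, here is a minimal pure-Python sketch of the result `Series.str.get_dummies()` produces with the default `|` separator: one indicator column per distinct token, in sorted order. The helper name `get_dummies_sketch` is hypothetical and exists only for illustration. This is not the pandas implementation and not the code changed in this PR; it only mirrors the observable behavior on the benchmark's seven base strings, assuming missing values and empty tokens contribute no columns.

```python
def get_dummies_sketch(values, sep="|"):
    """Illustrative stand-in for Series.str.get_dummies (hypothetical helper).

    Returns (sorted column labels, list of 0/1 indicator rows).
    """
    # Split each string into its set of tokens; None yields no tokens,
    # and empty tokens (e.g. from "a||b") are dropped.
    token_sets = [
        {tok for tok in v.split(sep) if tok} if v is not None else set()
        for v in values
    ]
    # Column labels are the sorted union of all tokens seen.
    columns = sorted(set().union(*token_sets))
    # One row per input value: 1 if the token is present, else 0.
    rows = [[int(col in tokens) for col in columns] for tokens in token_sets]
    return columns, rows


columns, rows = get_dummies_sketch(["a|b|c", "a|b", "a|c", "b|c", "a", "b", "c"])
print(columns)  # ['a', 'b', 'c']
for row in rows:
    print(row)
# [1, 1, 1]
# [1, 1, 0]
# [1, 0, 1]
# [0, 1, 1]
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```

The benchmark above repeats these seven strings 1000 times, so the timed call builds a 7000-row, 3-column indicator frame.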

@lukemanley lukemanley added the Performance (memory or execution speed performance), Strings (string extension data type and string data), and Arrow (pyarrow functionality) labels Jun 13, 2023
@lukemanley lukemanley added this to the 2.1 milestone Jun 13, 2023
@mroeschke mroeschke merged commit e27b0e7 into pandas-dev:main Jun 14, 2023
mroeschke (Member) commented:

Nice optimizations as usual. Thanks @lukemanley

mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Jun 15, 2023
…53655)

* PERF: Series.str.get_dummies for ArrowDtype(pa.string())

* whatsnew

* typing
mroeschke added a commit that referenced this pull request Jun 21, 2023
* CI: Build pandas even if doctests fail

* BUG: groupby sum turning `inf+inf` and `(-inf)+(-inf)` into `nan` (#53623)

* DEPR: method, limit in NDFrame.replace (#53492)

* DEPR: method, limit in NDFrame.replace

* update test, docs

* suppress doctest warning

* doctests

* PERF: Series.str.get_dummies for ArrowDtype(pa.string()) (#53655)

* PERF: Series.str.get_dummies for ArrowDtype(pa.string())

* whatsnew

* typing

* TYP: core.missing (#53625)

* CI: Attempt to fix wheel builds (#53670)

* DOC: Fixing EX01 - Added examples (#53647)

* SeriesGroupBy.fillna example added

* Added examples

* Corrected failing test for timedelta.total_seconds

* Corrected fillna example

* CI/TST: Mark test_to_read_gcs as single_cpu (#53677)

* BUG/CoW: is_range_indexer can't handle very large arrays (#53672)

* BUG: is_range_indexer can't handle very large arrays

* fix test on 32-bit

* TST: Use more pytest fixtures

---------

Co-authored-by: Yao Xiao
Co-authored-by: jbrockmendel
Co-authored-by: Luke Manley
Co-authored-by: Thomas Li
Co-authored-by: Dea María Léon
@lukemanley lukemanley deleted the arrow-str-get-dummies branch June 22, 2023 21:59
canthonyscott pushed two commits to canthonyscott/pandas-anthony that referenced this pull request Jun 23, 2023 (same commit messages as the two mroeschke commits above)
Daquisu pushed two commits to Daquisu/pandas that referenced this pull request Jul 8, 2023 (same commit messages as the two mroeschke commits above)
Labels: Arrow (pyarrow functionality), Performance (memory or execution speed performance), Strings (string extension data type and string data)