
Switch arrow type for string array to large string #56220

Merged — 19 commits merged into pandas-dev:main on Dec 21, 2023

Conversation

phofl (Member) commented Nov 28, 2023:

`large_string` is a more sensible default: `take` concatenates the chunks in pyarrow, which can overflow the 32-bit offsets of `string` pretty quickly; `large_string` avoids this.

One todo for a follow-up:

  • ensure interoperability with "string[pyarrow]"

Let's see if CI likes this

@phofl phofl requested a review from mroeschke November 29, 2023 19:25
# roundtrip possible from arrow 1.0.0
pa = pytest.importorskip("pyarrow")

if dtype.storage == "pyarrow_numpy" and string_storage2 == "pyarrow":
Reviewer (Member):
IMO we should just change both to be large_string

phofl (Member, Author):

Fine by me but not sure if we should deprecate the other one first?

Reviewer (Member):

Is there any behavior difference expected between string and large string? If not, I don't think this needs a deprecation; I would consider it an implementation detail / feature.

phofl (Member, Author):

Not inside of pandas, no, but I don't know what happens if you take it outside of pandas

Reviewer (Member):

I would also change both at the same time (officially String dtype is also still considered as experimental).

It will change your schema when you convert to Arrow, and so for sure people will have things to update, although I assume (hope) it will be mostly tests that are checking the exact type.

Reviewer (Member):

Makes sense. Guessing some things at the binary level (ex: pickle compatibility) might change across versions too

phofl (Member, Author):

I'll make the change later and then we can merge

I can monitor some of the low level stuff on the Dask CI

phofl (Member, Author):

Updated to switch to large strings for both

@mroeschke mroeschke added Strings String extension data type and string data Arrow pyarrow functionality labels Nov 30, 2023
@phofl phofl added this to the 2.2 milestone Dec 9, 2023
phofl (Member, Author) commented Dec 10, 2023:

cc @mroeschke this is green now (pending merge conflicts)

pandas/core/arrays/string_arrow.py (outdated, resolved)
pandas/io/sql.py (outdated)
Comment on lines 177 to 182:
result_arrays = []
for arr in arrays:
pa_array = pa.array(arr, from_pandas=True)
if arr.dtype == "string":
pa_array = pc.cast(pa_array, pa.string())
result_arrays.append(ArrowExtensionArray(pa_array))
Reviewer (Member):

What's the reason for this cast? (and maybe add a comment about it)

phofl (Member, Author):

arrow is inferring this as regular strings, I think we had failing tests without this cast

Reviewer (Member):

Yeah, I'm still confused about this as well. Under if arr.dtype == "string" we are still casting to pa.string()? What would the result type of pa.array(arr, from_pandas=True) be?

phofl (Member, Author):

Hm, the comment above was incorrect; it's like this:

We are now using large_string in our String extension arrays, e.g. if you convert one to an ArrowExtensionArray it will also be large_string. This is inconsistent with the other I/O methods, where ArrowExtensionArray is still pa.string; that's why I am casting it back here.

I am happy to change this as well, but rather in a follow-up.

Reviewer (Member):

Ah okay that makes sense. I'm OK with this then but would be good to have a # TODO noting we may want to keep large_string here in the future

@pytest.mark.parametrize("chunked", [True, False])
def test_constructor_valid_string_type_value_dictionary(chunked):
pa = pytest.importorskip("pyarrow")

-    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.utf8()))
+    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.large_string()))
Reviewer (Member):

Suggested change:
-    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.large_string()))
+    arr = pa.array(["1", "2", "3"], pa.large_string()).dictionary_encode()

(it's only the python->arrow converter that doesn't seem to implement this, but creating a dictionary array with large string in pyarrow itself is certainly supported)

Reviewer (Member):

Additionally, it looks a bit strange that we actually allow creating a string column backed by a dictionary array. It would be nice to support this long-term, but right now many operations will just fail (e.g. all string compute functions from pyarrow will fail on a dictionary[string] type).

I think for fixing #53951, instead of allowing dictionary to pass through, we should rather convert the dictionary to a plain string array?

phofl (Member, Author):

We can do this as a follow up, but I don't think that this is a real use case anyway

Reviewer (Member):

The report in #53951 is a real use case, though (and that will now create such dictionary-backed string column), AFAIU

But indeed for a different issue/PR

phofl (Member, Author):

Isn't this also happening on main? Maybe I am misunderstanding something.

phofl (Member, Author) commented Dec 15, 2023:

this is green now, so should be ready to merge

# Conflicts:
#	pandas/core/arrays/arrow/array.py
pandas/io/sql.py (outdated, resolved)
@phofl phofl merged commit 2488e5e into pandas-dev:main Dec 21, 2023
45 checks passed
@phofl phofl deleted the large_string branch December 21, 2023 21:05
cbpygit pushed a commit to cbpygit/pandas that referenced this pull request Jan 2, 2024
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Feb 20, 2024
This PR adds support for the `large_string` type of `arrow` arrays in `cudf`. `cudf` strings columns lack 64-bit offset support, which is WIP: #13733

This workaround is essential because `pandas-2.2+` now defaults to the `large_string` type for arrow strings instead of `string`: pandas-dev/pandas#56220

This PR fixes all 25 `dask-cudf` failures.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Ashwin Srinath (https://github.com/shwina)

URL: #15093
Successfully merging this pull request may close these issues.

BUG: new string dtype fails with >2 GB of data in a single column