Switch arrow type for string array to large string #56220
Conversation
```python
# roundtrip possible from arrow 1.0.0
pa = pytest.importorskip("pyarrow")

if dtype.storage == "pyarrow_numpy" and string_storage2 == "pyarrow":
```
IMO we should just change both to be large_string
Fine by me but not sure if we should deprecate the other one first?
Is there any behavior difference expected between string and large string? If not, I don't think this needs a deprecation. I would consider it an implementation detail / feature.
Not inside of pandas, no, but I don't know what happens if you take it outside of pandas
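For context, the two Arrow types differ only in offset width, and values round-trip between them unchanged. A minimal sketch, assuming a recent pyarrow:

```python
import pyarrow as pa

# string uses 32-bit offsets (at most ~2 GiB of character data per array);
# large_string uses 64-bit offsets. The values are otherwise identical.
s = pa.array(["a", "bc"], type=pa.string())
ls = pa.array(["a", "bc"], type=pa.large_string())
print(s.type, ls.type)                       # string large_string
print(s.cast(pa.large_string()).equals(ls))  # True
```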
I would also change both at the same time (officially, the String dtype is also still considered experimental).
It will change your schema when you convert to Arrow, so people will certainly have things to update, although I assume (hope) it will mostly be tests that check the exact type.
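A sketch of the schema change in question, assuming a pandas build that includes this change and pyarrow installed (exact output depends on versions):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"col": pd.array(["x", "y"], dtype="string[pyarrow]")})

# Before this change the Arrow schema reported `col: string`; with
# large_string-backed arrays it reports `col: large_string`, so code
# asserting the exact Arrow type needs updating.
print(pa.table(df).schema)
```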
Makes sense. Guessing some things at the binary level (ex: pickle compatibility) might change across versions too
I'll make the change later and then we can merge
I can monitor some of the low level stuff on the Dask CI
Updated to switch to large strings for both
cc @mroeschke this is green now (pending merge conflicts)
pandas/io/sql.py
```python
result_arrays = []
for arr in arrays:
    pa_array = pa.array(arr, from_pandas=True)
    if arr.dtype == "string":
        pa_array = pc.cast(pa_array, pa.string())
    result_arrays.append(ArrowExtensionArray(pa_array))
```
What's the reason for this cast? (and maybe add a comment about it)
arrow is inferring this as regular strings, I think we had failing tests without this cast
Yeah, I'm still confused about this as well. If `arr.dtype == "string"`, why are we still casting to `pa.string()`? What would the result type of `pa.array(arr, from_pandas=True)` be?
Hm, the comment above was incorrect; it's like this:
We are now using large_string in our String extension arrays, e.g. if you convert this to an ArrowExtensionArray it will also be large_string. This is inconsistent with the other I/O methods, where ArrowExtensionArray is still pa.string; that's why I am casting it back here.
I am happy to change this as well, but rather in a follow-up.
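A sketch of what this means in practice, assuming the new large_string-backed string arrays (the exact inferred type depends on the pandas/pyarrow versions):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

arr = pd.array(["a", None], dtype="string[pyarrow]")
pa_array = pa.array(arr, from_pandas=True)
print(pa_array.type)  # large_string with the new backing type

# Cast back so the dtype_backend="pyarrow" result matches the other
# I/O methods, which still produce pa.string columns.
pa_array = pc.cast(pa_array, pa.string())
print(pa_array.type)  # string
```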
Ah okay, that makes sense. I'm OK with this then, but it would be good to have a `# TODO` noting we may want to keep large_string here in the future.
```diff
 @pytest.mark.parametrize("chunked", [True, False])
 def test_constructor_valid_string_type_value_dictionary(chunked):
     pa = pytest.importorskip("pyarrow")

-    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.utf8()))
+    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.large_string()))
```
Suggested change:

```diff
-arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.large_string()))
+arr = pa.array(["1", "2", "3"], pa.large_string()).dictionary_encode()
```
(It's only the Python-to-Arrow converter that doesn't seem to implement this; creating a dictionary array with large_string values in pyarrow itself is certainly supported.)
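A quick sketch of the suggested construction, assuming a recent pyarrow:

```python
import pyarrow as pa

# Per the discussion above, building dictionary<large_string> directly
# from Python objects isn't supported by the converter, but encoding an
# existing large_string array works.
arr = pa.array(["1", "2", "3"], pa.large_string()).dictionary_encode()
print(arr.type)  # dictionary<values=large_string, indices=int32, ordered=0>
```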
Additionally, it looks a bit strange that we actually allow creating a string column backed by a dictionary array. It would be nice to support this long-term, but right now many operations will just fail (e.g. all string compute functions from pyarrow will fail on a dictionary[string] type).
I think for fixing #53951, instead of allowing dictionary to pass through, we should rather convert the dictionary to a plain string array?
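A sketch of the failure mode described here, assuming current pyarrow behavior (string kernels don't accept dictionary-encoded input):

```python
import pyarrow as pa
import pyarrow.compute as pc

dict_arr = pa.array(["a", "b"], pa.large_string()).dictionary_encode()
try:
    pc.utf8_upper(dict_arr)
except pa.ArrowNotImplementedError as exc:
    print(exc)  # no kernel matching the dictionary input type

# Decoding to a plain (large_)string array first makes the kernels work.
print(pc.utf8_upper(dict_arr.dictionary_decode()))
```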
We can do this as a follow-up, but I don't think this is a real use case anyway.
The report in #53951 is a real use case, though (and that will now create such dictionary-backed string column), AFAIU
But indeed for a different issue/PR
isn't this also happening on main? maybe I am misunderstanding something
Co-authored-by: Joris Van den Bossche <[email protected]>
this is green now, so should be ready to merge
# Conflicts:
#	pandas/core/arrays/arrow/array.py
This PR adds support for the `large_string` type of `arrow` arrays in `cudf`. `cudf` strings columns lack 64-bit offset support, which is WIP: #13733. This workaround is essential because `pandas-2.2+` now defaults to the `large_string` type for arrow strings instead of `string`: pandas-dev/pandas#56220

This PR fixes all 25 `dask-cudf` failures.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)
- Ashwin Srinath (https://github.com/shwina)

URL: #15093
Large string is a more sensible default: take() concatenates the chunks in pyarrow, which can cause offset overflows pretty quickly; large string should avoid this.
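A small sketch of the offset-width difference behind the overflow concern (illustrative only; actually triggering an overflow needs more than ~2 GiB of string data):

```python
import pyarrow as pa

# string offsets are int32 (~2 GiB of character data max per array);
# large_string offsets are int64. Since take() on a chunked array
# concatenates the chunks, a handful of big string chunks can overflow
# the 32-bit offsets, while 64-bit offsets cannot overflow in practice.
arr32 = pa.array(["ab", "c"], type=pa.string())
arr64 = arr32.cast(pa.large_string())

# The offsets buffer doubles in width: three offsets at 4 vs 8 bytes each.
print(arr32.buffers()[1].size, arr64.buffers()[1].size)  # 12 24
```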
One todo for a follow-up: "string[pyarrow]"
Let's see if CI likes this