
Switch arrow type for string array to large string #56220

Merged — 19 commits merged into pandas-dev:main on Dec 21, 2023

Conversation

phofl (Member) commented Nov 28, 2023:

`large_string` is a more sensible default: `take` concatenates the chunks in pyarrow, which can overflow the 32-bit offsets of `string` pretty quickly; `large_string` avoids this.

One todo for a follow-up:

  • ensure interoperability with "string[pyarrow]"

Let's see if CI likes this

@phofl phofl requested a review from mroeschke November 29, 2023 19:25
# roundtrip possible from arrow 1.0.0
pa = pytest.importorskip("pyarrow")

if dtype.storage == "pyarrow_numpy" and string_storage2 == "pyarrow":
Reviewer (Member):
IMO we should just change both to be large_string

phofl (Member, Author):

Fine by me but not sure if we should deprecate the other one first?

Reviewer (Member):

Is there any behavior difference expected between string and large string? If not, I don't think this needs a deprecation; I would consider it an implementation detail / feature.

phofl (Member, Author):

Not inside of pandas, no, but I don't know what happens if you take it outside of pandas

Reviewer (Member):

I would also change both at the same time (officially String dtype is also still considered as experimental).

It will change your schema when you convert to Arrow, and so for sure people will have things to update, although I assume (hope) it will be mostly tests that are checking the exact type.

Reviewer (Member):

Makes sense. Guessing some things at the binary level (ex: pickle compatibility) might change across versions too

phofl (Member, Author):

I'll make the change later and then we can merge

I can monitor some of the low level stuff on the Dask CI

phofl (Member, Author):

Updated to switch to large strings for both

@mroeschke mroeschke added Strings String extension data type and string data Arrow pyarrow functionality labels Nov 30, 2023
@phofl phofl added this to the 2.2 milestone Dec 9, 2023
phofl (Member, Author) commented Dec 10, 2023:

cc @mroeschke this is green now (pending merge conflicts)

pandas/core/arrays/string_arrow.py (outdated, resolved)
pandas/io/sql.py (outdated)
Comment on lines 177 to 182:
result_arrays = []
for arr in arrays:
pa_array = pa.array(arr, from_pandas=True)
if arr.dtype == "string":
pa_array = pc.cast(pa_array, pa.string())
result_arrays.append(ArrowExtensionArray(pa_array))
Reviewer (Member):

What's the reason for this cast? (and maybe add a comment about it)

phofl (Member, Author):

arrow is inferring this as regular strings, I think we had failing tests without this cast

Reviewer (Member):

Yeah, I'm still confused about this as well. Under if arr.dtype == "string" we are still casting to pa.string()? What would the result type of pa.array(arr, from_pandas=True) be?

phofl (Member, Author):

Hm, the comment above was incorrect; it's like this:

We are now using large_string in our String extension arrays, e.g. if you convert one to an ArrowExtensionArray it will also be large_string. This is inconsistent with the other I/O methods, where ArrowExtensionArray is still pa.string; that's why I am casting it back here.

I am happy to change this as well, but rather in a follow-up.

Reviewer (Member):

Ah okay that makes sense. I'm OK with this then but would be good to have a # TODO noting we may want to keep large_string here in the future

@pytest.mark.parametrize("chunked", [True, False])
def test_constructor_valid_string_type_value_dictionary(chunked):
pa = pytest.importorskip("pyarrow")

-    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.utf8()))
+    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.large_string()))
Reviewer (Member):

Suggested change:
-    arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.large_string()))
+    arr = pa.array(["1", "2", "3"], pa.large_string()).dictionary_encode()

(it's only the python->arrow converter that doesn't seem to implement this, but creating a dictionary array with large string in pyarrow itself is certainly supported)

Reviewer (Member):

Additionally, it looks a bit strange that we actually allow creating a string column backed by a dictionary array. It would be nice to support this long-term, but right now many operations will just fail (e.g. all string compute functions from pyarrow will fail on a dictionary[string] type).

I think for fixing #53951, instead of allowing dictionary to pass through, we should rather convert the dictionary to a plain string array?

phofl (Member, Author):

We can do this as a follow up, but I don't think that this is a real use case anyway

Reviewer (Member):

The report in #53951 is a real use case, though (and that will now create such dictionary-backed string column), AFAIU

But indeed for a different issue/PR

phofl (Member, Author):

Isn't this also happening on main? Maybe I am misunderstanding something.

phofl (Member, Author) commented Dec 15, 2023:

this is green now, so should be ready to merge

# Conflicts:
#	pandas/core/arrays/arrow/array.py
pandas/io/sql.py (outdated, resolved)
@phofl phofl merged commit 2488e5e into pandas-dev:main Dec 21, 2023
45 checks passed
@phofl phofl deleted the large_string branch December 21, 2023 21:05
cbpygit pushed a commit to cbpygit/pandas that referenced this pull request Jan 2, 2024
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Feb 20, 2024
This PR adds support for the `large_string` type of `arrow` arrays in `cudf`. `cudf` strings columns lack 64-bit offset support, which is WIP: #13733

This workaround is essential because `pandas-2.2+` now defaults to the `large_string` type for arrow strings instead of `string`: pandas-dev/pandas#56220

This PR fixes all 25 `dask-cudf` failures.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Ashwin Srinath (https://github.com/shwina)

URL: #15093
Successfully merging this pull request may close these issues.

BUG: new string dtype fails with >2 GB of data in a single column