Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch arrow type for string array to large string #56220
Switch arrow type for string array to large string #56220
Changes from 11 commits
752e368
d813e8d
ed07537
c0c42a8
3196f32
e807652
1341ffc
6dc3f20
848f7ed
7244889
a24ebac
3d90cc7
46d7f16
3fdf256
c2bd9d2
a01d4e5
a22e625
847b74c
47fda87
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for this cast? (and maybe add a comment about it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow is inferring this as regular strings, I think we had failing tests without this cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm still confused about this as well.
if arr.dtype == "string":
we are still casting topa.string()
? What would the result type ofpa.array(arr, from_pandas=True)
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm the comment above was incorrect, its like this:
We are now using large_string in our String Extension arrays, e.g. if you convert this to an ArrowExtensionArray it will also be large_string. This is inconsistent with the other I/O methods where ArrowExtensionArray is still pa.string, that's why I am casting it back here.
I am happy to change this as well, but rather in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay that makes sense. I'm OK with this then but would be good to have a
# TODO
noting we may want to keep large_string here in the futureThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(it's only the python->arrow converter that doesn't seem to implement this, but creating a dictionary array with large string in pyarrow itself is certainly supported)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, it looks a bit strange that we actually allow creating a string column backed by a dictionary array. It would be nice that long-term we support this, but right now many operations will just fail (eg all string compute functions from pyarrow will fail on a dictionary[string] type).
I think for fixing #53951, instead of allowing dictionary to pass through, we should rather convert the dictionary to a plain string array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do this as a follow up, but I don't think that this is a real use case anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The report in #53951 is a real use case, though (and that will now create such dictionary-backed string column), AFAIU
But indeed for a different issue/PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this also happening on main? maybe I am misunderstanding something