COMPAT/TST: Ensure that numpy 2 string dtype converts to object array in pandas 2.2.x #58104

Closed
lithomas1 opened this issue Apr 1, 2024 · 6 comments · Fixed by #58202
Labels
Blocker Blocking issue or pull request for an upcoming release Compat pandas objects compatability with Numpy or Python functions Testing pandas testing functions or related to the test suite

Comments

@lithomas1
Member

We should add a test for this, given that numpy 2.0 support is happening for pandas 2.2.2.

@lithomas1 lithomas1 added Testing pandas testing functions or related to the test suite Compat pandas objects compatability with Numpy or Python functions labels Apr 1, 2024
@lithomas1 lithomas1 added this to the 2.2.2 milestone Apr 1, 2024
@lithomas1 lithomas1 added the Blocker Blocking issue or pull request for an upcoming release label Apr 1, 2024
@jbrockmendel
Member

Wouldn’t not-converting be nicer behavior?

@lithomas1
Member Author

> Wouldn’t not-converting be nicer behavior?

I was hoping we could get @ngoldbaum's numpy string array feature into 3.0 in some form, but that's probably too big of a change to land in 2.2.2.

The main thing I want to avoid is a weird error happening somewhere in our internals if a numpy string array is used.
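For readers following along, here is a minimal sketch of what "convert to object up front" could look like, using only numpy. The helper name `coerce_variable_width_strings` is hypothetical and this is not the code pandas actually uses; it just assumes that a StringDType array reports `dtype.kind == "T"` and can be cast with `astype(object)`.

```python
import numpy as np


def coerce_variable_width_strings(arr):
    """Return `arr` as an object array if it uses numpy's StringDType.

    Hypothetical helper for illustration only; not pandas' actual code path.
    """
    # StringDType reports kind "T"; fixed-width unicode arrays report "U".
    if isinstance(arr, np.ndarray) and arr.dtype.kind == "T":
        return arr.astype(object)
    return arr


# StringDType only exists on numpy >= 2.0, so guard the demo.
string_dtype = getattr(getattr(np, "dtypes", None), "StringDType", None)
if string_dtype is not None:
    values = np.array(["a", "bc"], dtype=string_dtype())
    coerced = coerce_variable_width_strings(values)
    print(coerced.dtype)  # object
```

Anything that is not a StringDType array is returned unchanged, so downstream code only ever sees dtypes it already knows how to handle.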

@ngoldbaum
Contributor

One big issue with just accepting stringdtype arrays is that stringdtype can’t support the buffer protocol, so a lot of Cython code in pandas that expects an array type supporting the buffer protocol won’t work. My fork of pandas that adds stringdtype support avoids low-level Cython operations by calling numpy directly.
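A rough way to see the buffer protocol point from Python is to try wrapping an array in `memoryview`. This is only a sketch: `supports_buffer_protocol` is a made-up helper, and the exact exception numpy raises is an assumption, so several types are caught.

```python
import numpy as np


def supports_buffer_protocol(arr):
    """Return True if `arr` can be exported through the buffer protocol."""
    try:
        memoryview(arr)
    except (ValueError, BufferError, TypeError):
        # The exception type is an assumption; catch the plausible ones.
        return False
    return True


# Object arrays export a buffer, which typed Cython code can consume.
object_arr = np.array(["a", "bc"], dtype=object)
print(supports_buffer_protocol(object_arr))

# StringDType only exists on numpy >= 2.0, so guard the comparison.
string_dtype = getattr(getattr(np, "dtypes", None), "StringDType", None)
if string_dtype is not None:
    string_arr = np.array(["a", "bc"], dtype=string_dtype())
    print(supports_buffer_protocol(string_arr))
```

Per the comment above, the second check is expected to report that the export fails, which is why converting to object is the low-risk option for code paths that go through Cython.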

@mroeschke
Member

I would also support not coercing to object by default if possible (IMO it's OK to postpone 2.2.2 a bit longer to wait for compatibility).

Just curious, in numpy 2.0, will np.array(["1"]) default to the new string type?

@ngoldbaum
Contributor

ngoldbaum commented Apr 2, 2024

> Just curious, in numpy 2.0, will np.array(["1"]) default to the new string type?

No, someone needs to explicitly pass e.g. dtype=np.dtypes.StringDType() or dtype="T" somewhere to create a stringdtype array. Maybe in numpy 3.0 we'll change the default.
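As a quick illustration of the answer above (a sketch that guards on the numpy version, since StringDType is only available on numpy >= 2.0):

```python
import numpy as np

# Without an explicit dtype, strings still become fixed-width unicode.
default_arr = np.array(["1"])
print(default_arr.dtype.kind)  # U

# StringDType has to be requested explicitly, by class or by the "T" code.
string_dtype = getattr(getattr(np, "dtypes", None), "StringDType", None)
if string_dtype is not None:
    explicit = np.array(["1"], dtype=string_dtype())
    by_code = np.array(["1"], dtype="T")
    print(explicit.dtype.kind, by_code.dtype.kind)  # T T
```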

If there's interest in helping me out with getting my stringdtype changes upstreamed, I'd really appreciate it. My current latest work is here. Currently it's based on the pandas 3.0 branch but backporting it wouldn't be hard. I was hoping to have a PR open already but between getting pulled into other projects and helping out with shipping numpy 2.0 I didn't quite make it in time.

@phofl
Member

phofl commented Apr 4, 2024

I don't think we should backport this; that change is too large, IMO.
