Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix astype str issue 54654 #54687

Merged
merged 13 commits into from
Oct 29, 2023

Conversation

Itayazolay
Copy link
Contributor

@Itayazolay Itayazolay commented Aug 22, 2023

on pickle roundtrip astype(str) might change original array even when copy is True
@Itayazolay Itayazolay requested a review from WillAyd as a code owner August 22, 2023 11:15
@Itayazolay Itayazolay changed the title Fix astype str issue 54654 Bug: Fix astype str issue 54654 Aug 22, 2023
@Itayazolay Itayazolay changed the title Bug: Fix astype str issue 54654 BUG: Fix astype str issue 54654 Aug 22, 2023
@Itayazolay
Copy link
Contributor Author

Hey, this PR is ready for review, is there anything else I should be doing?

@@ -775,7 +775,7 @@ cpdef ndarray[object] ensure_string_array(

result = np.asarray(arr, dtype="object")

if copy and result is arr:
if copy and (result is arr or np.may_share_memory(arr, result)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason for np.may_share_memory vs np.shares_memory? can you add a # GH#54654 below this line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the docs may_share_memory does only bounds check for the array, but may return false positives in general.
shares_memory is more computationally expensive.
In this case, asarray should return the same underlying array so bounds checking should be enough, but the worst case would be unnecessary copy in edge-cases instead of more expensive check in common cases
Although now that I'm thinking about it, I'm not sure how the internal shares_memory works.
If it first does bound check and then exact check it might work just the same, and there would be no reason not to use it instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, but in case they do have overlapping bounds the check can be exponentially big, so I think it makes sense for this case to stop at bound check only (since asaaray probably returned the same memory)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the pickle / unpickle have to change things here? I understand this fixes the issue but its not clear to me why it is required in the first place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Will,
The problem is not exactly with pickle, it's just a quick way to reproduce the problem.
The problem is that the code here attempts to check if two arrays have the same memory (or share memory) and it does so incorrectly - result is arr
See numpy/numpy#24478 for more technical details.
BTW - it seems the resolution for this issue will be to enhance the docs to state that checking identity on two arrays is not a proper way to verify they don't share memory.

Also, I see there is a merge conflict now, so I'll fix it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks for the background info. So shouldn't we be removing the result is arr check altogether as part of this?

As far as the shares_memory versus may_share_memory disccusion, is the main difference when taking a view of an array that reinterprets bytes? I'd probably just prefer shares_memory to start knowing that it is more correct at the cost of more performance; performance can always be improved later (as long as it doesn't introduce a serious regression here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left result is arr since in the most common case asarray just return the same array without modification, so identity check is a pretty cheap validation
The reason I prefered may_share_memory is due to numpy warning: 'If in doubt, use numpy.may_share_memory instead.'
Let me state my reasoning:

  • asarray should either copy or return the same array/memory
  • If the memory is in the same location - it was not copied. bounds check should suffice here since there shouldn't be any views of the array created in this function
  • copy should be less expensive than exhaust check of the shared memory

regarding regression - I'm not sure Pandas has enough tests in place to check the performance regression in this case since it's a pretty unique use case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that makes sense but we use shares_memory throughout our codebase where may_share_memory is little used; if that's something we wanted to change I think it is outside the scope of this PR (may be worth doing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable, I'll change it to shares_memory

add gh comment
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 20, 2023
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Itayazolay
Copy link
Contributor Author

@WillAyd Anything else needed for merging?

@mroeschke mroeschke added this to the 2.2 milestone Oct 29, 2023
@mroeschke mroeschke merged commit a39f783 into pandas-dev:main Oct 29, 2023
33 checks passed
@mroeschke
Copy link
Member

Thanks @Itayazolay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: ensure_string_array may change the array inplace
5 participants