-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix astype str issue 54654 #54687
BUG: Fix astype str issue 54654 #54687
Conversation
on pickle roundtrip astype(str) might change original array even when copy is True
Hey, this PR is ready for review, is there anything else I should be doing? |
pandas/_libs/lib.pyx
Outdated
@@ -775,7 +775,7 @@ cpdef ndarray[object] ensure_string_array( | |||
|
|||
result = np.asarray(arr, dtype="object") | |||
|
|||
if copy and result is arr: | |||
if copy and (result is arr or np.may_share_memory(arr, result)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason for np.may_share_memory vs np.shares_memory? can you add a # GH#54654
below this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docs may_share_memory
does only bounds check for the array, but may return false positives in general.
shares_memory
is more computationally expensive.
In this case, asarray
should return the same underlying array so bounds checking should be enough, but the worst case would be unnecessary copy in edge-cases instead of more expensive check in common cases
Although now that I'm thinking about it, I'm not sure how the internal shares_memory
works.
If it first does bound check and then exact check it might work just the same, and there would be no reason not to use it instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, but in case they do have overlapping bounds the check can be exponentially big, so I think it makes sense for this case to stop at bound check only (since asaaray probably returned the same memory)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the pickle / unpickle have to change things here? I understand this fixes the issue but its not clear to me why it is required in the first place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Will,
The problem is not exactly with pickle, it's just a quick way to reproduce the problem.
The problem is that the code here attempts to check if two arrays have the same memory (or share memory) and it does so incorrectly - result is arr
See numpy/numpy#24478 for more technical details.
BTW - it seems the resolution for this issue will be to enhance the docs to state that checking identity on two arrays is not a proper way to verify they don't share memory.
Also, I see there is a merge conflict now, so I'll fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks for the background info. So shouldn't we be removing the result is arr
check altogether as part of this?
As far as the shares_memory
versus may_share_memory
disccusion, is the main difference when taking a view of an array that reinterprets bytes? I'd probably just prefer shares_memory
to start knowing that it is more correct at the cost of more performance; performance can always be improved later (as long as it doesn't introduce a serious regression here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left result is arr
since in the most common case asarray just return the same array without modification, so identity check is a pretty cheap validation
The reason I prefered may_share_memory
is due to numpy
warning: 'If in doubt, use numpy.may_share_memory
instead.'
Let me state my reasoning:
asarray
should either copy or return the same array/memory- If the memory is in the same location - it was not copied. bounds check should suffice here since there shouldn't be any views of the array created in this function
- copy should be less expensive than exhaust check of the shared memory
regarding regression - I'm not sure Pandas has enough tests in place to check the performance regression in this case since it's a pretty unique use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that makes sense but we use shares_memory
throughout our codebase where may_share_memory
is little used; if that's something we wanted to change I think it is outside the scope of this PR (may be worth doing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable, I'll change it to shares_memory
add gh comment
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@WillAyd Anything else needed for merging? |
Thanks @Itayazolay |
doc/source/whatsnew/v2.2.0.rst
file if fixing a bug or adding a new feature.