-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proof of concept for Copy-on-Write implementation #41878
Proof of concept for Copy-on-Write implementation #41878
Conversation
@@ -4930,7 +4930,7 @@ def rename( | |||
index: Renamer | None = None, | |||
columns: Renamer | None = None, | |||
axis: Axis | None = None, | |||
copy: bool = True, | |||
copy: bool | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copy=None
is what changes DataFrame.rename
to use a shallow copy with CoW instead of doing a full copy for the ArrayManager.
This is eventually passed to self.copy(deep=copy)
, and deep=None
is used to signal a shallow copy for ArrayManager while for now preserving the deep copy for BlockManager.
pandas/core/indexing.py
Outdated
@@ -1869,6 +1869,12 @@ def _setitem_single_column(self, loc: int, value, plane_indexer): | |||
""" | |||
pi = plane_indexer | |||
|
|||
if not getattr(self.obj._mgr, "blocks", False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use explicit isinstance; e.g. at first glance empty blocks would go through here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I should have used hasattr
instead of getattr
here (which would avoid that empty block case).
pandas.core.indexing
currently doesn't import from internals, which I suppose is the reason that I used this implicit check instead of an explicit isinstance (having it as a property on the object, which I think was added in some other (open) PR, might also help for this case)
Does this implementation have the shallow-copies issue discussed here #36195 (comment) ? |
Yes or no, depending on how to interpret the question ;) |
pandas/core/indexing.py
Outdated
@@ -1894,6 +1894,16 @@ def _setitem_single_column(self, loc: int, value, plane_indexer): | |||
""" | |||
pi = plane_indexer | |||
|
|||
if not hasattr(self.obj._mgr, "blocks"): | |||
# ArrayManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on why this is AM-specific?
pandas/core/internals/managers.py
Outdated
@@ -351,8 +351,12 @@ def where(self: T, other, cond, align: bool, errors: str) -> T: | |||
errors=errors, | |||
) | |||
|
|||
def setitem(self: T, indexer, value) -> T: | |||
return self.apply("setitem", indexer=indexer, value=value) | |||
def setitem(self: T, indexer, value, inplace=False) -> T: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this change necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To follow the change that was needed in the ArrayManager (to ensure every mutation of the values happens in the manager), see #41879 for this change as a separated pre-cursor PR
Does writing to the parent trigger a copy in the child? Update: i see in the doc it says it does (Propagating Mutation Forwards), but that doesn't appear to be implemented yet in this branch
Not sure if this is a bug or just not yet implemented in this branch. |
Is there any prospect of making this independent of AM/BM? |
It triggers a copy, but in the current implementation it's always the values that are being modified that get copied. So in this case, it would trigger a copy in the parent (but so still preserving the behaviour that mutations are not propagated).
Yes, I didn't yet check the
Yes, I don't think there is anything in the current implementation that makes it tied to the ArrayManager, and it should be possible to write a similar BlockManager version (it could have the logic of checking for weakref references / triggering a copy if needed on the Blocks. I only expect it to be a bit more complicated, at least if we want to ensure that modifying a single column doesn't necessarily triggers a copy of the full dataframe with consolidated blocks). But, before doing this in practice, I would first want to see 1) more discussion on the actual semantics (and some more explicit agreement on that we want this), 2) more review of the implementation details (using weakrefs etc) to ensure this POC is actually possible (as this would be similar for the BM version) and 3) some discussion on how we would see an upgrade path (what kind of version bump, how do we want to provide future warnings etc) |
Thanks, that makes sense. Will have to give some thought to the implications of this.
Make sense. |
How does this behave w/r/t (also |
was the as_mutable_view idea intended to be part of this? IIUC this would mean that basically nothing is ever "just a view"? |
No, that was part of the original proposal / discussion, but I didn't include it in my updated proposal focusing on Copy-on-Write. I also don't directly see how it could be implemented with reliable semantics across all corner cases (or at least without a complex implementation). For example, let's assume you have a
Indeed. Either you have identical objects ( |
if not hasattr(self.obj._mgr, "blocks"): | ||
# ArrayManager: in this case we cannot rely on getting the column | ||
# as a Series to mutate, but need to operated on the mgr directly | ||
if com.is_null_slice(pi) or com.is_full_slice(pi, len(self.obj)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #44353 (doesn't need to be resolved for this PR, but will make lots of things easier)
@@ -1840,6 +1840,17 @@ def _setitem_single_column(self, loc: int, value, plane_indexer): | |||
""" | |||
pi = plane_indexer | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#42887 would make for a good precursor
for i in range(len(self.arrays)): | ||
if not self._has_no_reference(i): | ||
# if being referenced -> perform Copy-on-Write and clear the reference | ||
self.arrays[i] = self.arrays[i].copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is copying more than is strictly necessary (which may be harmless). to be precise would need to examine the mask
and exclude columns for which not mask[i].any()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's one of the reasons I first wanted to refactor putmask here (#44396). But as long as we are using apply_with_block
here, I am not going to bother looking into such optimization (it would require part of the changes in #44396 anyway to split up the mask like that)
(it's also not a super important optimization to have in a first PR I think)
@jbrockmendel thanks for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so a few high level / general comments
- this is leaking ArrayManager into the indexing at a high level (e.g you have to check this in indexing code), ideally this can be avoided
- would like to break off parts of this that are pure refactorings as precursor PRs (as @jbrockmendel suggested)
- you are hijacking the
deep=None
semantics to do this. we really don't like this keyword anyhow, but either can you add a new value, idk, maybedeep='cow'
(not sure this is better) but maybe more obvious and readable. or I think @jbrockmendel suggested an enum (let's make it internal atm to signal this). - this is pretty convoluted when you have a method like
.reset_index()
that only has aninplace
keyword (and notcopy
). i would say we should either: deprecateinplace
and replace withcopy
first. - this is adding an enormous amount of complexitly to the code (likely hiding lots of cases).. not sure what to do about this, just noting it.
@@ -3967,8 +3967,15 @@ def _set_value( | |||
""" | |||
try: | |||
if takeable: | |||
series = self._ixs(col, axis=1) | |||
series._set_value(index, value, takeable=True) | |||
if isinstance(self._mgr, ArrayManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you not do this in internals? hate leaking ArrayManager semantics here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I added here is actually to call into the internals to do it, instead of the current frame-level methods.
It's only possible for ArrayManager, though (see https://github.com/pandas-dev/pandas/pull/41878/files#r757450838 just below), which is the reason I added a check.
@@ -1840,6 +1840,17 @@ def _setitem_single_column(self, loc: int, value, plane_indexer): | |||
""" | |||
pi = plane_indexer | |||
|
|||
if not hasattr(self.obj._mgr, "blocks"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be an ArrayManager test? why the different semantics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a check to know if it is an ArrayManager, without actually importing it (core/indexing.py currently doesn't import anything from internals).
I think in another PR that might not have been merged, I added an attribute on the DataFrame that returns True/False for this, which might be easier to reuse in multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah we should for sure have 1 canonical way of doing this check so we can grep for all the places we do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> #44676
Thanks for the review!
I don't think that
What do you mean exactly with "convoluted" in this case? For the case the user doesn't specify
I think it's actually quite little code for such a big feature :) (the majority of the diff here are new/adapated tests). Now there are still many cases not covered, so it will of course still grow a bit. |
@jorisvandenbossche closable? |
Yeah since CoW is in 1.5, albeit only for blockmanger, I think we can close this for now until Arraymanager work is picked back up. Closing |
An experiment to implement one of the proposals discussed in #36195, described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit
This PR adds Copy-on-Write (CoW) functionality to the DataFrame/Series when using ArrayManager.
It does this by adding a new
.refs
attribute to theArrayManager
that, if populated, keeps a list ofweakref
references (one per columns, solen(mgr.arrays) == len(mgr.refs)
to the array it is viewing.This ensures that if we are modifying an array of a child manager, we can check if it is referencing (viewing) another array, and if needed do a copy on write. And also if we are modifying an array of a parent manager, we can check if that array is being referenced by another array and if needed do a copy on write in this parent frame. (of course, a manager can both be parent and child at the same time, so those two checks always happen both)
A very brief summary of the behaviour you get (longer description at #36195 (comment)):
reset_index
andrename
to test, needs to be expanded to other methods)I added a
test_copy_on_write.py
which already tests a set of cases (which of course needs to be expanded), going through that file should give an idea of the kind of behaviours (and how it changes compared to current master/BlockManager). Link to that file in the diff: click here(I only added new, targeted tests in that file, I didn't yet start updating existing tests, as I imagine there will be quite a lot)
This is a proof-of-concept PR, so the feedback I am looking for / what I want to get out of it:
weakrefs
(it's quite simple at the moment, but not too simple? Will this be robust enough?)And to be clear, not yet everything is covered (there are some
# TODO
's inarray_manager.py
, but only fixing those while also adding tests that require it)TODO:
cc @pandas-dev/pandas-core