-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: New copy / view semantics using Copy-on-Write #46958
API: New copy / view semantics using Copy-on-Write #46958
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
for blk in self.blocks: | ||
nb = blk.getitem_block_index(slobj) | ||
nbs.append(nb) | ||
nrefs.append(weakref.ref(blk)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason to make the reference to the blk instead of its array blk.values? will that make a difference in the cases where blk.values is re-set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I don't remember if there was a specific technical reason to do so, but I think it seemed the easier option (since this is all handled at the BlockManager level). When keeping those references / when checking for references, it would otherwise be an additional level of indirection to check the blk.values instead of blk itself.
I suppose this would generally be the same, but indeed except in places where blk.values
is re-set in place. I actually added one such case in this PR in Block.set_values
(which is called in BlockManager.iset
). In that case I should probably rather copy the Block and replace the block in the BM with the copied block, instead of copying the values in the existing block.
Do you know of other places where we currently we re-set blk.values?
@jorisvandenbossche im questioning the accuracy of the "nothing is 'just a view'" heuristic. Do any of the following cases break that:
|
Do you mean in the current state of the PR? Or that those cases in general should/would break that in your opinion? The goal is that none of those examples would break that general rule. But at the moment, this PR only added the CoW mechanism to indexing and explicitly to a few methods (
The other ones (astype, transpose) are good cases I still need to test and/or implement. |
Just in general, wanted to make sure that the question was written down in one of the relevant issues/PRs. Thanks for taking a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were you planning on some docs + whatsnew in a separate PR?
There is one new test from a few days ago that still needs to be updated for CoW (the one failing build), will do that tomorrow.
Yes, would prefer to do that in a separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good too me, especially based on the testing. Would you still like to include this for 1.5?
Yes, I would prefer that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
are there bechmarks on CoW for typical operations? eg pick a chain of ops that are now view returning and compare vs existing
what is the magnitude of the change?
Thanks @jorisvandenbossche! Maybe could reference or document the benchmarks in the follow-up documentation PR |
way to go! |
* API: New copy / view semantics using Copy-on-Write * fix more tests * Handle CoW in BM.iset * Handle CoW in xs * add bunch of todo comments and usage warnings * Insert None ref in BM.insert * Ensure to not assume copy_on_write is set in case of ArrayManager * Handle refs in BM._combine / test CoW in select_dtypes * handle get_numeric_data for single block manager * fix test_internals (get_numeric_data now uses CoW) * handle refs in consolidation * fix deep=None for ArrayManager * update copy/view tests from other PR * clean-up fast_xs workarounds now it returns a SingleBlockManager * tracks refs in to_frame * fixup after updata main and column_setitem + iloc inplace setitem changes (pandas-devgh-45333) * fix inplace fillna + fixup new tests * address comments + update some todo comments * Update pandas/core/internals/managers.py Co-authored-by: Matthew Roeschke <[email protected]> * fixup linting * update new copy_view tests to use get_array helper * add comment to setitem * switch default to False, ensure CoW copies only happen when enabled + add additional test build with CoW * update type annotations * Fix stata issue to avoid SettingWithCopyWarning in read_stata * update type + option comment * fixup new rename test Co-authored-by: Matthew Roeschke <[email protected]>
* API: New copy / view semantics using Copy-on-Write * fix more tests * Handle CoW in BM.iset * Handle CoW in xs * add bunch of todo comments and usage warnings * Insert None ref in BM.insert * Ensure to not assume copy_on_write is set in case of ArrayManager * Handle refs in BM._combine / test CoW in select_dtypes * handle get_numeric_data for single block manager * fix test_internals (get_numeric_data now uses CoW) * handle refs in consolidation * fix deep=None for ArrayManager * update copy/view tests from other PR * clean-up fast_xs workarounds now it returns a SingleBlockManager * tracks refs in to_frame * fixup after updata main and column_setitem + iloc inplace setitem changes (pandas-devgh-45333) * fix inplace fillna + fixup new tests * address comments + update some todo comments * Update pandas/core/internals/managers.py Co-authored-by: Matthew Roeschke <[email protected]> * fixup linting * update new copy_view tests to use get_array helper * add comment to setitem * switch default to False, ensure CoW copies only happen when enabled + add additional test build with CoW * update type annotations * Fix stata issue to avoid SettingWithCopyWarning in read_stata * update type + option comment * fixup new rename test Co-authored-by: Matthew Roeschke <[email protected]>
This is a port of the proof of concept using the ArrayManager in #41878 to the default BlockManager.
This PR is a start to implement the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in #36195
A very brief summary of the behaviour you get:
reset_index
andrename
, needs to be expanded to other methods)Implementation approach
This PR adds Copy-on-Write (CoW) functionality to the DataFrame/Series at the BlockManager level. It does this by adding a new
.refs
attribute to theBlockManager
that, if populated, keeps a list ofweakref
references to the blocks it shares data with (so for the BlockManager, this reference tracking is done per block, solen(mgr.blocks) == len(mgr.refs)
).This ensures that if we are modifying a block of a child manager, we can check if it is referencing (viewing) another block, and if needed do a copy on write. And also if we are modifying a block of a parent manager, we can check if that block is being referenced by another manager and if needed do a copy on write in this parent frame. (of course, a manager can both be parent and child at the same time, so those two checks always happen both)
How to enable this new behaviour?
Currently this PR simply enabled the new behaviour with CoW, but of course that will need to be turned off before merging (which also means that some of the changes will need to put behind a feature flag. I only did that now in some places).
I think that ideally, (on the short term) users have a way to enable the future behaviour (eg using an option), but also have a way to enable additional warnings.
I already started adding an option, currently the boolean flag
options.mode.copy_on_write=True|False
:Some notes:
TODO(CoW)
in the code), although the majority for indexing / setitem is done..values
). Given the size of this PR already, those can probably be done in separate PRs?I will also pull out some of the changes in separate PRs (eg the new test file could already be discussed/reviewed separately (-> #46979), and the
column_setitem
is maybe also something that could be done as pre-cursor(-> #47074))