-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Merge copy-on-write
feature branch into branch-23.04
#12619
Conversation
Initial copy-on-write implementation
* get_ptr & _array_view refactor * Apply suggestions from code review Co-authored-by: Lawrence Mitchell <[email protected]> * address reviews * Apply suggestions from code review Co-authored-by: Mads R. B. Kristensen <[email protected]> * drop internal_write * add docstring * Apply suggestions from code review Co-authored-by: Lawrence Mitchell <[email protected]> * address reviews * address reviews * add locks * Apply suggestions from code review Co-authored-by: Mads R. B. Kristensen <[email protected]> * Apply suggestions from code review Co-authored-by: Vyas Ramasubramani <[email protected]> * make mode a required key-arg * rename to _readonly_proxy_cai_obj * Update python/cudf/cudf/core/column/column.py Co-authored-by: Lawrence Mitchell <[email protected]> * revert * fix * Apply suggestions from code review Co-authored-by: Lawrence Mitchell <[email protected]> Co-authored-by: Mads R. B. Kristensen <[email protected]> Co-authored-by: Vyas Ramasubramani <[email protected]>
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-23.04 #12619 +/- ##
===============================================
Coverage ? 85.85%
===============================================
Files ? 159
Lines ? 25329
Branches ? 0
===============================================
Hits ? 21745
Misses ? 3584
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
benchmark bot, please test this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we copy the PR description from #11718 since this is now the PR that will be merged and show up in the changelog? Of course, in case any updates are needed given the various changes we have made since that PR, please make those changes in this description.
This review focused exclusively on the docs (IMO very important to describe COW well in our documentation). Will follow up with a code review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK done reviewing code as well. This is very close!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc comments.
|
||
```bash | ||
export CUDF_COPY_ON_WRITE="1" python -c "import cudf" | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I switch between CoW on and CoW off in the same run, or is this a one-time option I should set before I actually create and manipulate any cudf options? I suspect it is the latter. If so, we should call that out here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the former. Pandas allows that too:
In [1]: import pandas as pd
In [2]: pd.options.mode.copy_on_write = True
In [3]: s = pd.Series([1, 2, 1, 2])
In [4]: pd.options.mode.copy_on_write = True
In [5]: pd.options.mode.copy_on_write = False
In [6]: s.head(2)
Out[6]:
0 1
1 2
dtype: int64
Since all of our buffer creation calls go through, the as_buffer
constructor we are able to support COW on and off in the same run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor code comments, looking good!
Co-authored-by: Vyas Ramasubramani <[email protected]> Co-authored-by: Lawrence Mitchell <[email protected]>
Co-authored-by: Lawrence Mitchell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is really good and am happy to hit go. Can you just check that all the comments are resolved before merge? (I provided a few very minor ones here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some very minor comments, otherwise LGTM!
@property | ||
def __cuda_array_interface__(self) -> dict: | ||
# Unlink if there are any weak references. | ||
self._unlink_shared_buffers() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@galipremsagar unless my GH view isn't updated properly, it looks like the comments were merged but the unlink call is still happening above the comment. Not sure if that was intentional, I would still recommend moving that.
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
/merge |
Description
This PR contains changes from #11718 primarily that will enable Copy on write feature in cudf.
This PR introduces
copy-on-write
. As the name suggests whencopy-on-write
is enabled, when there is a shallow copy of a column made, both the columns share the same memory and only when there is a write operation being performed on either the parent or any of it's copies a true copy will be triggered. Copy-on-write(c-o-w
) can be enabled in two ways:CUDF_COPY_ON_WRITE
environment variable to1
will enablec-o-w
, unsetting will disablec-o-w
.copy_on_write
option incudf
options by doingcudf.set_option("copy_on_write", True)
to enable it andcudf.set_option("copy_on_write", False)
to disable it.Note: Copy-on-write is not being enabled by default, it is being introduced as an opt-in.
A valid performance comparison can be done only with
copy_on_write=OFF
+.copy(deep=True)
vscopy_on_write=ON
+.copy(deep=False)
:Stats around the performance and memory gains :
copy_on_write=OFF
+.copy(deep=True)
vscopy_on_write=ON
+.copy(deep=False)
).copy-on-write
is disabled,string
,list
&struct
columns deep copies are now 99% faster