[REVIEW] Merge copy-on-write feature branch into branch-23.04 #12619

Merged
merged 54 commits into from
Feb 16, 2023
Changes from 10 commits
bd72a17
[REVIEW] Copy on write implementation (#11718)
galipremsagar Jan 13, 2023
2f529d3
Merge remote-tracking branch 'upstream/branch-23.02' into HEAD
galipremsagar Jan 18, 2023
bd6933f
merge
galipremsagar Jan 19, 2023
1a758e6
Merge remote-tracking branch 'upstream/branch-23.02' into copy-on-write
galipremsagar Jan 24, 2023
fa094ed
merge
galipremsagar Jan 26, 2023
a857ad9
[REVIEW] Update `copy-on-write` with `branch-23.02` changes (#12556)
galipremsagar Jan 26, 2023
173e749
Merge remote-tracking branch 'upstream/branch-23.02' into copy-on-write
galipremsagar Jan 26, 2023
5b54519
update docs
galipremsagar Jan 26, 2023
d46c3a8
simplify docstring
galipremsagar Jan 26, 2023
ae95e48
revert redundant changes
galipremsagar Jan 26, 2023
ee78be8
Apply suggestions from code review
galipremsagar Jan 27, 2023
a02a150
Apply suggestions from code review
galipremsagar Jan 27, 2023
7c84343
style
galipremsagar Jan 27, 2023
1afd167
Merge remote-tracking branch 'upstream/branch-23.02' into copy-on-write
galipremsagar Jan 27, 2023
d3fd0f3
Merge remote-tracking branch 'upstream/branch-23.02' into copy-on-write
galipremsagar Jan 28, 2023
527a61f
Merge remote-tracking branch 'upstream/branch-23.02' into copy-on-write
galipremsagar Jan 28, 2023
b69c34d
address reviews in code
galipremsagar Jan 30, 2023
761c328
Merge remote-tracking branch 'upstream/branch-23.02' into copy-on-write
galipremsagar Jan 30, 2023
6717caf
Merge remote-tracking branch 'upstream/branch-23.04' into copy-on-write
galipremsagar Jan 30, 2023
60d009f
address reviews in docs
galipremsagar Jan 31, 2023
39e59c9
add coverage
galipremsagar Jan 31, 2023
be011b2
Merge remote-tracking branch 'upstream/branch-23.04' into copy-on-write
galipremsagar Jan 31, 2023
0cdec03
cleanup after runs
galipremsagar Jan 31, 2023
9791014
Merge branch 'branch-23.04' into copy-on-write
galipremsagar Feb 3, 2023
7cd8150
Apply suggestions from code review
galipremsagar Feb 6, 2023
dde8d3a
Merge remote-tracking branch 'upstream/branch-23.04' into copy-on-write
galipremsagar Feb 6, 2023
2ac5d43
Merge remote-tracking branch 'upstream/branch-23.04' into copy-on-write
galipremsagar Feb 8, 2023
f474e79
Update docs/cudf/source/user_guide/copy-on-write.md
galipremsagar Feb 8, 2023
8506339
Merge remote-tracking branch 'upstream/copy-on-write' into copy-on-write
galipremsagar Feb 8, 2023
dd1f1fa
Merge remote-tracking branch 'upstream/branch-23.04' into copy-on-write
galipremsagar Feb 8, 2023
5a8ad61
removed advantages title
galipremsagar Feb 8, 2023
417883b
address reviews
galipremsagar Feb 8, 2023
5470ee9
Merge branch 'branch-23.04' into copy-on-write
galipremsagar Feb 8, 2023
99967fc
Apply suggestions from code review
galipremsagar Feb 10, 2023
45e1fd1
address reviews
galipremsagar Feb 10, 2023
9077049
Merge branch 'copy-on-write' of https://github.com/rapidsai/cudf into…
galipremsagar Feb 10, 2023
8e7240c
Merge remote-tracking branch 'upstream/branch-23.04' into copy-on-write
galipremsagar Feb 11, 2023
2e9eac7
add comments
galipremsagar Feb 11, 2023
26b1d87
address reviews
galipremsagar Feb 13, 2023
a13040a
Merge remote-tracking branch 'upstream/branch-23.04' into copy-on-write
galipremsagar Feb 13, 2023
36ecf22
drop _readonly_proxy_cai_obj
galipremsagar Feb 13, 2023
c3977b7
cleanup
galipremsagar Feb 13, 2023
33f6d3b
Update python/cudf/cudf/core/buffer/cow_buffer.py
galipremsagar Feb 13, 2023
0412db6
Update python/cudf/cudf/tests/test_copying.py
galipremsagar Feb 13, 2023
a74bb48
Update python/cudf/cudf/tests/test_copying.py
galipremsagar Feb 13, 2023
1a2ec7b
Update python/cudf/cudf/tests/test_copying.py
galipremsagar Feb 13, 2023
3c5ac1a
use contextmanager
galipremsagar Feb 13, 2023
4ba71d3
add comment
galipremsagar Feb 13, 2023
6edd003
add comment
galipremsagar Feb 13, 2023
96401fa
Merge branch 'branch-23.04' into copy-on-write
galipremsagar Feb 13, 2023
dbaf0c3
Merge branch 'branch-23.04' into copy-on-write
galipremsagar Feb 16, 2023
518ea66
update docs
galipremsagar Feb 16, 2023
5ed99ad
typo
galipremsagar Feb 16, 2023
6a96647
Merge branch 'branch-23.04' into copy-on-write
galipremsagar Feb 16, 2023
177 changes: 177 additions & 0 deletions docs/cudf/source/developer_guide/library_design.md
@@ -314,3 +314,180 @@ The pandas API also includes a number of helper objects, such as `GroupBy`, `Rol
cuDF implements corresponding objects with the same APIs.
Internally, these objects typically interact with cuDF objects at the Frame layer via composition.
However, for performance reasons they frequently access internal attributes and methods of `Frame` and its subclasses.


## Copy-on-write


Copy-on-write (COW) is designed to reduce memory footprint on GPUs. With this feature, a shallow copy (`.copy(deep=False)`) is only
materialized when there is a write operation on a column. We recommend reading
the public usage documentation [here](copy-on-write-user-doc) before reading through the internals
below.

The core copy-on-write implementation relies on the `CopyOnWriteBuffer` class. This class stores the pointer to the device memory and its size.
Using `CopyOnWriteBuffer._ptr`, we generate [weak references](https://docs.python.org/3/library/weakref.html) to `CopyOnWriteBuffer` objects and store them in `CopyOnWriteBuffer._instances`.
This is a mapping from `ptr` keys to `WeakSet`s containing references to `CopyOnWriteBuffer` objects. This
means that all newly created `CopyOnWriteBuffer`s map to the same key in `CopyOnWriteBuffer._instances` if they have the same `._ptr`,
i.e., if they all point to the same device memory.
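
The bookkeeping described above can be illustrated with a minimal pure-Python sketch. `ToyBuffer` and its `shared` property are hypothetical stand-ins, not the actual `CopyOnWriteBuffer` code; the only assumption taken from the text is that the registry maps a pointer to a `WeakSet` of buffers.

```python
import weakref
from collections import defaultdict


class ToyBuffer:
    """Hypothetical stand-in for a buffer that tracks who shares its memory."""

    # Maps a (pretend) device pointer to the live buffers that point at it.
    _instances = defaultdict(weakref.WeakSet)

    def __init__(self, ptr, size):
        self._ptr = ptr
        self._size = size
        # Only a weak reference is stored, so the registry never keeps
        # a buffer alive on its own.
        ToyBuffer._instances[ptr].add(self)

    @property
    def shared(self):
        # More than one live buffer for this pointer means shared memory.
        return len(ToyBuffer._instances[self._ptr]) > 1


a = ToyBuffer(ptr=0x1000, size=32)
b = ToyBuffer(ptr=0x1000, size=32)  # same pointer, same registry entry
c = ToyBuffer(ptr=0x2000, size=32)  # different pointer, separate entry

print(a.shared, b.shared, c.shared)

del b  # dropping the last strong reference removes it from the WeakSet
print(a.shared)
```

In CPython, where an object is freed as soon as its reference count reaches zero, `a` is reported as shared while `b` exists and as unshared once `b` has been deleted.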

When the cudf option `"copy_on_write"` is `True`, `as_buffer` will always return a `CopyOnWriteBuffer`. This class contains all the
mechanisms to enable copy-on-write for all buffers. When a `CopyOnWriteBuffer` is created, its weakref is generated and added to the `WeakSet`, which is in turn stored in `CopyOnWriteBuffer._instances`. This later serves as an indication of whether or not to make a copy
when a write operation is performed on a `Column` (see below).


### Eager copies when exposing to third-party libraries

If a `Column`/`CopyOnWriteBuffer` is exposed to a third-party library via `__cuda_array_interface__`, we are no longer able to track whether or not the buffer has been modified without introspection. Hence, whenever
someone accesses data through `__cuda_array_interface__`, we eagerly trigger the copy by calling
`_unlink_shared_buffers`, which ensures that a true copy of the underlying device data is made and
unlinks the buffer from any shared "weak" references. Any future shallow-copy request must also trigger a true physical copy, since we cannot track the lifetime of the third-party object. To handle this, we also mark the `Column`/`CopyOnWriteBuffer` with
`obj._zero_copied=True`, indicating that future shallow-copy requests will produce a true physical copy
rather than a copy-on-write shallow copy with weak references.
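
The following is a rough sketch of that behaviour, using a `bytearray` in place of device memory. `ToyBuffer` and its `expose` method are hypothetical names for illustration; only `_unlink_shared_buffers` and `_zero_copied` are borrowed from the text, and their bodies here are assumptions rather than the real implementation.

```python
import weakref
from collections import defaultdict


class ToyBuffer:
    """Hypothetical model of a copy-on-write buffer backed by a bytearray."""

    _instances = defaultdict(weakref.WeakSet)

    def __init__(self, data):
        self._data = data          # stands in for device memory
        self._zero_copied = False  # set once a third party may hold the memory
        ToyBuffer._instances[id(self._data)].add(self)

    def _unlink_shared_buffers(self):
        # If other buffers share the memory, take a private copy first.
        key = id(self._data)
        if len(ToyBuffer._instances[key]) > 1:
            ToyBuffer._instances[key].discard(self)
            self._data = bytearray(self._data)
            ToyBuffer._instances[id(self._data)].add(self)

    def expose(self):
        # Models access through an external interface: copy eagerly, then flag.
        self._unlink_shared_buffers()
        self._zero_copied = True
        return self._data

    def copy(self, deep=False):
        # After exposure, a shallow copy can no longer be tracked safely,
        # so it is upgraded to a true copy.
        if deep or self._zero_copied:
            return ToyBuffer(bytearray(self._data))
        return ToyBuffer(self._data)


original = ToyBuffer(bytearray(b"\x01\x02\x03"))
shallow = original.copy(deep=False)
print(shallow._data is original._data)  # memory is shared before exposure

external_view = original.expose()       # the eager copy happens here
print(shallow._data is original._data)

later = original.copy(deep=False)       # upgraded to a true copy
print(later._data is original._data)
```

The design choice this models is conservative: once memory may be held by code that cannot be tracked, sharing is given up in exchange for safety.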

### How to obtain a read-only object?

A read-only object can be quite useful for operations that will not
mutate the data. It can be obtained by calling the `._readonly_proxy_cai_obj`
API, which returns a proxy object that implements `__cuda_array_interface__`
and does not trigger a deep copy even if the `CopyOnWriteBuffer`
has weak references. We recommend using this API only when
the objects/arrays created from the proxy object are cleaned up during
the execution of the developer's code. We currently use this API for device-to-host
copies, such as in `ColumnBase.data_array_view(mode="read")`, which is used for `Column.values_host`.
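
To make the idea of a proxy concrete, here is a small sketch of an object that only implements `__cuda_array_interface__`. `ReadOnlyProxy` is a hypothetical class, and marking the memory as read-only through the second element of `data` is an assumption based on the CUDA Array Interface layout, not a statement about what `_readonly_proxy_cai_obj` actually returns.

```python
class ReadOnlyProxy:
    """Hypothetical proxy that describes memory without owning or copying it."""

    def __init__(self, ptr, size):
        self._ptr = ptr    # pretend device pointer
        self._size = size  # size in bytes

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": (self._size,),
            "typestr": "|u1",            # unsigned bytes
            "data": (self._ptr, True),   # (pointer, read-only flag)
            "strides": None,             # contiguous
            "version": 3,
        }


proxy = ReadOnlyProxy(ptr=0x1000, size=16)
iface = proxy.__cuda_array_interface__
print(iface["data"])
```

A consumer that honours the read-only flag would not write through the pointer, which is why no defensive copy is needed in this model.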

Notes:
1. Weak references are implemented only for fixed-width data types, as these are the only column
types that can be mutated in place.
2. Deep copies of variable-width data types return shallow copies of the columns, because these
types don't support real in-place mutation of the data. We only mimic an in-place operation
using `_mimic_inplace`.


### Examples

When copy-on-write is enabled, taking a shallow copy of a `Series` or a `DataFrame` does not
eagerly create a copy of the data. Instead, it produces a view that will be lazily
copied when a write operation is performed on any of its copies.

Let's create a series:

```python
>>> import cudf
>>> cudf.set_option("copy_on_write", True)
>>> s1 = cudf.Series([1, 2, 3, 4])
```

Make a copy of `s1`:
```python
>>> s2 = s1.copy(deep=False)
```

Make another copy, but of `s2`:
```python
>>> s3 = s2.copy(deep=False)
```

Viewing the data and memory addresses shows that they all point to the same device memory:
```python
>>> s1
0 1
1 2
2 3
3 4
dtype: int64
>>> s2
0 1
1 2
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64

>>> s1.data._ptr
139796315897856
>>> s2.data._ptr
139796315897856
>>> s3.data._ptr
139796315897856
```

Now, when we perform a write operation on one of them, say on `s2`, a new copy is created
for `s2` on device and then modified:

```python
>>> s2[0:2] = 10
>>> s2
0 10
1 10
2 3
3 4
dtype: int64
>>> s1
0 1
1 2
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64
```

If we inspect the memory address of the data, `s1` and `s3` still share the same address but `s2` has a new one:

```python
>>> s1.data._ptr
139796315897856
>>> s3.data._ptr
139796315897856
>>> s2.data._ptr
139796315899392
```

Now, performing a write operation on `s1` will trigger a new copy in device memory because
`s3` still holds a weak reference to the same memory:

```python
>>> s1[0:2] = 11
>>> s1
0 11
1 11
2 3
3 4
dtype: int64
>>> s2
0 10
1 10
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64
```

If we inspect the memory address of the data, the addresses of `s2` and `s3` remain unchanged, but `s1`'s memory address has changed because a copy was made during the write:

```python
>>> s2.data._ptr
139796315899392
>>> s3.data._ptr
139796315897856
>>> s1.data._ptr
139796315879723
```
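
The sequence above can be mimicked without a GPU. The sketch below is a toy model with a Python list standing in for device memory; `ToyColumn` and `_sharers` are hypothetical names, and the copy decision is an assumption about the mechanism rather than cudf's actual code.

```python
import weakref
from collections import defaultdict


class ToyColumn:
    """Hypothetical copy-on-write container; a list stands in for device memory."""

    # Maps the identity of a storage list to the columns that point at it.
    _sharers = defaultdict(weakref.WeakSet)

    def __init__(self, data):
        self._attach(data)

    def _attach(self, data):
        self._data = data
        ToyColumn._sharers[id(data)].add(self)

    def copy(self, deep=False):
        # A shallow copy reuses the same storage; a deep copy duplicates it.
        return ToyColumn(list(self._data) if deep else self._data)

    def __setitem__(self, key, value):
        sharers = ToyColumn._sharers[id(self._data)]
        if len(sharers) > 1:
            # Someone else still points at this storage: copy before writing.
            sharers.discard(self)
            self._attach(list(self._data))
        self._data[key] = value


s1 = ToyColumn([1, 2, 3, 4])
s2 = s1.copy(deep=False)
s3 = s2.copy(deep=False)
shared_storage = s1._data

s2[0] = 10  # s2 gets private storage; s1 and s3 still share
s1[0] = 11  # s1 also copies, because s3 still shares its storage

print(s1._data, s2._data, s3._data)
```

After both writes, `s3` is the only remaining owner of the original storage, so a later write to `s3` in this model would modify it in place without copying.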

The cudf copy-on-write implementation is motivated by the pandas copy-on-write proposal:
1. [Google doc](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.iexejdstiz8u)
2. [Github issue](https://github.com/pandas-dev/pandas/issues/36195)
169 changes: 169 additions & 0 deletions docs/cudf/source/user_guide/copy-on-write.md
@@ -0,0 +1,169 @@
(copy-on-write-user-doc)=

# Copy-on-write

Copy-on-write reduces GPU memory usage when copies (`.copy(deep=False)`) of a column
are made.

| | Copy-on-Write enabled | Copy-on-Write disabled (default) |
|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `.copy(deep=True)` | A true copy is made and changes don't propagate to the original object. | A true copy is made and changes don't propagate to the original object. |
| `.copy(deep=False)` | Memory is shared between the two objects, but any write operation on one object will trigger a true physical copy before the write is performed. Hence changes will not propagate to the original object. | Memory is shared between the two objects and changes performed on one will propagate to the other object. |

## How to enable it

i. Use `cudf.set_option`:

```python
>>> import cudf
>>> cudf.set_option("copy_on_write", True)
```

ii. Set the environment variable ``CUDF_COPY_ON_WRITE`` to ``1`` prior to the
launch of the Python interpreter:

```bash
CUDF_COPY_ON_WRITE="1" python -c "import cudf"
```
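
When the interpreter is launched from another Python program rather than from a shell, the same effect can be sketched with the standard library. The child command below only echoes the variable so that the example runs without cudf installed; a real program would import cudf in the child instead.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child only.
env = dict(os.environ, CUDF_COPY_ON_WRITE="1")

# The child prints the value it sees at startup.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDF_COPY_ON_WRITE'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print(child_value)
```

Passing `env` to the child keeps the parent process's own environment unchanged.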
Contributor


Can I switch between CoW on and CoW off in the same run, or is this a one-time option I should set before I actually create and manipulate any cudf options? I suspect it is the latter. If so, we should call that out here.

Contributor Author


It's the former. Pandas allows that too:

In [1]: import pandas as pd

In [2]: pd.options.mode.copy_on_write = True

In [3]: s = pd.Series([1, 2, 1, 2])

In [4]: pd.options.mode.copy_on_write = True

In [5]: pd.options.mode.copy_on_write = False

In [6]: s.head(2)
Out[6]: 
0    1
1    2
dtype: int64

Since all of our buffer creation calls go through the `as_buffer` constructor, we are able to support COW on and off in the same run.
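
A rough sketch of why a single creation point makes this possible is shown below. The option store, the two classes, and this toy `as_buffer` are hypothetical and only named after the real function for readability; they are not cudf's implementation.

```python
_options = {"copy_on_write": False}  # hypothetical global option store


class PlainBuffer:
    def __init__(self, data):
        self.data = data


class CowBuffer(PlainBuffer):
    pass


def as_buffer(data):
    # The option is consulted at creation time, so each buffer keeps the
    # behaviour that was active when it was made.
    cls = CowBuffer if _options["copy_on_write"] else PlainBuffer
    return cls(data)


_options["copy_on_write"] = True
first = as_buffer(bytearray(4))

_options["copy_on_write"] = False
second = as_buffer(bytearray(4))

print(type(first).__name__, type(second).__name__)
```

In this model, flipping the option does not retroactively change buffers that already exist; it only affects buffers created afterwards.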



## Making copies

There are no additional changes required in the code to make use of copy-on-write.

```python
>>> series = cudf.Series([1, 2, 3, 4])
```

Performing a shallow copy will create a new Series object pointing to the
same underlying device memory:

```python
>>> copied_series = series.copy(deep=False)
>>> series
0 1
1 2
2 3
3 4
dtype: int64
>>> copied_series
0 1
1 2
2 3
3 4
dtype: int64
```

When a write operation is performed on either ``series`` or
``copied_series``, a true physical copy of the data is created:

```python
>>> series[0:2] = 10
>>> series
0 10
1 10
2 3
3 4
dtype: int64
>>> copied_series
0 1
1 2
2 3
3 4
dtype: int64
```


## Notes

When copy-on-write is enabled, there is no concept of views, i.e., modifying a view created inside cudf will not modify
the original object it was viewing. Instead, a separate copy is created and then modified.

## Advantages

1. With the concept of views going away, every object is a copy of its original object. This brings consistency across operations and brings cudf closer to parity with
pandas. The following is one such inconsistency:

```python

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 10
1 2
2 3
3 4
4 5
dtype: int64

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64
```

The above inconsistency is resolved when copy-on-write is enabled:

```python
>>> import pandas as pd
>>> pd.set_option("mode.copy_on_write", True)
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64


>>> import cudf
>>> cudf.set_option("copy_on_write", True)
>>> s = cudf.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64
```
2. Copy-on-write resolves numerous other inconsistencies. Read more about them [here](https://phofl.github.io/cow-introduction.html).


## How to disable it


Copy-on-write can be disabled by setting the ``copy_on_write`` cudf option to ``False``:

```python
>>> cudf.set_option("copy_on_write", False)
```
1 change: 1 addition & 0 deletions docs/cudf/source/user_guide/index.md
@@ -13,4 +13,5 @@ guide-to-udfs
cupy-interop
options
PandasCompat
copy-on-write
```