[REVIEW] Merge copy-on-write feature branch into branch-23.04 #12619

Merged
merged 54 commits into from
Feb 16, 2023

Contributor

@galipremsagar galipremsagar commented Jan 26, 2023

Description

This PR primarily contains the changes from #11718, which enable the copy-on-write feature in cudf.

This PR introduces copy-on-write. When copy-on-write is enabled, a shallow copy of a column shares memory with the original column, and a true copy is triggered only when a write operation is performed on either the parent or any of its copies. Copy-on-write (c-o-w) can be enabled in two ways:

  1. Setting the CUDF_COPY_ON_WRITE environment variable to 1 enables c-o-w; unsetting it disables c-o-w.
  2. Setting the copy_on_write option in cudf options: cudf.set_option("copy_on_write", True) enables it and cudf.set_option("copy_on_write", False) disables it.

Note: Copy-on-write is not enabled by default; it is being introduced as an opt-in feature.
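
The sharing behaviour described above can be illustrated with a small pure-Python model. This is only a sketch of the idea, not cudf's implementation: `ToyCowBuffer` and all of its methods are hypothetical names, and a `bytearray` stands in for GPU memory.

```python
import weakref


class ToyCowBuffer:
    """Toy copy-on-write buffer; a bytearray stands in for device memory."""

    def __init__(self, data: bytes):
        self._data = bytearray(data)
        # Buffers currently sharing self._data, tracked by weak reference.
        self._peers = weakref.WeakSet([self])

    def shallow_copy(self) -> "ToyCowBuffer":
        # No bytes are copied: the new buffer points at the same storage.
        other = ToyCowBuffer.__new__(ToyCowBuffer)
        other._data = self._data
        other._peers = self._peers
        self._peers.add(other)
        return other

    def write(self, index: int, value: int) -> None:
        # A write is the only point where the deferred copy may be needed.
        if len(self._peers) > 1:
            self._peers.discard(self)
            self._data = bytearray(self._data)  # the true copy happens here
            self._peers = weakref.WeakSet([self])
        self._data[index] = value

    def shares_memory_with(self, other: "ToyCowBuffer") -> bool:
        return self._data is other._data


parent = ToyCowBuffer(bytes([0, 1, 2, 3]))
child = parent.shallow_copy()
print(parent.shares_memory_with(child))   # storage is shared before any write
parent.write(2, 99)
print(parent.shares_memory_with(child))   # the write gave parent its own storage
print(list(parent._data), list(child._data))
```

Because the peers are held weakly in this sketch, a copy that has been garbage-collected no longer counts as a sharer, so a later write to the survivor proceeds in place without copying.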

A valid performance comparison can only be made between copy_on_write=OFF + .copy(deep=True) and copy_on_write=ON + .copy(deep=False):

In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000))

# branch-23.02 : 1209MiB
# This-PR : 1209MiB

In [3]: s_copy = s.copy(deep=True) #branch-23.02
In [3]: s_copy = s.copy(deep=False) #This-PR

# branch-23.02 : 1973MiB
# This-PR : 1209MiB

In [4]: s
Out[4]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [5]: s_copy
Out[5]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [6]: s[2] = 10001

# branch-23.02 : 3121MiB
# This-PR : 3121MiB

In [7]: s
Out[7]: 
0                  0
1                  1
2              10001
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [8]: s_copy
Out[8]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64
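
As a rough sanity check on the memory figures in the session above, the size of one 100,000,000-row int64 column can be computed directly. This is back-of-the-envelope arithmetic only; it assumes the reported MiB values are total GPU memory in use, which the description does not state explicitly.

```python
N_ROWS = 100_000_000        # rows in the Series from the session above
BYTES_PER_INT64 = 8
MIB = 1024 * 1024

# Size of a single int64 column of that length.
column_mib = N_ROWS * BYTES_PER_INT64 / MIB

# Growth reported on branch-23.02 after s.copy(deep=True).
observed_delta_mib = 1973 - 1209

print(f"one int64 column: {column_mib:.1f} MiB")
print(f"reported growth after deep copy: {observed_delta_mib} MiB")
```

One column is about 763 MiB, which lines up with the 764 MiB growth reported for the deep copy on branch-23.02, while the shallow copy under copy-on-write shows no growth at that step.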

Stats on the performance and memory gains:

  • New copies add no GPU memory overhead, i.e., users save 2x, 5x, 10x, ... 20x the memory that making 2, 5, 10, ... 20 deep copies would otherwise use. So the more you copy, the more you save 😉 (as long as you don't write to all of them).
  • Copying times are cut by 99% for all dtypes when copy-on-write is enabled (copy_on_write=OFF + .copy(deep=True) vs copy_on_write=ON + .copy(deep=False)).
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a': range(0, 1000000)})

In [3]: df = cudf.DataFrame({'a': range(0, 100000000)})

In [4]: df['b'] = df.a.astype('str')

# GPU memory usage
# branch-23.02 : 2345MiB
# This-PR : 2345MiB

In [5]: df
Out[5]: 
                 a         b
0                0         0
1                1         1
2                2         2
3                3         3
4                4         4
...            ...       ...
99999995  99999995  99999995
99999996  99999996  99999996
99999997  99999997  99999997
99999998  99999998  99999998
99999999  99999999  99999999

[100000000 rows x 2 columns]

In [6]: def make_two_copies(df, deep):
   ...:     return df.copy(deep=deep), df.copy(deep=deep)
   ...: 

In [7]: x, y = make_two_copies(df, deep=True) # branch-23.02
In [7]: x, y = make_two_copies(df, deep=False) # This PR

# GPU memory usage
# branch-23.02 : 6147MiB
# This-PR : 2345MiB

In [8]: %timeit make_two_copies(df, deep=True) # branch-23.02
In [8]: %timeit make_two_copies(df, deep=False) # This PR

# Execution times
# branch-23.02 : 135 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 100 µs ± 879 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
  • Even when copy-on-write is disabled, deep copies of string, list & struct columns are now 99% faster.
In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000), dtype='str')

In [3]: %timeit s.copy(deep=True)


# branch-23.02 : 28.3 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This PR : 19.9 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [9]: s = cudf.Series([[1, 2], [2, 3], [3, 4], [4, 5], [6, 7]]* 10000000)

In [10]: %timeit s.copy(deep=True)
# branch-23.02 : 25.7 ms ± 5.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 44.2 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [4]: df = cudf.DataFrame({'a': range(0, 100000000), 'b': range(0, 100000000)[::-1]})

In [5]: s = df.to_struct()

In [6]: %timeit s.copy(deep=True)

# branch-23.02 : 42.5 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 89.7 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
  • Add pytests
  • Docs page explaining copy on write and how to enable/disable it.
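
The "99%" figures in the list above can be checked against the quoted %timeit means. The snippet below only redoes the arithmetic on the numbers reported in this description; it does not re-run any benchmark, and the labels are informal descriptions rather than names from the PR.

```python
# (label, branch-23.02 mean in seconds, this-PR mean in seconds)
timings = [
    ("two DataFrame copies (deep=True vs deep=False)", 135e-3, 100e-6),
    ("string Series deep copy", 28.3e-3, 19.9e-6),
    ("list Series deep copy", 25.7e-3, 44.2e-6),
    ("struct Series deep copy", 42.5e-3, 89.7e-6),
]


def reduction_percent(before: float, after: float) -> float:
    """Percentage of the original time that was eliminated."""
    return (1 - after / before) * 100


for label, before, after in timings:
    print(
        f"{label}: {before / after:,.0f}x faster, "
        f"{reduction_percent(before, after):.2f}% less time"
    )
```

Every row comes out above a 99% reduction, so the rounded claim is consistent with the measurements shown.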

galipremsagar and others added 6 commits January 13, 2023 16:12
Initial copy-on-write implementation
* get_ptr & _array_view refactor

* Apply suggestions from code review

Co-authored-by: Lawrence Mitchell <[email protected]>

* address reviews

* Apply suggestions from code review

Co-authored-by: Mads R. B. Kristensen <[email protected]>

* drop internal_write

* add docstring

* Apply suggestions from code review

Co-authored-by: Lawrence Mitchell <[email protected]>

* address reviews

* address reviews

* add locks

* Apply suggestions from code review

Co-authored-by: Mads R. B. Kristensen <[email protected]>

* Apply suggestions from code review

Co-authored-by: Vyas Ramasubramani <[email protected]>

* make mode a required key-arg

* rename to _readonly_proxy_cai_obj

* Update python/cudf/cudf/core/column/column.py

Co-authored-by: Lawrence Mitchell <[email protected]>

* revert

* fix

* Apply suggestions from code review

Co-authored-by: Lawrence Mitchell <[email protected]>
Co-authored-by: Mads R. B. Kristensen <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
@galipremsagar galipremsagar requested a review from a team as a code owner January 26, 2023 15:47
@galipremsagar galipremsagar requested review from wence- and vyasr January 26, 2023 15:47
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 26, 2023
@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 26, 2023
python/cudf/cudf/options.py (outdated, resolved)
@codecov

codecov bot commented Jan 26, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@e4ffcbb). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04   #12619   +/-   ##
===============================================
  Coverage                ?   85.85%           
===============================================
  Files                   ?      159           
  Lines                   ?    25329           
  Branches                ?        0           
===============================================
  Hits                    ?    21745           
  Misses                  ?     3584           
  Partials                ?        0           


@randerzander
Contributor

benchmark bot, please test this PR

Contributor

@vyasr vyasr left a comment

Can we copy the PR description from #11718 since this is now the PR that will be merged and show up in the changelog? Of course, in case any updates are needed given the various changes we have made since that PR, please make those changes in this description.

This review focused exclusively on the docs (IMO very important to describe COW well in our documentation). Will follow up with a code review.

docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
docs/cudf/source/developer_guide/library_design.md (outdated, resolved)
docs/cudf/source/developer_guide/library_design.md (outdated, resolved)
docs/cudf/source/developer_guide/library_design.md (outdated, resolved)
docs/cudf/source/developer_guide/library_design.md (outdated, resolved)
docs/cudf/source/developer_guide/library_design.md (outdated, resolved)
Contributor

@vyasr vyasr left a comment

OK done reviewing code as well. This is very close!

python/cudf/cudf/_lib/column.pyx (outdated, resolved)
python/cudf/cudf/core/buffer/buffer.py (outdated, resolved)
python/cudf/cudf/core/buffer/cow_buffer.py (outdated, resolved)
python/cudf/cudf/core/series.py (resolved)
python/cudf/cudf/options.py (outdated, resolved)
python/cudf/cudf/options.py (outdated, resolved)
python/cudf/cudf/options.py (outdated, resolved)
python/cudf/cudf/utils/applyutils.py (outdated, resolved)
python/cudf/cudf/tests/test_copying.py (outdated, resolved)
Contributor

@wence- wence- left a comment

Doc comments.

docs/cudf/source/developer_guide/library_design.md (outdated, resolved)
docs/cudf/source/developer_guide/library_design.md (outdated, resolved)
docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)

```bash
CUDF_COPY_ON_WRITE="1" python -c "import cudf"
```
Contributor

Can I switch between CoW on and CoW off in the same run, or is this a one-time option I should set before I actually create and manipulate any cudf objects? I suspect it is the latter. If so, we should call that out here.

Contributor Author

It's the former. Pandas allows that too:

In [1]: import pandas as pd

In [2]: pd.options.mode.copy_on_write = True

In [3]: s = pd.Series([1, 2, 1, 2])

In [4]: pd.options.mode.copy_on_write = True

In [5]: pd.options.mode.copy_on_write = False

In [6]: s.head(2)
Out[6]: 
0    1
1    2
dtype: int64

Since all of our buffer creation calls go through the as_buffer constructor, we are able to support COW on and off in the same run.
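
To illustrate why routing every allocation through one constructor makes runtime toggling workable, here is a toy model. It is a sketch under stated assumptions, not cudf's code: `_OPTIONS`, `set_option_sketch`, `as_buffer_sketch`, and both buffer classes are hypothetical stand-ins, and how cudf treats buffers that already exist when the option flips is not described in this thread.

```python
# Hypothetical stand-in for a library-wide option registry.
_OPTIONS = {"copy_on_write": False}


class ToyPlainBuffer:
    """Buffer type handed out while copy-on-write is off."""


class ToyCowBuffer(ToyPlainBuffer):
    """Buffer type handed out while copy-on-write is on."""


def set_option_sketch(name: str, value: bool) -> None:
    if name not in _OPTIONS:
        raise KeyError(name)
    _OPTIONS[name] = value


def as_buffer_sketch() -> ToyPlainBuffer:
    # The option is consulted at creation time, so flipping it mid-run
    # changes which class later calls return.
    return ToyCowBuffer() if _OPTIONS["copy_on_write"] else ToyPlainBuffer()


set_option_sketch("copy_on_write", True)
first = as_buffer_sketch()
set_option_sketch("copy_on_write", False)
second = as_buffer_sketch()
print(type(first).__name__, type(second).__name__)
```

In this model the single factory is the only place that needs to know about the option, which is the property the reply above relies on.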

Contributor

@wence- wence- left a comment

Some minor code comments, looking good!

python/cudf/cudf/_lib/column.pyx (outdated, resolved)
python/cudf/cudf/core/buffer/buffer.py (outdated, resolved)
python/cudf/cudf/core/buffer/cow_buffer.py (outdated, resolved)
python/cudf/cudf/core/buffer/cow_buffer.py (outdated, resolved)
python/cudf/cudf/core/buffer/cow_buffer.py (outdated, resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
python/cudf/cudf/core/frame.py (resolved)
python/cudf/cudf/core/multiindex.py (resolved)
python/cudf/cudf/options.py (outdated, resolved)
python/cudf/cudf/testing/_utils.py (resolved)
@galipremsagar galipremsagar requested a review from vyasr February 13, 2023 15:28
Contributor

@wence- wence- left a comment

I think this is really good and am happy to hit go. Can you just check that all the comments are resolved before merge? (I provided a few very minor ones here).

docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
docs/cudf/source/user_guide/copy-on-write.md (resolved)
python/cudf/cudf/core/column/lists.py (resolved)
python/cudf/cudf/testing/_utils.py (resolved)
python/cudf/cudf/tests/test_series.py (resolved)
Contributor

@vyasr vyasr left a comment

Some very minor comments, otherwise LGTM!

docs/cudf/source/user_guide/copy-on-write.md (outdated, resolved)
@property
def __cuda_array_interface__(self) -> dict:
    # Unlink if there are any weak references.
    self._unlink_shared_buffers()
Contributor

@galipremsagar unless my GH view isn't updated properly, it looks like the comments were merged but the unlink call is still happening above the comment. Not sure if that was intentional, I would still recommend moving that.
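
For readers unfamiliar with why the snippet under review unlinks before exposing the interface, the following toy model shows the general idea: once a writable handle leaves the object, the owner can no longer see writes, so any sharing has to be broken first. This is an illustrative sketch only. `ToyExposedBuffer`, `writable_view`, and `readonly_view` are hypothetical names, and Python's `memoryview` stands in for a device pointer.

```python
class ToyExposedBuffer:
    """Toy shared buffer that can hand out views of its storage."""

    def __init__(self, data: bytes):
        self._data = bytearray(data)
        self._owners = [self]  # hypothetical record of who shares _data

    def shallow_copy(self) -> "ToyExposedBuffer":
        other = ToyExposedBuffer.__new__(ToyExposedBuffer)
        other._data = self._data
        other._owners = self._owners
        self._owners.append(other)
        return other

    def _unlink(self) -> None:
        # Take a private copy if anyone else still shares the storage.
        if len(self._owners) > 1:
            self._owners.remove(self)
            self._data = bytearray(self._data)
            self._owners = [self]

    @property
    def writable_view(self) -> memoryview:
        # Unlink first: the caller may write through the view unobserved.
        self._unlink()
        return memoryview(self._data)

    @property
    def readonly_view(self) -> memoryview:
        # Read-only access cannot modify the data, so no copy is needed.
        return memoryview(self._data).toreadonly()


original = ToyExposedBuffer(b"abc")
copy = original.shallow_copy()
view = original.writable_view       # sharing is broken before the view escapes
view[0] = ord("z")
print(bytes(original.readonly_view), bytes(copy.readonly_view))
```

The write through the escaped view lands only in the buffer that handed it out, and the shallow copy keeps its original contents.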

python/cudf/cudf/core/column/column.py (resolved)
python/cudf/cudf/tests/test_copying.py (outdated, resolved)
python/cudf/cudf/tests/test_copying.py (outdated, resolved)
python/cudf/cudf/tests/test_copying.py (outdated, resolved)
python/cudf/cudf/tests/test_copying.py (outdated, resolved)
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Feb 16, 2023
@galipremsagar
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 506a479 into branch-23.04 Feb 16, 2023
@wence- wence- deleted the copy-on-write branch March 1, 2023 18:44
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.