Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) #13307
A first pass, haven't fully managed to go through all the logic yet
`TenableBuffer` is a subclass of the regular `Buffer` that tracks its "expose" status of its underlying memory. We say that the buffer has been exposed if the device pointer (integer or void*) has been accessed outside of cudf, in which case we have no control over knowing if the data is being modified by a third-party. Additionally, `TenableBuffer` also maintains [weak references](https://docs.python.org/3/library/weakref.html) to every existing `BufferSlice` that points to its underlying memory.

`BufferSlice` is a subclass of `TenableBuffer` that represents a _slice_ of the memory underlying a tenable buffer.
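The exposure and weak-reference bookkeeping described in this docstring can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: `TrackedBuffer`, `Slice`, `mark_exposed`, and `raw_pointer` are hypothetical names, a `bytearray` stands in for device memory, and the slice holds its buffer by composition rather than by subclassing. It is not the cudf implementation.

```python
import weakref


class TrackedBuffer:
    """Hypothetical owning buffer that tracks exposure and live slices."""

    def __init__(self, data: bytearray) -> None:
        self._data = data
        self._exposed = False
        # Weak references do not keep slices alive: an entry disappears
        # once the corresponding slice has been garbage collected.
        self._slices = weakref.WeakSet()

    @property
    def exposed(self) -> bool:
        return self._exposed

    def mark_exposed(self) -> None:
        # Exposure is one-way: once a raw pointer has escaped, we can no
        # longer know whether a third party still holds it.
        self._exposed = True


class Slice:
    """Hypothetical view of a region of a TrackedBuffer."""

    def __init__(self, base: TrackedBuffer, offset: int, size: int) -> None:
        self._base = base
        self._offset = offset
        self._size = size
        base._slices.add(self)  # register this view with the buffer

    def raw_pointer(self) -> int:
        # Handing out a raw address marks the underlying buffer as exposed.
        # The returned integer is only a placeholder for a device pointer.
        self._base.mark_exposed()
        return self._offset
```

In this sketch the buffer can always ask how many views are alive (`len(buf._slices)`) without extending their lifetime, which is the property the weak references are there to provide.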
Is subclassing the right thing? (I haven't yet read the implementation.) It seems like every `BufferSlice` has-a `TenableBuffer`, but that doesn't necessarily imply an is-a relationship.
If it really is a subclass, do we actually need both classes, or can there just be `TenableBuffer` objects that either own data or are views of existing data?
In a follow-up PR, `SpillableBuffer` will inherit from `TenableBuffer` and `BufferSlice._base` will point to `SpillableBuffer`.
Maybe it is better if `BufferSlice` inherits from `Buffer`?
I suppose this is similar to the `column` vs. `column_view` idea in libcudf. A `Buffer` (`TenableBuffer`, `SpillableBuffer`) is the concrete owning object, and then the `BufferSlice` is a non-owning view? Or how does the ownership work? Does a `BufferSlice` own the `Buffer` it slices (or share ownership with multiple slices)?
A buffer Protocol might make sense, but then I think we should do it in a follow up PR
Alternatively, we could have `Buffer` use `BufferSlice` so that `BufferSlice` is the only buffer object used in the rest of cudf.
In any case, I think we should wait until `SpillableBuffer` also uses `BufferSlice` and `TenableBuffer` so we have a better picture of the exact use cases.
Co-authored-by: Lawrence Mitchell <[email protected]>
A few small change requests, but in general this looks like the right direction. Thanks @madsbk! I can see from this PR how you would integrate spilling, but also appreciate you splitting up the work this way to make the changes incrementally.
if exposed:
    raise ValueError("cannot created exposed host memory")
return cast(
    BufferSlice, ExposureTrackedBuffer._from_host_memory(data)[:]
Is the eventual plan for `Buffer.__getitem__` to return a `BufferSlice`? If not, it might be cleaner to override the method in `ExposureTrackedBuffer`. I know the whole point of the `_getitem`/`__getitem__` split is to help share some functionality, but the typing confusion here indicates that incorrect types could result from that approach. Obviously we can coerce the code into behaving correctly, but it is much harder to write intrinsically type-safe code if the type annotations aren't sufficiently valid.
I agree, this is a very valid point!
I think the clean design is to make cudf always work on `BufferSlice`, even when COW and spilling are disabled. Then we get a clean class hierarchy:
- COW & spilling disabled: `BufferSlice -> Buffer`
- COW enabled: `BufferSlice -> ExposureTrackedBuffer -> Buffer`
- Spilling enabled (once it has been unified with COW in a follow-up PR): `BufferSlice -> SpillableBuffer -> ExposureTrackedBuffer -> Buffer`

The downside is that this approach is a bit more intrusive in the default case where COW and spilling are disabled. I think it is worth it, but what do you guys think?
I agree that it is probably worth it. I would stage that as work to be done after the COW and spilling unification is complete and we can reevaluate in the context of a cleaner architecture.
Co-authored-by: Vyas Ramasubramani <[email protected]>
Very minor typo fix, thanks!
    "shape": (self.size,),
    "strides": None,
    "typestr": "|u1",
    "version": 0,
}

- def get_ptr(self, *, mode) -> int:
+ def get_ptr(self, *, mode: Literal["read", "write"]) -> int:
No need to change this, but just to note that `Literal` doesn't get handled very well by type-checkers if the argument comes from a variable rather than a literal value (unless the variable is marked with `: Final`), since they don't do dataflow analysis:
https://mypy-play.net/?mypy=latest&python=3.11&gist=6541ee12e80daeb4b0837563d98dc442
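A small sketch of the situation described above, for readers who want to try it in a type checker. The free function `get_ptr` below is a hypothetical stand-in with the same keyword-only `mode` parameter, not the cudf method, and the comments describe how mypy is expected to behave; other checkers may narrow the variable's type differently.

```python
from typing import Final, Literal


def get_ptr(*, mode: Literal["read", "write"]) -> int:
    """Hypothetical stand-in that returns a placeholder integer."""
    if mode not in ("read", "write"):
        raise ValueError(f"unknown mode: {mode!r}")
    return 0 if mode == "read" else 1


# A literal argument matches the Literal annotation directly.
get_ptr(mode="read")

# A plain variable is inferred as `str` by mypy, so mypy is expected to
# flag this call even though it runs fine.
mode = "read"
get_ptr(mode=mode)

# Marking the variable Final lets the checker keep the literal type.
MODE: Final = "write"
get_ptr(mode=MODE)
```

All three calls behave identically at runtime; the difference is only in what a static checker can prove about the argument.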
Co-authored-by: Lawrence Mitchell <[email protected]>
/merge
Thanks all for the reviews
…13801)

This PR de-couples buffer slices/views from owning buffers. As it is now, all buffer classes (`ExposureTrackedBuffer`, `BufferSlice`, `SpillableBuffer`, `SpillableBufferSlice`) inherit from `Buffer`; however, they are not Liskov substitutable, as pointed out by @wence- and @vyasr ([here](#13307 (comment)) and [here](#13307 (comment))). To fix this, we now have a `Buffer` and a `BufferOwner` class. We still use `Buffer` throughout cuDF, but it now points to a `BufferOwner`.

We have the following class hierarchy:
```
ExposureTrackedBufferOwner -> BufferOwner
SpillableBufferOwner -> BufferOwner
ExposureTrackedBuffer -> Buffer
SpillableBuffer -> Buffer
```
With the following relationship:
```
Buffer -> BufferOwner
ExposureTrackedBuffer -> ExposureTrackedBufferOwner
SpillableBuffer -> SpillableBufferOwner
```

#### Unify COW and Spilling

In a follow-up PR, the spilling buffer classes will inherit from the exposure tracked buffer classes so we get the following hierarchy:
```
SpillableBufferOwner -> ExposureTrackedBufferOwner -> BufferOwner
SpillableBuffer -> ExposureTrackedBuffer -> Buffer
```

Authors:
- Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #13801
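The `Buffer -> BufferOwner` relationship above is a has-a relationship rather than an is-a one. A minimal sketch of that split, assuming a `bytearray` in place of device memory and an (owner, offset, size) layout chosen purely for illustration (the constructor signatures and `tobytes` helper below are hypothetical, not the cudf API):

```python
from typing import Optional


class BufferOwner:
    """Hypothetical owner of an allocation (bytearray as stand-in memory)."""

    def __init__(self, data: bytearray) -> None:
        self._data = data

    @property
    def size(self) -> int:
        return len(self._data)


class Buffer:
    """Hypothetical view: refers to an owner, never subclasses it."""

    def __init__(
        self, owner: BufferOwner, offset: int = 0, size: Optional[int] = None
    ) -> None:
        if size is None:
            size = owner.size - offset
        if offset < 0 or size < 0 or offset + size > owner.size:
            raise ValueError("view is out of bounds of its owner")
        self._owner = owner
        self._offset = offset
        self._size = size

    @property
    def owner(self) -> BufferOwner:
        return self._owner

    @property
    def size(self) -> int:
        return self._size

    def __getitem__(self, key: slice) -> "Buffer":
        if not isinstance(key, slice):
            raise TypeError("only slice indexing is supported")
        start, stop, step = key.indices(self._size)
        if step != 1:
            raise ValueError("only contiguous slices are supported")
        # Slicing a view yields another view onto the same owner.
        return Buffer(self._owner, self._offset + start, max(stop - start, 0))

    def tobytes(self) -> bytes:
        end = self._offset + self._size
        return bytes(self._owner._data[self._offset:end])
```

Because a view is never an instance of the owner type, code that accepts a `Buffer` cannot accidentally be handed an owning object with different semantics, which is the substitutability concern raised in the review.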
The first step towards unifying copy-on-write and spillable buffers.
This PR re-implements copy-on-write by introducing an `ExposureTrackedBuffer` and a `BufferSlice`. The idea is that when `copy-on-write` (and, in a follow-up PR, when `spill`) is enabled, we use `BufferSlice` throughout cudf. `BufferSlice` is a view of an `ExposureTrackedBuffer` that implements copy-on-write semantics by tracking the number of `BufferSlice` objects that point to the same `ExposureTrackedBuffer`.
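As a rough illustration of copy-on-write driven by a count of live views, the sketch below copies the underlying data only when a write happens while more than one view shares it. The names `Owner`, `CowSlice`, `read`, and `write` are hypothetical, each view covers the whole allocation for simplicity, and a `bytearray` stands in for device memory; this is not the code from this PR.

```python
import weakref


class Owner:
    """Hypothetical owning buffer that tracks its live views weakly."""

    def __init__(self, data: bytearray) -> None:
        self.data = data
        self.slices = weakref.WeakSet()


class CowSlice:
    """Hypothetical view with copy-on-write semantics."""

    def __init__(self, owner: Owner) -> None:
        self._owner = owner
        owner.slices.add(self)

    def copy(self) -> "CowSlice":
        # A "copy" is cheap: it is just another view of the same owner.
        return CowSlice(self._owner)

    def read(self) -> bytes:
        return bytes(self._owner.data)

    def write(self, index: int, value: int) -> None:
        if len(self._owner.slices) > 1:
            # Another view shares this memory, so detach onto a private
            # copy before modifying anything.
            self._owner.slices.discard(self)
            self._owner = Owner(bytearray(self._owner.data))
            self._owner.slices.add(self)
        self._owner.data[index] = value
```

A write through one view therefore never becomes visible through another, while a view that is the sole user of its memory writes in place without copying.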
cc. @shwina, @vyasr, @galipremsagar, @wence-