Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer: make .ptr read-only #10872

Merged
merged 11 commits into from
May 25, 2022
Merged

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented May 17, 2022

This PR makes Buffer.ptr read-only and introduce Buffer.from_buffer:

@classmethod
def from_buffer(cls, buffer: Buffer, size: int = None, offset: int = 0):
    """
    Create a buffer from another buffer

    Parameters
    ----------
    buffer : Buffer
        The base buffer, which will also be set as the owner of
        the memory allocation.
    size : int, optional
        Size of the memory allocation (default: `buffer.size`).
    offset : int, optional
        Start offset relative to `buffer.ptr`.
    """

This is mainly motivated by my work on spilling by making it a bit easier to reason about the relationship between buffers.

@madsbk madsbk added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function breaking Breaking change labels May 17, 2022
@github-actions github-actions bot added the Python Affects Python cuDF API. label May 17, 2022
@codecov
Copy link

codecov bot commented May 17, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@6acf226). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #10872   +/-   ##
===============================================
  Coverage                ?   86.83%           
===============================================
  Files                   ?      144           
  Lines                   ?    23721           
  Branches                ?        0           
===============================================
  Hits                    ?    20598           
  Misses                  ?     3123           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6acf226...ec69d75. Read the comment docs.

@madsbk madsbk marked this pull request as ready for review May 17, 2022 14:37
@madsbk madsbk requested a review from a team as a code owner May 17, 2022 14:37
@madsbk madsbk requested review from vyasr and bdice May 17, 2022 14:37
@madsbk madsbk added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 17, 2022
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some high-level suggestions that may be beyond the scope of this PR. The most actionable suggestion I have for this PR is to make _size private. Beyond that, it might be good for us to consider the design of the class's constructor and factories as a whole. I don't know if we're locked into a specific API by external users or matching other packages, but we might be able to make this entire class less awkward.

python/cudf/cudf/core/buffer.py Outdated Show resolved Hide resolved

def __init__(
self, data: Any = None, size: Optional[int] = None, owner: Any = None
self, data: Any = None, size: int = None, owner: object = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that size is only used when data is an integer (representing a pointer) and is otherwise ignored. It seems like a more natural API would be to accept a tuple (pointer, size) rather than exposing a size argument that is ignored by a majority of the code paths.

Is there a reason to keep this API for aligning with some other class(es) outside of this package?

(Not necessary to address in this PR but we may want to open an issue.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is it not required to assign "Optional" types for parameters defaulting to None? (None is not an int, so I'm surprised mypy permits this without complaint.)

Copy link
Member Author

@madsbk madsbk May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to keep this API for aligning with some other class(es) outside of this package?

I don't think so, @vyasr do you know?

Also, is it not required to assign "Optional" types for parameters defaulting to None? (None is not an int, so I'm surprised mypy permits this without complaint.)

Yes and No. This is what we do throughout the RAPIDS code base and it is allowed by PEP-484. However, I see now that PEP-484 has been changed to recommend the use of Optional always.

Personally, I much prefer the old way. I think adding Optional to the vast majority of optional arguments bloats the code unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, @vyasr do you know?

Nope, it should be fine. There is a remote possibility that something in dask_cudf or dask-cuda is relying on the current behavior, but in either case we own those and can push the fixes downstream if we need to. That said, definitely agree with putting that fix in a subsequent PR.

Personally, I much prefer the old way. I think adding Optional to the vast majority of optional arguments bloats the code unnecessary.

I prefer having a style guide telling everybody what to do ;) I think we are currently very inconsistent in how we document optional parameters, but we may as well stick to what PEP 484 currently recommends since there's no telling when that will become a mypy error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer having a style guide telling everybody what to do ;) I think we are currently very inconsistent in how we document optional parameters, but we may as well stick to what PEP 484 currently recommends since there's no telling when that will become a mypy error.

Fair enough :) Maybe we should do a type hint overhauling at some point.

Copy link
Contributor

@bdice bdice May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madsbk Can you also discuss why you changed Any to object? The difference is subtle but significant. I see it discussed on StackOverflow but I'd be interested to hear your rationale.

Copy link
Member Author

@madsbk madsbk May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using object, we assume that owner has the same attributes as object and nothing else. Thus, mypy will complain if we do something like len(owner). This is exactly what we want since Buffer only uses owner to keep a reference to the underlying device memory alive.
On the other hand when using Any we assume that owner has any attribute thus mypy will accept anything including len(owner).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic! Thanks for that explanation, that aligns with my understanding as well.

@@ -72,23 +71,48 @@ def __init__(
raise TypeError("data must be Buffer, array-like or integer")
self._init_from_array_like(np.asarray(data), owner)

@classmethod
def from_buffer(cls, buffer: Buffer, size: int = None, offset: int = 0):
Copy link
Contributor

@bdice bdice May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider a slice API instead of a classmethod factory here. It might be more Pythonic?

Before:

buf = Buffer.from_buffer(
    buffer=codes_column.base_data,
    size=self.size * codes_column.dtype.itemsize,
    offset=self.offset * codes_column.dtype.itemsize,
)

After:

start_bytes = self.offset * codes_column.dtype.itemsize
end_bytes = start_bytes + self.size * codes_column.dtype.itemsize
buf = codes_column.base_data[start_bytes:end_bytes]

Suggested implementation:

def __getitem__(self, key):
    if not isinstance(key, slice):
        raise ValueError("Buffers can be sliced but individual items cannot be accessed.")
    start, stop, step = key.start, key.stop, key.step
    if step != 1:
        raise ValueError("A Buffer can only be sliced with step = 1.")
    if start is None:
        start = 0
    if stop is None:
        stop = len(self)
    if start < 0:
        start += len(self)
    if stop < 0:
        stop += len(self)
    if stop > len(self) or start < 0:
        raise ValueError("The slice start and stop must be in bounds.")
    if start > stop:
        raise ValueError("The slice start cannot be greater than the slice stop.")
    ret = cls()
    ret._ptr = self._ptr + start
    ret.size = stop - start
    ret._owner = self
    return ret


ret = cls()
ret._ptr = buffer._ptr + offset
ret.size = buffer.size if size is None else size
Copy link
Contributor

@bdice bdice May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider requiring size is not None if offset is provided (or doing bounds checks).

Buffer.from_buffer(buf, offset=32) will create a buffer with the same size as buf, but its end points out of bounds for the original buffer. As a result, it's slightly easier to misuse this API than a slice-based API like buf[32:].

@vyasr
Copy link
Contributor

vyasr commented May 17, 2022

I have some high-level suggestions that may be beyond the scope of this PR. The most actionable suggestion I have for this PR is to make _size private. Beyond that, it might be good for us to consider the design of the class's constructor and factories as a whole. I don't know if we're locked into a specific API by external users or matching other packages, but we might be able to make this entire class less awkward.

There have been lots of discussions about the overall purpose, design, and appropriate home for this class. rapidsai/rmm#227 is quite relevant. As you pointed out, most of these discussions are beyond the scope of this PR, so I would recommend that we not pursue them too much further here, but I'm happy to continue more discussions offline. I think keeping this PR narrow would be helpful to let Mads' work on spilling proceed without blockers.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small change requests and one API question about size and nbytes being synonyms. Otherwise LGTM. This is now on my radar to make a follow-up PR that can improve these APIs.

python/cudf/cudf/core/buffer.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/buffer.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/buffer.py Show resolved Hide resolved
@madsbk
Copy link
Member Author

madsbk commented May 18, 2022

Two small change requests and one API question about size and nbytes being synonyms. Otherwise LGTM. This is now on my radar to make a follow-up PR that can improve these APIs.

Thanks @bdice.
As part of the spilling work, I have to make more changes to Buffer or alternatively introduce a new kind of Buffer so please wait a bit with a follow-up PR :)

@bdice
Copy link
Contributor

bdice commented May 18, 2022

Sounds good @madsbk. Thanks again for your patience and explanation.

@madsbk
Copy link
Member Author

madsbk commented May 25, 2022

@vyasr, I think this is ready to be merged

@shwina
Copy link
Contributor

shwina commented May 25, 2022

@madsbk I don't foresee any issues from merging this PR. But we are technically in code freeze for 22.06, and this PR touches some pretty deep internals. Can we target 22.08 instead? I'll merge once it's retargeted.

@madsbk
Copy link
Member Author

madsbk commented May 25, 2022 via email

@shwina shwina changed the base branch from branch-22.06 to branch-22.08 May 25, 2022 16:41
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving & merging.

@shwina
Copy link
Contributor

shwina commented May 25, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ca4bc97 into rapidsai:branch-22.08 May 25, 2022
@madsbk
Copy link
Member Author

madsbk commented May 26, 2022

Thanks everyone!

@madsbk madsbk mentioned this pull request Aug 10, 2022
3 tasks
@madsbk madsbk deleted the buffer-ptr-read-only branch November 21, 2022 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants