Add frombytes to convert bytes-like to DeviceBuffer #253
Conversation
Force-pushed from 7d043ba to 1da0aa3
frombytes to convert bytes to DeviceBuffer
frombytes to convert bytes-like to DeviceBuffer
Force-pushed from 1da0aa3 to a8a85b9
Instead of strictly requiring `frombytes` take a `bytes` object, allow other `bytes`-like objects as well. This continues to support `bytes`, but could allow things like `bytearray`s, `memoryview`s, NumPy arrays, etc. Further all of these are supported without needing to copy the underlying data.
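For illustration, here's roughly how this looks from Python (a sketch based on the discussion here; DeviceBuffer.tobytes() is assumed for the round-trip check):

import numpy as np
import rmm

data = b"abc" * 1000

# Each of these is accepted without an intermediate host-side copy.
db1 = rmm.DeviceBuffer.frombytes(data)                            # bytes
db2 = rmm.DeviceBuffer.frombytes(bytearray(data))                 # bytearray (mutable)
db3 = rmm.DeviceBuffer.frombytes(memoryview(data))                # memoryview
db4 = rmm.DeviceBuffer.frombytes(np.frombuffer(data, np.uint8))   # NumPy array

assert db1.tobytes() == data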
Force-pushed from a8a85b9 to 6eeaad8
@@ -57,17 +59,36 @@ cdef class DeviceBuffer:
        buf.c_obj = move(ptr)
        return buf

    @staticmethod
    @cython.boundscheck(False)
    cdef DeviceBuffer c_frombytes(const unsigned char[::1] b,
Does this handle memoryview objects in general? If so, it would be nice if we could have an overloaded non-const version to handle things like bytearray or other non-readonly objects.
It handles memoryview objects as long as they are 1-D, contiguous, and of type unsigned char.

The const just means that we won't modify the underlying buffer (much like if we took const char* and were given char*). So any mutable object meeting the constraints above could be passed in as well; this just promises we won't modify the underlying data. IOW bytearray is permissible and we test for this below.

As to other memory types and layouts that don't meet these requirements, users could coerce them into something that does (with a minimal amount of copying to ensure contiguous data). I decided not to do that for the user, as I wanted them to know we are not preserving the type or layout of their data. Though they could take the resulting DeviceBuffer and layer this information back on top by constructing some other array (like CuPy or Numba) and adding this info to that object.
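To make the coercion point concrete, here's a rough sketch (assuming NumPy and CuPy are available; the exact calls are illustrative, not part of this PR):

import numpy as np
import cupy
import rmm

a = np.arange(12, dtype=np.float32).reshape(3, 4)[:, ::2]  # non-contiguous float32

# Coerce to something c_frombytes accepts: 1-D, contiguous, unsigned char.
host_bytes = np.ascontiguousarray(a).view(np.uint8).ravel()
db = rmm.DeviceBuffer.frombytes(host_bytes)

# Layer the dtype/shape back on top with another array library (here CuPy),
# since DeviceBuffer itself does not preserve this information.
mem = cupy.cuda.UnownedMemory(db.ptr, db.size, db)
d = cupy.ndarray(a.shape, dtype=a.dtype,
                 memptr=cupy.cuda.MemoryPointer(mem, 0))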
Agreed with not implicitly making things contiguous at this level. I'm surprised that a bytearray is allowed for a const unsigned char[::1]. We previously ran into issues with using const vs. non-const when handling NumPy arrays that were marked read-only, which is what prompted me to question an overload.
Ok cool.

Yeah, I recall some cases where Cython didn't work as well with const memoryviews. Issue cython/cython#1772 comes to mind (though that's with fused types). Not sure if there were issues with const memoryviews on a specific type. Was this what you were thinking about?
Note: After discussion offline, we decided to add one more test using a read-only NumPy array, which has been included in commit 2234e5a.
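For reference, such a test might look roughly like this (a hypothetical sketch, not the actual contents of 2234e5a):

import numpy as np
import rmm

def test_frombytes_readonly_array():
    # np.frombuffer over bytes yields a read-only array, exercising the
    # const memoryview path.
    a = np.frombuffer(b"abc" * 100, dtype=np.uint8)
    assert not a.flags.writeable
    db = rmm.DeviceBuffer.frombytes(a)
    assert db.tobytes() == a.tobytes()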
Here's a rough benchmark.

In [1]: import numpy

In [2]: import numba.cuda

In [3]: import rmm

In [4]: rmm.reinitialize(pool_allocator=True,
   ...:                  initial_pool_size=int(2**30))
Out[4]: 0

In [5]: hb = numpy.asarray(memoryview(50_000_000 * b"a"))

In [6]: %timeit numba.cuda.to_device(hb)
19.6 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit rmm.DeviceBuffer.frombytes(hb)
4.69 ms ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit numba.cuda.as_cuda_array(rmm.DeviceBuffer.frombytes(hb))
5.22 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

IOW one gets a ~4x improvement using frombytes.
Should add, we may be able to speed this up further by using hugepages.
~5 ms still seems really high from my perspective. Have you confirmed via a Python profile that there's no silly Python overhead?
That's 10 GB/s. What would you expect instead?

Edit: This is from host memory to device memory on a DGX-1.
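(For context, the 10 GB/s figure falls straight out of the timings above:)

# 50 MB transferred in ~4.69 ms per call.
nbytes = 50_000_000
seconds = 4.69e-3
print(nbytes / seconds / 1e9)  # ~10.7 GB/s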
Reading the DGX-1's specs, they state, "Pascal also supports 16 lanes of PCIe 3.0. In DGX-1, these are used for connecting between the CPUs and GPUs." According to Wikipedia, this should give us a throughput of 15.75 GB/s. So it's possible to get 1.5x faster, but it doesn't seem like we are way slower. Please feel free to correct me.
I brainfarted and went off by an order of magnitude 😄. In practice that tends to go more towards ~12 GB/s, so we're pretty close here, which looks good.
No worries. It was good to work through the numbers. 🙂 Am curious how we might get closer to 12 GB/s (or even more) if possible. Do you have any ideas? 😉
I'm guessing that involves lower-level things like pinned memory, using gdrcopy, etc.
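A rough sketch of the pinned-memory idea (assuming Numba's numba.cuda.pinned_array; whether this path actually reaches peak bandwidth here is untested):

import numpy as np
import numba.cuda
import rmm

hb = np.frombuffer(50_000_000 * b"a", dtype=np.uint8)

# Stage the data in page-locked (pinned) host memory, which generally
# allows faster host-to-device transfers than pageable memory.
pinned = numba.cuda.pinned_array(hb.shape, dtype=hb.dtype)
pinned[:] = hb

db = rmm.DeviceBuffer.frombytes(pinned)  # e.g. %timeit this vs. the pageable case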
Makes it easier to build a DeviceBuffer object from an existing bytes-like object in pure Python code.