Optimize buffer utility functions #546
Conversation
@jakirkham I'm getting some errors when I try the send/recv debug test. I'm guessing this has to do with some of the logic changes here.
Yeah, there aren't really tests of this functionality, which makes it hard to tell what would be expected of these functions. Think we should get some tests in first and then revisit.
I've added some tests in PR ( #549 ).
I wouldn't worry about reviewing this yet. We seem to still be getting failures atm. Would be good to get some tests in first ( #549 ) so we can better understand the origin of the failures.
We are seeing a test failure due to a recent change in Dask ( dask/distributed#4019 ). Submitted PR ( dask/distributed#4036 ) to fix that.
rerun tests
The `default` used in `.get(...)` is always `None`, so there is no need to specify it. Go ahead and drop it to simplify things.
As calling `hasattr` is just as expensive as `getattr` and slightly more expensive than plain attribute access, simply call `getattr` instead and handle the case where the attribute is not found with a `None` default.
To simplify later checks and operations, go ahead and assign `strides`.
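Taken together, these three changes amount to a pattern roughly like the following. This is a minimal Python sketch with a hypothetical `array_strides` helper, not the actual ucx-py code:

```python
def array_strides(arr):
    # One `getattr` with a `None` default replaces `hasattr` followed by
    # attribute access; `hasattr` performs the same lookup as `getattr`,
    # so checking first just does the work twice.
    iface = getattr(arr, "__cuda_array_interface__", None)

    # `.get` already defaults to `None`, so spelling out the default is
    # redundant. Assigning `strides` up front lets every later check
    # test the same variable.
    strides = iface.get("strides") if iface is not None else None
    return strides
```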
This got fixed in Numba 0.46+ and RAPIDS requires an even newer version of Numba. So we are safe to drop this workaround.
Went back through and profiled things carefully. After that, re-added Cython types to the Python variables and converted comprehensions to `for`-loops. With all of these changes, went through and timed the handling of an N-D array (including the case where the C-contiguous check would fail). Have included the times before and after below. In short, all cases are more than 2x faster. This doesn't sound like much, but these functions were already running in the microsecond range.
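For context, the comprehension-to-`for`-loop conversion described above looks roughly like this in Cython. This is an illustrative sketch with a hypothetical helper name, not the ucx-py source:

```cython
def nbytes_from_shape(tuple shape, Py_ssize_t itemsize):
    # Typed loop index and accumulator let Cython lower this to a plain
    # C loop; a comprehension (or functools.reduce) would instead go
    # through the general Python iterator protocol.
    cdef Py_ssize_t i, nbytes = itemsize
    for i in range(len(shape)):
        nbytes *= shape[i]
    return nbytes
```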
NumPy is the main user of `__array_interface__`. However NumPy also supports the Python Buffer Protocol, which is what is typically used for this purpose. This can easily be used with `memoryview`s as we have done here. So simply drop the `__array_interface__` check and use `memoryview` for NumPy as well.
The contiguous check allowed F-contiguous arrays through. However we don't support F-contiguous arrays in the array interface case, so update this check accordingly.
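For illustration, `memoryview` exposes both pieces of information these checks need, the byte count and the contiguity. A minimal sketch, with a hypothetical helper name:

```python
def host_buffer_nbytes(obj):
    # The buffer protocol covers NumPy arrays (as well as bytes,
    # bytearray, etc.), so no separate `__array_interface__` branch
    # is needed.
    mv = memoryview(obj)
    # Match the array interface path: accept only C-contiguous memory,
    # rejecting F-contiguous or otherwise strided buffers.
    if not mv.c_contiguous:
        raise ValueError("Array must be C-contiguous")
    return mv.nbytes
```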
Switched to using `memoryview` for NumPy as well.
As we are now `cimport`ing `get_buffer_data` and its return type is `uintptr_t`, these casts are no longer needed (with the exception of the `void*` cast). So drop the extra casts.
Have now also exposed the utility functions directly in Cython to leverage faster calls between Cython code. In Python we are still able to import the functions and use them from Python, so the Python code paths are unaffected.
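A hedged sketch of how that dual exposure typically looks in Cython (the file layout and the `check_writable` parameter are assumptions, not the exact ucx-py signatures): a `.pxd` declaration gives other Cython modules a fast C-level entry point via `cimport`, while `cpdef` keeps a regular Python wrapper importable as before. Because the declared return type is `uintptr_t`, `cimport`ing callers also get the typing that made the extra casts above unnecessary.

```cython
# utils.pxd (sketch): declaring the function here lets other Cython
# modules `cimport` it and call it through a C-level entry point.
from libc.stdint cimport uintptr_t

cpdef uintptr_t get_buffer_data(buffer, bint check_writable=*) except *
```

```cython
# utils.pyx (sketch): `cpdef` emits both a C entry point (used after
# `cimport`) and a Python-callable wrapper, so
# `from ucp._libs.utils import get_buffer_data` keeps working unchanged.
from libc.stdint cimport uintptr_t

cpdef uintptr_t get_buffer_data(buffer, bint check_writable=False) except *:
    ...
```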
LGTM, nice work @jakirkham
LGTM, thanks @jakirkham !
Adding the benchmark here from the new Conda packages...

NumPy after:

```
In [1]: import cupy
   ...: import numpy
   ...:
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes

In [2]: def run(func, *args, **kwargs):
   ...:     try:
   ...:         func(*args, **kwargs)
   ...:     except Exception:
   ...:         pass
   ...:

In [3]: s = (10, 11, 12, 13)
   ...: a = numpy.arange(numpy.prod(s)).reshape(s)

In [4]: %timeit get_buffer_data(a)
628 ns ± 2.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)
764 ns ± 4.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)
1.49 µs ± 4.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

CuPy after:

```
In [1]: import cupy
   ...: import numpy
   ...:
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes

In [2]: def run(func, *args, **kwargs):
   ...:     try:
   ...:         func(*args, **kwargs)
   ...:     except Exception:
   ...:         pass
   ...:

In [3]: s = (10, 11, 12, 13)
   ...: a = cupy.arange(cupy.prod(cupy.array(s))).reshape(s)

In [4]: %timeit get_buffer_data(a)
877 ns ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)
1.3 µs ± 1.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)
2.34 µs ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
I know I'm late but this is a really nice performance improvement!
Rewrites buffer utility functions to get more optimizations out of Cython. This includes typing variables where possible, leveraging `for`-loops that are easier for Cython to lower to C, making use of Cython `memoryview`s, and some code simplification (where possible).

cc @pentschev @madsbk @quasiben