
Optimize buffer utility functions #546

Merged: jakirkham merged 33 commits into rapidsai:branch-0.15 from typ_get_buf_dat_nbytes on Aug 12, 2020

Conversation

@jakirkham (Member)

Rewrites the buffer utility functions so that Cython can optimize them further. This includes typing variables where possible, using for-loops that Cython can lower to plain C loops, making use of Cython memoryviews, and simplifying code where possible.
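To give a flavor of the memoryview piece, here is a minimal sketch (my own illustration with made-up names, not the code in this PR) of using a typed Cython memoryview to pull a pointer and byte count out of a buffer-protocol object:

# sketch.pyx -- illustrative only
from libc.stdint cimport uintptr_t

cpdef tuple host_buffer_info(object obj):
    # View the object's buffer as raw bytes; requires C-contiguous data.
    cdef const unsigned char[::1] mv = memoryview(obj).cast("B")
    # Indexing a typed memoryview compiles down to a plain pointer dereference.
    return <uintptr_t>&mv[0], mv.shape[0]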

cc @pentschev @madsbk @quasiben

@pentschev (Member) commented Jun 15, 2020

@jakirkham I'm getting some errors when I try the send/recv debug test:

$ UCX_TLS=tcp,sockcm,cuda_copy UCX_SOCKADDR_TLS_PRIORITY=sockcm python recv.py
'CUDA_VISIBLE_DEVICES'
ITER:  0
IndexError: Out of bounds on buffer access (axis 0)
Exception ignored in: 'ucp._libs.utils.get_buffer_nbytes'
IndexError: Out of bounds on buffer access (axis 0)
Traceback (most recent call last):
  File "recv.py", line 140, in <module>
    test_send_recv_cu(cupy)
  File "recv.py", line 113, in test_send_recv_cu
    client(env2, port, func)
  File "recv.py", line 58, in client
    asyncio.get_event_loop().run_until_complete(read())
  File "/datasets/pentschev/miniconda3/envs/rn-102-0.15.0b200615/lib/python3.7/asyncio/base_events.py", line 583, in run_until_complete
    return future.result()
  File "recv.py", line 47, in read
    frames, msg = await recv(ep)
  File "/datasets/pentschev/src/ucx-py/debug-tests/utils.py", line 104, in recv
    await ep.recv(frame)
  File "/datasets/pentschev/miniconda3/envs/rn-102-0.15.0b200615/lib/python3.7/site-packages/ucp/core.py", line 604, in recv
    ret = await ucx_api.tag_recv(self._ep, buffer, nbytes, tag, log_msg=log)
ucp.exceptions.UCXMsgTruncated: Comm Error "[Recv #005] ep: 0x7fcf28038048, tag: 0x68f5dfcc8a2d7c57, nbytes: 0, type: <class 'rmm._lib.device_buffer.DeviceBuffer'>": length mismatch: 472392 (got) != 0 (expected)

I'm guessing this has to do with some of the logic changes in get_buffer_nbytes, but I'm not sure whether this is ready to test or if you're still working on it.

@jakirkham (Member, Author)

Yeah, there aren't really any tests of this functionality, which makes it hard to tell what is expected of these functions. I think we should get some tests in first and then revisit.

@jakirkham (Member, Author)

I've added some tests in PR ( #549 ).

@jakirkham (Member, Author)

I wouldn't worry about reviewing this yet. We still seem to be getting failures at the moment. It would be good to get some tests in first ( #549 ) so we can better understand the origin of the failures.

@jakirkham changed the title from "WIP: Optimize buffer utility functions" to "Optimize buffer utility functions" on Aug 10, 2020
@jakirkham marked this pull request as ready for review on August 10, 2020 at 21:24
@jakirkham requested a review from a team as a code owner on August 10, 2020 at 21:24
@jakirkham (Member, Author)

We are seeing a test failure due to a recent change in Dask ( dask/distributed#4019 ). Submitted PR ( dask/distributed#4036 ) to fix that.

@jakirkham (Member, Author)

rerun tests

The `default` used in `.get(...)` is always `None`, so there is no need to specify it. Go ahead and drop it to simplify things.
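For illustration (not the actual diff; `iface` stands in for an array-interface dict):

# Before:
strides = iface.get("strides", None)
# After -- equivalent, since .get() already defaults to None:
strides = iface.get("strides")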
As calling `hasattr` is just as expensive as `getattr`, and slightly more expensive than plain attribute access, simply call `getattr` instead and handle the missing-attribute case with a `None` default.
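Roughly the following pattern (illustrative only; `obj` is a placeholder):

# Before: hasattr() plus a second lookup when the attribute exists.
if hasattr(obj, "__cuda_array_interface__"):
    iface = obj.__cuda_array_interface__
else:
    iface = None

# After: a single getattr() with a None fallback.
iface = getattr(obj, "__cuda_array_interface__", None)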
To simplify later checks and operations, go ahead and assign `strides`.
This was fixed in Numba 0.46+, and RAPIDS requires an even newer version of Numba, so we are safe to drop this workaround.
@jakirkham force-pushed the typ_get_buf_dat_nbytes branch from 37d8663 to 1115256 on August 11, 2020 at 05:15
@jakirkham (Member, Author)

I went back through and profiled things carefully. I noted that `hasattr` was costing us (more details in the OP of PR #573, since included here). So I changed the code to use `getattr` instead, which helped quite a bit. Then I made some other similar changes focusing just on the Python code.
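For reference, the kind of micro-benchmark that exposes this (illustrative; I'm not reproducing exact numbers here):

In [1]: import numpy
   ...: a = numpy.arange(5)

In [2]: %timeit a.__array_interface__ if hasattr(a, "__array_interface__") else None  # two lookups

In [3]: %timeit getattr(a, "__array_interface__", None)  # one lookup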

After that, I went back and re-added Cython types to the Python variables and converted comprehensions to for-loops, though I have skipped dealing with arrays. Arrays are a bit slower, but maybe not noticeably so given the overall runtime. They can be revisited in another PR, but I would prefer to skip them for now.
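As a rough sketch of that conversion (not the actual diff; the function name is made up):

# Before: a Python-level reduction over the shape.
#     nbytes = itemsize * functools.reduce(operator.mul, shape, 1)

# After: typed locals and a plain loop that Cython lowers to C.
cpdef Py_ssize_t nbytes_from_shape(tuple shape, Py_ssize_t itemsize):
    cdef Py_ssize_t i, nbytes = itemsize
    for i in range(len(shape)):
        nbytes *= shape[i]
    return nbytes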

With all of these changes in place, I went through and timed the handling of an N-D array (including the case where the C-contiguous check fails). The times before and after are included below. In short, all cases are more than 2x faster. That may not sound like much in absolute terms, but these functions already run in about 1 µs.

Runtime of get_buffer_data and get_buffer_nbytes before:
In [1]: import cupy 
   ...: import numpy 
   ...:  
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes                                

In [2]: def run(func, *args, **kwargs): 
   ...:     try: 
   ...:         func(*args, **kwargs) 
   ...:     except Exception: 
   ...:         pass 
   ...:                                                                                               

In [3]: s = (10, 11, 12, 13) 
   ...: a = numpy.arange(numpy.prod(s)).reshape(s)                                                    

In [4]: %timeit get_buffer_data(a)                                                                    
2.72 µs ± 8.47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)                          
3.7 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)                   
4.69 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Runtime of get_buffer_data and get_buffer_nbytes after:
In [1]: import cupy 
   ...: import numpy 
   ...:  
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes                                

In [2]: def run(func, *args, **kwargs): 
   ...:     try: 
   ...:         func(*args, **kwargs) 
   ...:     except Exception: 
   ...:         pass 
   ...:                                                                                               

In [3]: s = (10, 11, 12, 13) 
   ...: a = numpy.arange(numpy.prod(s)).reshape(s)                                                    

In [4]: %timeit get_buffer_data(a)                                                                    
1.18 µs ± 9.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)                          
1.66 µs ± 9.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)                   
2.33 µs ± 8.45 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Additionally, we fast-path objects through get_buffer_nbytes when their strides is None. CuPy, Numba, and NumPy already set strides to None when the data is C-contiguous, and serialization through Dask turns any non-contiguous array into a 1-D contiguous array. So any array that Dask gives us goes through the fast path.
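The fast path amounts to something like this (a sketch of the check, not the exact code; the function name is hypothetical):

import functools
import operator

def fast_nbytes(obj):
    iface = getattr(obj, "__cuda_array_interface__", None)
    if iface is not None and iface.get("strides") is None:
        # The exporter already declares the data C-contiguous, so skip the
        # per-dimension stride comparison and just compute the size.
        itemsize = int(iface["typestr"][2:])  # e.g. "<f8" -> 8
        return itemsize * functools.reduce(operator.mul, iface["shape"], 1)
    return None  # fall back to the full stride/contiguity handling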

This was not true of cuDF Buffers or RMM DeviceBuffers for historical reasons (CuPy and Numba compatibility issues). Those issues no longer apply, so I have submitted PRs ( rapidsai/cudf#5917 and rapidsai/rmm#477 ) that will fast-path these objects in the future.

In short, I think we've collectively gotten about as much out of this exercise as we can for the moment. We might be able to make further improvements by refactoring (the contiguity check could be separated from the size check). It's also possible a lightweight extension type could strike the right balance between extracting the array interface once and letting us query common properties cheaply as needed. That can probably be saved for a future PR.
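The extension-type idea might look something like this (purely a hypothetical sketch, limited to the CUDA array interface and C-contiguous data for brevity):

from libc.stdint cimport uintptr_t

cdef class BufferInfo:
    # Parse the interface dict once; later queries are plain C field reads.
    cdef readonly uintptr_t ptr
    cdef readonly Py_ssize_t nbytes

    def __cinit__(self, object obj):
        cdef dict iface = obj.__cuda_array_interface__
        cdef Py_ssize_t i, itemsize = int(iface["typestr"][2:])
        self.ptr = iface["data"][0]
        self.nbytes = itemsize
        for i in range(len(iface["shape"])):
            self.nbytes *= iface["shape"][i]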

NumPy is the main user of `__array_interface__`. However, NumPy also supports the Python buffer protocol, which is what is typically used for this purpose and which works easily with `memoryview`s, as we have done here. So drop the `__array_interface__` check and use `memoryview` for NumPy as well.
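In other words, roughly (illustrative; `obj` is a placeholder):

# Instead of reading obj.__array_interface__, go through the buffer protocol.
mv = memoryview(obj)          # works for NumPy arrays and anything buffer-like
nbytes = mv.nbytes            # total size in bytes
contiguous = mv.c_contiguous  # contiguity without inspecting strides by hand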
The contiguity check allowed F-contiguous arrays through. However, we don't support F-contiguous arrays in the array interface case, so update the check accordingly.
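A sketch of the stricter check (my own illustration with placeholder names, not the PR's code):

def is_c_contiguous(shape, strides, itemsize):
    # C order: the expected stride grows from the last dimension backwards.
    if strides is None:
        return True
    expected = itemsize
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim > 1 and stride != expected:
            return False
        expected *= dim
    return True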
@jakirkham (Member, Author)

Switched to using memoryview for NumPy. This lets us get rid of another attribute check (already shown to be expensive). We also now check __cuda_array_interface__ first, with the memoryview case as the fallback.

@jakirkham force-pushed the typ_get_buf_dat_nbytes branch from 33ced6a to 9f278e3 on August 11, 2020 at 19:33
@jakirkham force-pushed the typ_get_buf_dat_nbytes branch from 9f278e3 to 920c3b8 on August 11, 2020 at 20:02
As we are now `cimport`ing `get_buffer_data` and its return type is `uintptr_t`, these casts are no longer needed (with the exception of the `void*` cast), so drop the extra casts.
@jakirkham (Member, Author)

I've now also exposed the utility functions directly in Cython to get faster calls from other Cython code. The functions can still be imported and used from Python, so the Python code paths are unaffected.
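Mechanically, that pattern looks like this (a generic sketch with hypothetical names, not the project's actual declarations):

# utils.pxd -- declarations that other Cython modules can cimport
from libc.stdint cimport uintptr_t
cpdef uintptr_t get_ptr(object obj)

# utils.pyx -- implementation; still importable from plain Python
cpdef uintptr_t get_ptr(object obj):
    cdef const unsigned char[::1] mv = memoryview(obj).cast("B")
    return <uintptr_t>&mv[0]

# caller.pyx -- cimporting gives a direct C-level call instead of a Python call
from .utils cimport get_ptr

cdef void use(object buf):
    # The only cast still needed is to void* when handing the pointer to C.
    cdef void* p = <void*>get_ptr(buf)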

@madsbk (Member) left a comment

LGTM, nice work @jakirkham

@pentschev (Member) left a comment

LGTM, thanks @jakirkham!

@jakirkham merged commit 181764a into rapidsai:branch-0.15 on Aug 12, 2020
@jakirkham deleted the typ_get_buf_dat_nbytes branch on August 12, 2020 at 17:56
@jakirkham (Member, Author)

Adding the benchmark here from the new Conda packages...

NumPy after:
In [1]: import cupy
   ...: import numpy
   ...: 
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes

In [2]: def run(func, *args, **kwargs):
   ...:     try:
   ...:         func(*args, **kwargs)
   ...:     except Exception:
   ...:         pass
   ...: 

In [3]: s = (10, 11, 12, 13)
   ...: a = numpy.arange(numpy.prod(s)).reshape(s)

In [4]: %timeit get_buffer_data(a)
628 ns ± 2.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)
764 ns ± 4.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)
1.49 µs ± 4.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
CuPy after:
In [1]: import cupy
   ...: import numpy
   ...: 
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes

In [2]: def run(func, *args, **kwargs):
   ...:     try:
   ...:         func(*args, **kwargs)
   ...:     except Exception:
   ...:         pass
   ...: 

In [3]: s = (10, 11, 12, 13)
   ...: a = cupy.arange(cupy.prod(cupy.array(s))).reshape(s)

In [4]: %timeit get_buffer_data(a)
877 ns ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)
1.3 µs ± 1.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)
2.34 µs ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@quasiben (Member)

I know I'm late but this is a really nice performance improvement!
