
Optimize buffer utility functions #546

Merged: jakirkham merged 33 commits into rapidsai:branch-0.15 from typ_get_buf_dat_nbytes on Aug 12, 2020

Conversation

@jakirkham (Member)

Rewrites the buffer utility functions so that Cython can optimize them further. This includes typing variables where possible, using for-loops that Cython can lower to plain C loops, making use of Cython memoryviews, and simplifying code where possible.
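To give a flavor of the memoryview piece, here is a minimal sketch (my own illustration with made-up names, not the code in this PR) of using a typed Cython memoryview to pull a pointer and byte count out of a buffer-protocol object:

# sketch.pyx -- illustrative only
from libc.stdint cimport uintptr_t

cpdef tuple host_buffer_info(object obj):
    # View the object's buffer as raw bytes; requires C-contiguous data.
    cdef const unsigned char[::1] mv = memoryview(obj).cast("B")
    # Indexing a typed memoryview compiles down to a plain pointer dereference.
    return <uintptr_t>&mv[0], mv.shape[0]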

cc @pentschev @madsbk @quasiben

@pentschev (Member) commented Jun 15, 2020

@jakirkham I'm getting some errors when I try the send/recv debug test:

$ UCX_TLS=tcp,sockcm,cuda_copy UCX_SOCKADDR_TLS_PRIORITY=sockcm python recv.py
'CUDA_VISIBLE_DEVICES'
ITER:  0
IndexError: Out of bounds on buffer access (axis 0)
Exception ignored in: 'ucp._libs.utils.get_buffer_nbytes'
IndexError: Out of bounds on buffer access (axis 0)
Traceback (most recent call last):
  File "recv.py", line 140, in <module>
    test_send_recv_cu(cupy)
  File "recv.py", line 113, in test_send_recv_cu
    client(env2, port, func)
  File "recv.py", line 58, in client
    asyncio.get_event_loop().run_until_complete(read())
  File "/datasets/pentschev/miniconda3/envs/rn-102-0.15.0b200615/lib/python3.7/asyncio/base_events.py", line 583, in run_until_complete
    return future.result()
  File "recv.py", line 47, in read
    frames, msg = await recv(ep)
  File "/datasets/pentschev/src/ucx-py/debug-tests/utils.py", line 104, in recv
    await ep.recv(frame)
  File "/datasets/pentschev/miniconda3/envs/rn-102-0.15.0b200615/lib/python3.7/site-packages/ucp/core.py", line 604, in recv
    ret = await ucx_api.tag_recv(self._ep, buffer, nbytes, tag, log_msg=log)
ucp.exceptions.UCXMsgTruncated: Comm Error "[Recv #005] ep: 0x7fcf28038048, tag: 0x68f5dfcc8a2d7c57, nbytes: 0, type: <class 'rmm._lib.device_buffer.DeviceBuffer'>": length mismatch: 472392 (got) != 0 (expected)

I'm guessing this has to do with some of the logic changes in get_buffer_nbytes, but I'm not sure whether this is ready to test or if you're still working on it.

@jakirkham (Member, Author)

Yeah, there aren't really any tests of this functionality, which makes it hard to tell what is expected of these functions. I think we should get some tests in first and then revisit.

@jakirkham (Member, Author)

I've added some tests in PR ( #549 ).

@jakirkham (Member, Author)

I wouldn't worry about reviewing this yet. We still seem to be getting failures at the moment. It would be good to get some tests in first ( #549 ) so we can better understand the origin of the failures.

@jakirkham changed the title from "WIP: Optimize buffer utility functions" to "Optimize buffer utility functions" on Aug 10, 2020
@jakirkham marked this pull request as ready for review on August 10, 2020 at 21:24
@jakirkham requested a review from a team as a code owner on August 10, 2020 at 21:24
@jakirkham (Member, Author)

We are seeing a test failure due to a recent change in Dask ( dask/distributed#4019 ). Submitted PR ( dask/distributed#4036 ) to fix that.

@jakirkham (Member, Author)

rerun tests

The `default` used in `.get(...)` is always `None`, so there is no need to specify it. Go ahead and drop it to simplify things.
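For illustration (not the actual diff; `iface` stands in for an array-interface dict):

# Before:
strides = iface.get("strides", None)
# After -- equivalent, since .get() already defaults to None:
strides = iface.get("strides")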
As calling `hasattr` is just as expensive as `getattr`, and slightly more expensive than plain attribute access, simply call `getattr` instead and handle the missing-attribute case with a `None` default.
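Roughly the following pattern (illustrative only; `obj` is a placeholder):

# Before: hasattr() plus a second lookup when the attribute exists.
if hasattr(obj, "__cuda_array_interface__"):
    iface = obj.__cuda_array_interface__
else:
    iface = None

# After: a single getattr() with a None fallback.
iface = getattr(obj, "__cuda_array_interface__", None)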
To simplify later checks and operations, go ahead and assign `strides`.
This was fixed in Numba 0.46+, and RAPIDS requires an even newer version of Numba, so we are safe to drop this workaround.
@jakirkham force-pushed the typ_get_buf_dat_nbytes branch from 37d8663 to 1115256 on August 11, 2020 at 05:15
@jakirkham (Member, Author)

I went back through and profiled things carefully. I noted that `hasattr` was costing us (more details in the OP of PR #573, since included here). So I changed the code to use `getattr` instead, which helped quite a bit. Then I made some other similar changes focusing just on the Python code.
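For reference, the kind of micro-benchmark that exposes this (illustrative; I'm not reproducing exact numbers here):

In [1]: import numpy
   ...: a = numpy.arange(5)

In [2]: %timeit a.__array_interface__ if hasattr(a, "__array_interface__") else None  # two lookups

In [3]: %timeit getattr(a, "__array_interface__", None)  # one lookup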

After that, I went back and re-added Cython types to the Python variables and converted comprehensions to for-loops, though I have skipped dealing with arrays. Arrays are a bit slower, but maybe not noticeably so given the overall runtime. They can be revisited in another PR, but I would prefer to skip them for now.
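As a rough sketch of that conversion (not the actual diff; the function name is made up):

# Before: a Python-level reduction over the shape.
#     nbytes = itemsize * functools.reduce(operator.mul, shape, 1)

# After: typed locals and a plain loop that Cython lowers to C.
cpdef Py_ssize_t nbytes_from_shape(tuple shape, Py_ssize_t itemsize):
    cdef Py_ssize_t i, nbytes = itemsize
    for i in range(len(shape)):
        nbytes *= shape[i]
    return nbytes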

With all of these changes in place, I went through and timed the handling of an N-D array (including the case where the C-contiguous check fails). The times before and after are included below. In short, all cases are more than 2x faster. That may not sound like much in absolute terms, but these functions already run in about 1 µs.

Runtime of get_buffer_data and get_buffer_nbytes before:
In [1]: import cupy 
   ...: import numpy 
   ...:  
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes                                

In [2]: def run(func, *args, **kwargs): 
   ...:     try: 
   ...:         func(*args, **kwargs) 
   ...:     except Exception: 
   ...:         pass 
   ...:                                                                                               

In [3]: s = (10, 11, 12, 13) 
   ...: a = numpy.arange(numpy.prod(s)).reshape(s)                                                    

In [4]: %timeit get_buffer_data(a)                                                                    
2.72 µs ± 8.47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)                          
3.7 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)                   
4.69 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Runtime of get_buffer_data and get_buffer_nbytes after:
In [1]: import cupy 
   ...: import numpy 
   ...:  
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes                                

In [2]: def run(func, *args, **kwargs): 
   ...:     try: 
   ...:         func(*args, **kwargs) 
   ...:     except Exception: 
   ...:         pass 
   ...:                                                                                               

In [3]: s = (10, 11, 12, 13) 
   ...: a = numpy.arange(numpy.prod(s)).reshape(s)                                                    

In [4]: %timeit get_buffer_data(a)                                                                    
1.18 µs ± 9.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)                          
1.66 µs ± 9.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)                   
2.33 µs ± 8.45 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Additionally, we fast-path objects through get_buffer_nbytes when their strides is None. CuPy, Numba, and NumPy already set strides to None when the data is C-contiguous, and serialization through Dask turns any non-contiguous array into a 1-D contiguous array. So any array that Dask gives us goes through the fast path.
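The fast path amounts to something like this (a sketch of the check, not the exact code; the function name is hypothetical):

import functools
import operator

def fast_nbytes(obj):
    iface = getattr(obj, "__cuda_array_interface__", None)
    if iface is not None and iface.get("strides") is None:
        # The exporter already declares the data C-contiguous, so skip the
        # per-dimension stride comparison and just compute the size.
        itemsize = int(iface["typestr"][2:])  # e.g. "<f8" -> 8
        return itemsize * functools.reduce(operator.mul, iface["shape"], 1)
    return None  # fall back to the full stride/contiguity handling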

This was not true of cuDF Buffers or RMM DeviceBuffers for historical reasons (CuPy and Numba compatibility issues). Those issues no longer apply, so I have submitted PRs ( rapidsai/cudf#5917 and rapidsai/rmm#477 ) that will fast-path these objects in the future.

In short, I think we've collectively gotten about as much out of this exercise as we can for the moment. We might be able to make further improvements by refactoring (the contiguity check could be separated from the size check). It's also possible a lightweight extension type could strike the right balance between extracting the array interface once and letting us query common properties cheaply as needed. That can probably be saved for a future PR.
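The extension-type idea might look something like this (purely a hypothetical sketch, limited to the CUDA array interface and C-contiguous data for brevity):

from libc.stdint cimport uintptr_t

cdef class BufferInfo:
    # Parse the interface dict once; later queries are plain C field reads.
    cdef readonly uintptr_t ptr
    cdef readonly Py_ssize_t nbytes

    def __cinit__(self, object obj):
        cdef dict iface = obj.__cuda_array_interface__
        cdef Py_ssize_t i, itemsize = int(iface["typestr"][2:])
        self.ptr = iface["data"][0]
        self.nbytes = itemsize
        for i in range(len(iface["shape"])):
            self.nbytes *= iface["shape"][i]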

NumPy is the main user of `__array_interface__`. However, NumPy also supports the Python buffer protocol, which is what is typically used for this purpose and which works easily with `memoryview`s, as we have done here. So drop the `__array_interface__` check and use `memoryview` for NumPy as well.
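In other words, roughly (illustrative; `obj` is a placeholder):

# Instead of reading obj.__array_interface__, go through the buffer protocol.
mv = memoryview(obj)          # works for NumPy arrays and anything buffer-like
nbytes = mv.nbytes            # total size in bytes
contiguous = mv.c_contiguous  # contiguity without inspecting strides by hand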
The contiguity check allowed F-contiguous arrays through. However, we don't support F-contiguous arrays in the array interface case, so update the check accordingly.
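A sketch of the stricter check (my own illustration with placeholder names, not the PR's code):

def is_c_contiguous(shape, strides, itemsize):
    # C order: the expected stride grows from the last dimension backwards.
    if strides is None:
        return True
    expected = itemsize
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim > 1 and stride != expected:
            return False
        expected *= dim
    return True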
@jakirkham (Member, Author)

Switched to using memoryview for NumPy. This lets us get rid of another attribute check (already shown to be expensive). We also now check __cuda_array_interface__ first, with the memoryview case as the fallback.

@jakirkham force-pushed the typ_get_buf_dat_nbytes branch from 33ced6a to 9f278e3 on August 11, 2020 at 19:33
@jakirkham force-pushed the typ_get_buf_dat_nbytes branch from 9f278e3 to 920c3b8 on August 11, 2020 at 20:02
As we are now `cimport`ing `get_buffer_data` and its return type is `uintptr_t`, these casts are no longer needed (with the exception of the `void*` cast), so drop the extra casts.
@jakirkham (Member, Author)

I've now also exposed the utility functions directly in Cython to get faster calls from other Cython code. The functions can still be imported and used from Python, so the Python code paths are unaffected.
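Mechanically, that pattern looks like this (a generic sketch with hypothetical names, not the project's actual declarations):

# utils.pxd -- declarations that other Cython modules can cimport
from libc.stdint cimport uintptr_t
cpdef uintptr_t get_ptr(object obj)

# utils.pyx -- implementation; still importable from plain Python
cpdef uintptr_t get_ptr(object obj):
    cdef const unsigned char[::1] mv = memoryview(obj).cast("B")
    return <uintptr_t>&mv[0]

# caller.pyx -- cimporting gives a direct C-level call instead of a Python call
from .utils cimport get_ptr

cdef void use(object buf):
    # The only cast still needed is to void* when handing the pointer to C.
    cdef void* p = <void*>get_ptr(buf)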

@madsbk (Member) left a comment

LGTM, nice work @jakirkham

@pentschev (Member) left a comment

LGTM, thanks @jakirkham!

@jakirkham merged commit 181764a into rapidsai:branch-0.15 on Aug 12, 2020
@jakirkham deleted the typ_get_buf_dat_nbytes branch on August 12, 2020 at 17:56
@jakirkham (Member, Author)

Adding the benchmark here from the new Conda packages...

NumPy after:
In [1]: import cupy
   ...: import numpy
   ...: 
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes

In [2]: def run(func, *args, **kwargs):
   ...:     try:
   ...:         func(*args, **kwargs)
   ...:     except Exception:
   ...:         pass
   ...: 

In [3]: s = (10, 11, 12, 13)
   ...: a = numpy.arange(numpy.prod(s)).reshape(s)

In [4]: %timeit get_buffer_data(a)
628 ns ± 2.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)
764 ns ± 4.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)
1.49 µs ± 4.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
CuPy after:
In [1]: import cupy
   ...: import numpy
   ...: 
   ...: from ucp._libs.utils import get_buffer_data, get_buffer_nbytes

In [2]: def run(func, *args, **kwargs):
   ...:     try:
   ...:         func(*args, **kwargs)
   ...:     except Exception:
   ...:         pass
   ...: 

In [3]: s = (10, 11, 12, 13)
   ...: a = cupy.arange(cupy.prod(cupy.array(s))).reshape(s)

In [4]: %timeit get_buffer_data(a)
877 ns ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit get_buffer_nbytes(a, check_min_size=None, cuda_support=True)
1.3 µs ± 1.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit run(get_buffer_nbytes, a.T, check_min_size=None, cuda_support=True)
2.34 µs ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@quasiben (Member)

I know I'm late but this is a really nice performance improvement!
