Added `SamplingResult` cdef class to return cupy "views" for PLC sampling algos instead of copying result data #2684

rlratzel · 2022-09-12T00:21:48Z

Summary

(partially) closes Add support for server GPU- client GPU data xfer GaaS#23
Added SamplingResult cython extension class that corresponds to the C API's cugraph_sample_result_t type, which removes a copy operation in sampling algos.
Added new testing APIs and unit tests to validate correctness and that memory is cleaned up as expected when all refs to results are removed (see test_neighborhood_sampling.py::test_sample_result)
Updated build.sh to use ninja by default.

Background

As part of a larger effort to provide GPU-GPU data transfer in cugraph_service to improve performance, the need to minimize redundant data copy operations was also something to address.

Prior to this change, PLC algos would copy their results from the C API device data structures into one or more cupy ndarrays, which allowed the algo to then delete the result data owned by the C API and leave the lifecycle management of the results returned to python to the python interpreter/GC. The copy operation not only slowed down sampling algo calls (slightly) but also caused spikes in GPU memory usage where two copies of results existed in GPU memory for a short time immediately after the copy but before the original was deleted.

This PR adds a cython extension class that corresponds to the C API's cugraph_sample_result_t type, which is used to pass ownership of the sampling results to python. The extension class calls cugraph_sample_result_free() when all references to the result data in python are removed and the garbage collector runs. The owning SamplingResult object allows PLC sampling algos to instead return cupy ndarray "views" for results instead of copying them to new cupy ndarrays.

New test APIs to create sampling result objects in C and PLC were added and used in the new tests to verify that the ownership of the result data was properly transferred to Python, and that the lifecycle management of results was done properly when clients keep references to data (eg. pass results to other parts of code for use long after the algo completes).

A benchmark was also added to verify the (minor) speedup. Runs 0001-2 were before the change, runs 0003-5 (red box) were after:

…still need to manage lifecycle of device memory, will probably need to do this differently.

…oming) new SampleResult type

… host arg types

…ating SamplingResult instances from host arrays.

…code cleanup, added docstrings, updated test_sample_result() to add more thorough assertions and extra reference test.

…0-device_array_pybuffer

…dated SG uniform_neighbor_sample test to add benchmark and check for leaks.

… is fixed, allowing CI to make any other failures more noticeable. This will be uncommented before the PR is moved out of Draft.

codecov-commenter · 2022-09-12T03:24:09Z

Codecov Report

Base: 60.04% // Head: 57.76% // Decreases project coverage by -2.27% ⚠️

Coverage data is based on head (7e80e59) compared to base (3eb2b40).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.10    #2684      +/-   ##
================================================
- Coverage         60.04%   57.76%   -2.28%     
================================================
  Files               111      111              
  Lines              6184     6069     -115     
================================================
- Hits               3713     3506     -207     
- Misses             2471     2563      +92

Impacted Files	Coverage Δ
python/cugraph/cugraph/structure/property_graph.py	`65.02% <0.00%> (-31.87%)`	⬇️
...n/pylibcugraph/pylibcugraph/utilities/api_tools.py	`62.50% <0.00%> (-15.63%)`	⬇️
python/cugraph/cugraph/link_analysis/hits.py	`78.26% <0.00%> (-5.62%)`	⬇️
...graph/cugraph/centrality/eigenvector_centrality.py	`90.47% <0.00%> (-3.47%)`	⬇️
...thon/cugraph/cugraph/centrality/katz_centrality.py	`85.71% <0.00%> (-3.42%)`	⬇️
python/cugraph/cugraph/sampling/node2vec.py	`78.78% <0.00%> (-3.04%)`	⬇️
python/cugraph/cugraph/structure/graph_classes.py	`78.46% <0.00%> (-1.54%)`	⬇️
python/cugraph/cugraph/traversal/sssp.py	`92.79% <0.00%> (-0.65%)`	⬇️
python/cugraph/cugraph/structure/number_map.py	`67.49% <0.00%> (-0.36%)`	⬇️
...ugraph/cugraph/dask/structure/mg_property_graph.py	`14.79% <0.00%> (-0.31%)`	⬇️
... and 7 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

rlratzel · 2022-09-12T15:33:11Z

I believe the CI test failures are related to the changes in this PR, so I'm looking into those.

…0-device_array_pybuffer

…g ninja in build.sh, added handling of NULL in create_cupy_array_view_for_device_ptr()

ChuckHastings · 2022-09-14T19:10:52Z

cpp/src/c_api/uniform_neighbor_sampling.cpp

+    auto p_handle = reinterpret_cast<cugraph::c_api::cugraph_resource_handle_t const*>(handle);
+    auto host_srcs =
+      reinterpret_cast<const cugraph::c_api::cugraph_type_erased_host_array_view_t*>(srcs);
+    auto new_device_srcs = new cugraph::c_api::cugraph_type_erased_device_array_t(


I would be inclined to do this a bit differently. The values new_device_srcs, new_device_dsts, new_device_wgts and new_device_counts that you're creating here are a potential for a memory leak, since they are not in RAII objects.

You could:

Do type dispatching and create these as rmm::device_uvector<appropriate_type> and then convert to the return value. This would mimic the pattern of the algorithm implementation.

Create these as std::unique_ptr values and release them when you convert to the return value

Create the return value as a std::unique_ptr (releasing upon return) with the new calls for these operators inside and then just use the values from the return object. We do know the sizes a priori

Instead of this suggestion, I changed the implementation to use other cugraph C API calls (as suggested below), which included using cugraph_type_erased_device_array_create_from_view() from #2712 . Let me know if you'd prefer I use the rmm:device_uvector/type dispatching approach instead.

cpp/include/cugraph_c/sampling_algorithms.h

cpp/src/c_api/uniform_neighbor_sampling.cpp

…0-device_array_pybuffer

…s an API intended for use in testing, changed implementation of cugraph_sampling_result_ctreate() to use new APIs fro creating and copying a device array from a device array view (which required pulling in cugraph_type_erased_device_array_create_from_view() from another dev branch), changed tests to use device array instead of host array, style check fixes.

…leak is being addressed.

cpp/src/c_api/uniform_neighbor_sampling.cpp

…ng new cugraph_sample_result_t, in order to guarantee they are cleaned up on error.

BradReesWork · 2022-09-27T13:39:17Z

is there, or should there, be a macro for if (error_code != CUGRAPH_SUCCESS) return error_code;?

…ckages in build/artifact dirs instead of what it thinks are better candidates from other channels. This may fix the problem of CI testing against some other version of cugraph instead of the version built in the PR.

…y error message in CI log.

galipremsagar · 2022-09-30T00:06:41Z

rerun tests

rlratzel · 2022-09-30T13:59:31Z

@gpucibot merge

In #2684, the CMake build system was switched from `make` to `ninja`, which broke these doc builds. This PR updates the scripts for the nightly docs builds to use a generic `cmake` build option, rather than a build-system specific invocation to fix the issue and prevent a similar issue from occurring in the future. **Note**: this PR targets `branch-22.10` since the the `22.10` docs aren't currently up-to-date because of this issue.

In #2684, the CMake build system was switched from `make` to `ninja`, which broke the nightly doc builds. This PR updates the scripts for the nightly doc builds to use a generic `cmake` build option, rather than a build-system specific invocation to fix the issue and prevent a similar issue from occurring in the future. **Note**: this PR targets `branch-22.10` since the the `22.10` docs aren't currently up-to-date because of this issue. Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Jordan Jacobelli (https://github.com/Ethyling)

…ure to allow for a client-side device to directly receive results (#2715) closes rapidsai/GaaS#23 ### Summary * Adds an API option to `uniform_neighbor_sample()` to allow clients to specify a client-side GPU (using device number) which will receive the results, eliminating the expensive server-side device-host copy in order to serialize and send via RPC. * Adds test(s) and benchmarks to cover the new feature * Also adds a `--pydevelop` option to `build.sh` for building python components that allow for making edits without re-installing (runs `setup.py` with the `develop` option) ### Details * This PR depends on #2684 ### Benchmarks New benchmarks have been added by this PR, with the results shown below. The results to take note of are in the red rectangles. `device=None` is the previous version using serialization and transfer from the server to the client using host memory (using Apache Thrift), and `device=0` is a GPU-GPU transfer using UCX-Py from the server on GPU1 to the client on GPU0. A typical data transfer for many GNN use cases involves ~1M values, which is what the highlighted benchmark simulates. For smaller transfers, the Thrift-based RPC is faster, but increases significantly as the payload increases (so much so that the largest size that could be run in a reasonable time for Thrift was 1e6, vs UCX-Py sizes of 2e9 still being much faster). UCX-Py seems to have some amount of fixed overhead, but the benefits are dramatic at larger sizes. ![image](https://user-images.githubusercontent.com/3039903/197084418-7a1b638f-c963-4aca-a091-43138bc33f80.png) Authors: - Rick Ratzel (https://github.com/rlratzel) Approvers: - Joseph Nke (https://github.com/jnke2016) - Alex Barghi (https://github.com/alexbarghi-nv) - Brad Rees (https://github.com/BradReesWork) URL: #2715

rlratzel added 9 commits September 9, 2022 07:39

WIP of creating a cupy array from a device buffer w/out copying, but …

48463bb

…still need to manage lifecycle of device memory, will probably need to do this differently.

Added cugraph_sample_result_create() to C API, added PLC test for (up…

ece5e05

…oming) new SampleResult type

Switched to const host array views for cugraph_sample_result_create()…

de13a81

… host arg types

WIP: adding cython def for a SamplingResult type, added utils for cre…

6462c74

…ating SamplingResult instances from host arrays.

WIP: Added SamplingResult cython class impl.

8fa9bc3

Initial passing test for SampleResult type.

a117814

Removed unused functions/changes added as part of prior WIP efforts, …

74e57ef

…code cleanup, added docstrings, updated test_sample_result() to add more thorough assertions and extra reference test.

Merge remote-tracking branch 'upstream/branch-22.10' into branch-22.1…

e5f8dfb

…0-device_array_pybuffer

Updated uniform_neighbor_sample() to use new SamplingResult class, up…

6041ac5

…dated SG uniform_neighbor_sample test to add benchmark and check for leaks.

rlratzel added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 12, 2022

rlratzel added this to the 22.10 milestone Sep 12, 2022

rlratzel self-assigned this Sep 12, 2022

rlratzel added 2 commits September 11, 2022 19:54

flake8 fixes, commented out known failing assertion until memory leak…

e57ba55

… is fixed, allowing CI to make any other failures more noticeable. This will be uncommented before the PR is moved out of Draft.

More style fixes.

fde1c2c

BradReesWork requested review from seunghwak, jnke2016 and ChuckHastings September 12, 2022 14:14

rlratzel added 3 commits September 12, 2022 11:34

Merge remote-tracking branch 'upstream/branch-22.10' into branch-22.1…

e136dc3

…0-device_array_pybuffer

Added ninja to dev conda envs to support new default behavior of usin…

813d779

…g ninja in build.sh, added handling of NULL in create_cupy_array_view_for_device_ptr()

Various updates to comments and help strings for clarity.

a0e027a

ChuckHastings reviewed Sep 14, 2022

View reviewed changes

rlratzel mentioned this pull request Sep 22, 2022

Adds API option to uniform_neighbor_sample() and UCX-Py infrastructure to allow for a client-side device to directly receive results #2715

Merged

rlratzel added 5 commits September 22, 2022 10:18

Merge remote-tracking branch 'upstream/branch-22.10' into branch-22.1…

f77b65b

…0-device_array_pybuffer

Merge remote-tracking branch 'upstream/branch-22.10' into branch-22.1…

af7f76e

…0-device_array_pybuffer

Merge remote-tracking branch 'upstream/branch-22.10' into branch-22.1…

81c9195

…0-device_array_pybuffer

Merge remote-tracking branch 'upstream/branch-22.10' into branch-22.1…

afc572c

…0-device_array_pybuffer

rlratzel marked this pull request as ready for review September 26, 2022 17:42

rlratzel requested review from a team as code owners September 26, 2022 17:42

Updated comment to indicate PR will not be in Draft while the memory …

2095117

…leak is being addressed.

ChuckHastings reviewed Sep 26, 2022

View reviewed changes

cpp/src/c_api/uniform_neighbor_sampling.cpp Show resolved Hide resolved

Updated to use unique_ptr to manage new device arrays prior to creati…

2d594ae

…ng new cugraph_sample_result_t, in order to guarantee they are cleaned up on error.

ChuckHastings approved these changes Sep 27, 2022

View reviewed changes

jjacobelli approved these changes Sep 27, 2022

View reviewed changes

BradReesWork approved these changes Sep 27, 2022

View reviewed changes

rlratzel added 5 commits September 27, 2022 17:56

Removing pyraft from gpu CI script (quick copy-paste from rapidsai#2752)

01537d4

Setting preserve_egg_dir: True in cugraph conda recipe as suggested b…

63b98dc

…y error message in CI log.

Added setuptools pin to PLC recipe to prevent CI build breakages.

50f8d64

Adding changes from rapidsai#2759

86043e4

galipremsagar approved these changes Sep 29, 2022

View reviewed changes

Added a "conda list" call to identify dependency versions for debugging.

7e80e59

Updated for compatibility with pandas 1.5.0

bff6057

rapids-bot bot merged commit a299645 into rapidsai:branch-22.10 Sep 30, 2022

This was referenced Sep 30, 2022

[REVIEW] Pin dask and distributed for release #2758

Merged

Removing pyraft from gpu CI script #2752

Closed

ajschmidt8 mentioned this pull request Oct 17, 2022

Fix docs builds [skip ci] #2814

Merged

rlratzel deleted the branch-22.10-device_array_pybuffer branch September 28, 2023 20:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added `SamplingResult` cdef class to return cupy "views" for PLC sampling algos instead of copying result data #2684

Added `SamplingResult` cdef class to return cupy "views" for PLC sampling algos instead of copying result data #2684

rlratzel commented Sep 12, 2022 •

edited

Loading

codecov-commenter commented Sep 12, 2022 •

edited

Loading

rlratzel commented Sep 12, 2022

ChuckHastings Sep 14, 2022

rlratzel Sep 24, 2022

BradReesWork commented Sep 27, 2022

galipremsagar commented Sep 30, 2022

rlratzel commented Sep 30, 2022

Added SamplingResult cdef class to return cupy "views" for PLC sampling algos instead of copying result data #2684

Added SamplingResult cdef class to return cupy "views" for PLC sampling algos instead of copying result data #2684

Conversation

rlratzel commented Sep 12, 2022 • edited Loading

Summary

Background

codecov-commenter commented Sep 12, 2022 • edited Loading

Codecov Report

rlratzel commented Sep 12, 2022

ChuckHastings Sep 14, 2022

Choose a reason for hiding this comment

rlratzel Sep 24, 2022

Choose a reason for hiding this comment

BradReesWork commented Sep 27, 2022

galipremsagar commented Sep 30, 2022

rlratzel commented Sep 30, 2022

Added `SamplingResult` cdef class to return cupy "views" for PLC sampling algos instead of copying result data #2684

Added `SamplingResult` cdef class to return cupy "views" for PLC sampling algos instead of copying result data #2684

rlratzel commented Sep 12, 2022 •

edited

Loading

codecov-commenter commented Sep 12, 2022 •

edited

Loading