
Use managed memory for NDSH benchmarks #17039

Merged
14 commits merged into rapidsai:branch-24.12 on Oct 23, 2024

Conversation

karthikeyann (Contributor) commented Oct 10, 2024

Description

Fixes #16987
Use managed memory to generate the parquet data, and write the parquet data to a host buffer.
Replace the use of parquet_device_buffer with cuio_source_sink_pair.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann added the 2 - In Progress, tests, libcudf, Performance, improvement, and non-breaking labels on Oct 10, 2024
@karthikeyann self-assigned this on Oct 10, 2024
The github-actions bot added the CMake label on Oct 10, 2024
@karthikeyann marked this pull request as ready for review on October 10, 2024 05:12
@karthikeyann requested a review from a team as a code owner on October 10, 2024 05:12
// FIXME: the existing pool resource already reserves 50% of free device memory
auto old_mr = cudf::get_current_device_resource();
// TODO: release the old pool here and restore it after data generation?
auto managed_pool_mr = make_managed_pool();
cudf::set_current_device_resource(managed_pool_mr.get());
Contributor:

Why not pass the new mr all the way through instead of resetting the current one?
All the libcudf APIs should take an mr parameter now.
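
For illustration, a minimal sketch of what passing the mr explicitly could look like (cudf::sort is chosen arbitrarily here, and the exact trailing stream/mr parameters are an assumption that varies by release):

```cpp
// Hedged sketch, not code from this PR: allocate the returned table from an
// explicit managed mr instead of swapping the current device resource.
#include <cudf/sorting.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/default_stream.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

#include <memory>

std::unique_ptr<cudf::table> sort_with_explicit_mr(cudf::table_view input,
                                                   rmm::mr::managed_memory_resource& managed_mr)
{
  // The returned table is allocated from managed_mr (which the caller keeps alive
  // for the lifetime of the result); intermediate allocations still come from the
  // current device resource.
  return cudf::sort(input, {}, {}, cudf::get_default_stream(), managed_mr);
}
```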

Contributor:

Thanks @davidwendt. I've asked @karthikeyann to separate the MR used for data generation versus the one used for timed query runs. We need managed memory to avoid OOM in the generator, but we mostly care about async and pool for timed runs.

Contributor Author (karthikeyann) commented Oct 10, 2024:

@davidwendt This brings us to an old question. Right now, we can only pass an mr for the output values of a libcudf function. All intermediate allocations happen using cudf::get_current_device_resource(). So, if we are targeting larger-than-GPU-memory data, the intermediate allocations might run out of GPU memory when cudf::get_current_device_resource() is not managed memory. libcudf functions currently have no way to take an mr for intermediate allocations; it is set globally via cudf::set_current_device_resource. Hence cudf::set_current_device_resource is updated here. A sketch of this set/restore pattern follows.
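
A minimal sketch of that set/restore pattern (the helper, its name, and the 50% pool sizing are assumptions for illustration, not code from this PR, and it uses rmm's per-device resource API directly):

```cpp
// Make a managed pool the current device resource so that intermediate
// allocations inside libcudf calls can spill to managed memory, then restore
// the previous resource once data generation is finished.
#include <rmm/cuda_device.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

class scoped_managed_pool {
 public:
  scoped_managed_pool() : pool_{&managed_, rmm::percent_of_free_device_memory(50)}
  {
    old_mr_ = rmm::mr::get_current_device_resource();
    rmm::mr::set_current_device_resource(&pool_);
  }
  ~scoped_managed_pool() { rmm::mr::set_current_device_resource(old_mr_); }

 private:
  rmm::mr::managed_memory_resource managed_;
  rmm::mr::pool_memory_resource<rmm::mr::managed_memory_resource> pool_;
  rmm::mr::device_memory_resource* old_mr_{};
};

// Usage (hypothetical call name): intermediates allocated inside
// generate_parquet_data() now come from the managed pool; output mrs can still
// be passed explicitly per call.
// {
//   scoped_managed_pool guard;
//   generate_parquet_data(...);
// }
```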

Contributor:

Ok, I thought that might be the case but wanted to make sure. It seems like we should deliberately pass an mr for the returned objects, if only for illustration, to highlight that there are two mrs in play here.
Also, it may be worth adding a detailed comment in the code similar to what you responded here.

Contributor Author (karthikeyann):

I will add it to the comments.
If we find a way to control the mr used for intermediate allocations, that would be great.

Contributor Author (karthikeyann):

Also, if old_mr is a pooled memory resource, it has already reserved memory (50% of free device memory by default), which will cause the managed memory resource to spill more often. Ideally that memory could also be reclaimed until data generation is over. I am still working on this part; a shrink function on the pooled memory resource would be great, but it is not available right now, and releasing the pool's memory would be dangerous (all existing allocations would become dangling pointers). I am looking into other memory resources for this.

Contributor:

You could simply request that these benchmarks be run with a different default memory resource (via the rmm_mode command-line parameter) to start with; these benchmarks are not run in CI. I'm not sure it is worth circumventing the pool, since that logic would also need to account for the parameter setting the default to something other than pool in any case.

std::string rmm_mode{"pool"};

Perhaps you could even check the default mr somehow and, if it is set to pool, throw an exception. A rough sketch of such a check is below.
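
A hedged sketch of that check (detecting the resource via dynamic_cast is an assumption, not existing benchmark code, and the function name is hypothetical):

```cpp
// Fail fast if the current device resource is a plain pool over cuda memory,
// with a hint to pick a managed rmm_mode instead.
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <stdexcept>

void require_managed_capable_mr()
{
  using pool_over_cuda = rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>;
  if (dynamic_cast<pool_over_cuda*>(rmm::mr::get_current_device_resource()) != nullptr) {
    throw std::runtime_error(
      "NDS-H data generation may exceed device memory; rerun with rmm_mode=managed or managed_pool");
  }
}
```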

Contributor Author (karthikeyann):

I updated the code to create a managed pool if the existing mr is not managed or managed_pool (see the sketch after this comment).
The drawback is that when pool memory is used, which is the default, data generation may be slow. If that's not acceptable, the only way to fix it is to create a new nvbench_fixture for the NDSH benchmarks alone.

@GregoryKimball, can we limit rmm_mode to managed/managed_pool when running these benchmarks? If we can limit it to managed_pool only, no mr fix is required; we would just use managed_pool or managed mode on the CLI. Alternatively, if rmm_mode can be anything and we still want the data generator to be as fast as possible, we should fix this PR by creating a new nvbench_fixture.
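
A rough sketch of the fallback check described above (the function name is hypothetical; the real code may detect the resources differently):

```cpp
// Keep the current resource if it is already managed or a managed pool;
// otherwise data generation builds its own managed pool.
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

bool is_managed_or_managed_pool(rmm::mr::device_memory_resource* mr)
{
  using managed_pool = rmm::mr::pool_memory_resource<rmm::mr::managed_memory_resource>;
  return dynamic_cast<rmm::mr::managed_memory_resource*>(mr) != nullptr ||
         dynamic_cast<managed_pool*>(mr) != nullptr;
}
```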

Contributor Author (karthikeyann):

@GregoryKimball We can run up to SF=30 on a 48 GB GPU machine. Is that sufficient?
Can we merge this PR?

Contributor:

@karthikeyann and I had an offline discussion about the NDS-H-cpp benchmarks. We agreed to collect some data on the max scale factor for the simple single-MR case and the more complex two-MR case.

Contributor Author (karthikeyann) commented Oct 11, 2024:

This PR change had no effect on the runtime of the benchmarked query itself (tested with Q05).

The table below shows the overall runtime of a single run including data generation, e.g. ./benchmark --axis scale_factor=20 --run-once
Old = this PR.
New = a new nvbench fixture that delays creation of the rmm_mode resource until after data generation; data generation uses a managed pool.
new_ndsh_fixture.patch

| Benchmark        | Old Time SF=10 | New Time SF=10 | Old Time SF=20 | New Time SF=20 |
|------------------|----------------|----------------|----------------|----------------|
| NDSH_Q01_NVBENCH | 0m10.998s      | 0m10.798s      | 0m31.818s      | 0m18.388s      |
| NDSH_Q05_NVBENCH | 0m13.232s      | 0m12.652s      | 0m35.439s      | 0m22.404s      |
| NDSH_Q06_NVBENCH | 0m11.227s      | 0m11.009s      | 0m31.991s      | 0m18.114s      |
| NDSH_Q09_NVBENCH | 0m15.518s      | 0m14.674s      | 0m39.800s      | 0m26.957s      |
| NDSH_Q10_NVBENCH | 0m13.164s      | 0m12.669s      | 0m35.183s      | 0m22.056s      |

If the data generation time does not matter, we can merge this PR.

@karthikeyann added the 3 - Ready for Review and 4 - Needs Review labels and removed the 2 - In Progress label on Oct 17, 2024
Contributor (kingcrimsontianyu) left a comment:

Lgtm!

Contributor Author (karthikeyann) commented Oct 19, 2024:

@GregoryKimball

On a 48 GB GPU:

| Tested variant                                           | Q05 max SF |
|----------------------------------------------------------|------------|
| branch-24.12 (async)                                      | 16         |
| host side staging (async)                                 | 18         |
| hss + managed pool for datagen (async)                    | 30         |
| hss + chunked pq (async)                                  | 40         |
| hss + chunked pq (rmm pool)                                | 27         |
| hss + chunked pq + datagen=managed_pool (async or pool)   | 60         |

The final version in this PR is hss + chunked pq + datagen=managed_pool.
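
For reference, a rough sketch of what a chunked parquet write to a host buffer can look like (not the exact benchmark code; the function name and split points are illustrative):

```cpp
// Write the generated table to a host-memory sink in pieces with the chunked
// parquet writer, so only one piece is encoded on the device at a time,
// lowering peak device memory during the write.
#include <cudf/copying.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <vector>

void write_to_host_buffer(cudf::table_view tbl, std::vector<char>& host_buffer)
{
  auto sink   = cudf::io::sink_info(&host_buffer);
  auto opts   = cudf::io::chunked_parquet_writer_options::builder(sink).build();
  auto writer = cudf::io::parquet_chunked_writer(opts);

  // Two pieces for illustration; a real generator would pick splits based on size.
  std::vector<cudf::size_type> splits{tbl.num_rows() / 2};
  for (auto const& piece : cudf::split(tbl, splits)) {
    writer.write(piece);
  }
  writer.close();
}
```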

Contributor (GregoryKimball):

Thank you @karthikeyann for studying this. It looks like the chunked PQ writer has a big impact as well - thank you for identifying that.

I'm happy to proceed with the current state of hss + chunked pq + datagen=managed_pool, although hss + chunked pq is acceptable as well.

Contributor Author (karthikeyann):

/merge

@rapids-bot merged commit e7653a7 into rapidsai:branch-24.12 on Oct 23, 2024
102 checks passed
Development

Merging this pull request may close #16987: [FEA] Improve scaling of data generation in NDS-H-cpp benchmarks
5 participants