
Use managed memory for NDSH benchmarks #17039

Merged
14 commits merged into rapidsai:branch-24.12 on Oct 23, 2024

Conversation

karthikeyann (Contributor) commented Oct 10, 2024

Description

Fixes #16987
Use managed memory to generate the parquet data, and write the parquet data to a host buffer.
Replace the use of parquet_device_buffer with cuio_source_sink_pair.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann added the 2 - In Progress, tests, libcudf, Performance, improvement, and non-breaking labels on Oct 10, 2024
@karthikeyann self-assigned this on Oct 10, 2024
The github-actions bot added the CMake label on Oct 10, 2024
@karthikeyann marked this pull request as ready for review on October 10, 2024 05:12
@karthikeyann requested a review from a team as a code owner on October 10, 2024 05:12
// FIXME: the existing pool resource already reserves 50% of free device memory
auto old_mr = cudf::get_current_device_resource();
// TODO: release the old pool here and restore it after data generation?
auto managed_pool_mr = make_managed_pool();
cudf::set_current_device_resource(managed_pool_mr.get());
Contributor:

Why not pass the new mr all the way through instead of resetting the current one?
All the libcudf APIs should take an mr parameter now.
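
For illustration, a minimal sketch of what passing the mr explicitly could look like (cudf::sort is chosen arbitrarily here, and the exact trailing stream/mr parameters are an assumption that varies by release):

```cpp
// Hedged sketch, not code from this PR: allocate the returned table from an
// explicit managed mr instead of swapping the current device resource.
#include <cudf/sorting.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/default_stream.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

#include <memory>

std::unique_ptr<cudf::table> sort_with_explicit_mr(cudf::table_view input,
                                                   rmm::mr::managed_memory_resource& managed_mr)
{
  // The returned table is allocated from managed_mr (which the caller keeps alive
  // for the lifetime of the result); intermediate allocations still come from the
  // current device resource.
  return cudf::sort(input, {}, {}, cudf::get_default_stream(), managed_mr);
}
```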

Contributor:

Thanks @davidwendt. I've asked @karthikeyann to separate the MR used for data generation versus the one used for timed query runs. We need managed memory to avoid OOM in the generator, but we mostly care about async and pool for timed runs.

Contributor Author (karthikeyann) commented Oct 10, 2024:

@davidwendt This brings us to an old question. Right now, we can only pass an mr for the output values of a libcudf function. All intermediate allocations happen using cudf::get_current_device_resource(). So, if we are targeting larger-than-GPU-memory data, the intermediate allocations might run out of GPU memory when cudf::get_current_device_resource() is not managed memory. libcudf functions currently have no way to take an mr for intermediate allocations; it is set globally via cudf::set_current_device_resource. Hence cudf::set_current_device_resource is updated here. A sketch of this set/restore pattern follows.
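
A minimal sketch of that set/restore pattern (the helper, its name, and the 50% pool sizing are assumptions for illustration, not code from this PR, and it uses rmm's per-device resource API directly):

```cpp
// Make a managed pool the current device resource so that intermediate
// allocations inside libcudf calls can spill to managed memory, then restore
// the previous resource once data generation is finished.
#include <rmm/cuda_device.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

class scoped_managed_pool {
 public:
  scoped_managed_pool() : pool_{&managed_, rmm::percent_of_free_device_memory(50)}
  {
    old_mr_ = rmm::mr::get_current_device_resource();
    rmm::mr::set_current_device_resource(&pool_);
  }
  ~scoped_managed_pool() { rmm::mr::set_current_device_resource(old_mr_); }

 private:
  rmm::mr::managed_memory_resource managed_;
  rmm::mr::pool_memory_resource<rmm::mr::managed_memory_resource> pool_;
  rmm::mr::device_memory_resource* old_mr_{};
};

// Usage (hypothetical call name): intermediates allocated inside
// generate_parquet_data() now come from the managed pool; output mrs can still
// be passed explicitly per call.
// {
//   scoped_managed_pool guard;
//   generate_parquet_data(...);
// }
```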

Contributor:

Ok, I thought that might be the case but wanted to make sure. It seems like we should deliberately pass an mr for the returned objects, if only for illustration, to highlight that there are two mrs in play here.
Also, it may be worth adding a detailed comment in the code similar to what you responded here.

Contributor Author (karthikeyann):

I will add it to the comments.
If we find a way to control the mr used for intermediate allocations, that would be great.

Contributor Author (karthikeyann):

Also, if old_mr is a pooled memory resource, it has already reserved memory (50% of free device memory by default), which will cause the managed memory resource to spill more often. Ideally that memory could also be reclaimed until data generation is over. I am still working on this part; a shrink function on the pooled memory resource would be great, but it is not available right now, and releasing the pool's memory would be dangerous (all existing allocations would become dangling pointers). I am looking into other memory resources for this.

Contributor:

You could simply request that these benchmarks be run with a different default memory resource (via the rmm_mode command-line parameter) to start with; these benchmarks are not run in CI. I'm not sure it is worth circumventing the pool, since that logic would also need to account for the parameter setting the default to something other than pool in any case.

std::string rmm_mode{"pool"};

Perhaps you could even check the default mr somehow and, if it is set to pool, throw an exception. A rough sketch of such a check is below.
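
A hedged sketch of that check (detecting the resource via dynamic_cast is an assumption, not existing benchmark code, and the function name is hypothetical):

```cpp
// Fail fast if the current device resource is a plain pool over cuda memory,
// with a hint to pick a managed rmm_mode instead.
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <stdexcept>

void require_managed_capable_mr()
{
  using pool_over_cuda = rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>;
  if (dynamic_cast<pool_over_cuda*>(rmm::mr::get_current_device_resource()) != nullptr) {
    throw std::runtime_error(
      "NDS-H data generation may exceed device memory; rerun with rmm_mode=managed or managed_pool");
  }
}
```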

Contributor Author (karthikeyann):

I updated the code to create a managed pool if the existing mr is not managed or managed_pool (see the sketch after this comment).
The drawback is that when pool memory is used, which is the default, data generation may be slow. If that's not acceptable, the only way to fix it is to create a new nvbench_fixture for the NDSH benchmarks alone.

@GregoryKimball, can we limit rmm_mode to managed/managed_pool when running these benchmarks? If we can limit it to managed_pool only, no mr fix is required; we would just use managed_pool or managed mode on the CLI. Alternatively, if rmm_mode can be anything and we still want the data generator to be as fast as possible, we should fix this PR by creating a new nvbench_fixture.
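
A rough sketch of the fallback check described above (the function name is hypothetical; the real code may detect the resources differently):

```cpp
// Keep the current resource if it is already managed or a managed pool;
// otherwise data generation builds its own managed pool.
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

bool is_managed_or_managed_pool(rmm::mr::device_memory_resource* mr)
{
  using managed_pool = rmm::mr::pool_memory_resource<rmm::mr::managed_memory_resource>;
  return dynamic_cast<rmm::mr::managed_memory_resource*>(mr) != nullptr ||
         dynamic_cast<managed_pool*>(mr) != nullptr;
}
```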

Contributor Author (karthikeyann):

@GregoryKimball We can run up to SF=30 on a 48 GB GPU machine. Is that sufficient?
Can we merge this PR?

Contributor:

@karthikeyann and I had an offline discussion about the NDS-H-cpp benchmarks. We agreed to collect some data on the max scale factor for the simple single-MR case and the more complex two-MR case.

Contributor Author (karthikeyann) commented Oct 11, 2024:

This PR change had no effect on the runtime of the benchmarked query itself (tested with Q05).

The table below shows the overall runtime of a single run including data generation, e.g. ./benchmark --axis scale_factor=20 --run-once
Old = this PR.
New = a new nvbench fixture that delays creation of the rmm_mode resource until after data generation; data generation uses a managed pool.
new_ndsh_fixture.patch

| Benchmark        | Old Time SF=10 | New Time SF=10 | Old Time SF=20 | New Time SF=20 |
|------------------|----------------|----------------|----------------|----------------|
| NDSH_Q01_NVBENCH | 0m10.998s      | 0m10.798s      | 0m31.818s      | 0m18.388s      |
| NDSH_Q05_NVBENCH | 0m13.232s      | 0m12.652s      | 0m35.439s      | 0m22.404s      |
| NDSH_Q06_NVBENCH | 0m11.227s      | 0m11.009s      | 0m31.991s      | 0m18.114s      |
| NDSH_Q09_NVBENCH | 0m15.518s      | 0m14.674s      | 0m39.800s      | 0m26.957s      |
| NDSH_Q10_NVBENCH | 0m13.164s      | 0m12.669s      | 0m35.183s      | 0m22.056s      |

If the data generation time does not matter, we can merge this PR.

@karthikeyann added the 3 - Ready for Review and 4 - Needs Review labels and removed the 2 - In Progress label on Oct 17, 2024
Contributor (kingcrimsontianyu) left a comment:

Lgtm!

Contributor Author (karthikeyann) commented Oct 19, 2024:

@GregoryKimball

On a 48 GB GPU:

| Tested variant                                           | Q05 max SF |
|----------------------------------------------------------|------------|
| branch-24.12 (async)                                      | 16         |
| host side staging (async)                                 | 18         |
| hss + managed pool for datagen (async)                    | 30         |
| hss + chunked pq (async)                                  | 40         |
| hss + chunked pq (rmm pool)                                | 27         |
| hss + chunked pq + datagen=managed_pool (async or pool)   | 60         |

The final version in this PR is hss + chunked pq + datagen=managed_pool.
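
For reference, a rough sketch of what a chunked parquet write to a host buffer can look like (not the exact benchmark code; the function name and split points are illustrative):

```cpp
// Write the generated table to a host-memory sink in pieces with the chunked
// parquet writer, so only one piece is encoded on the device at a time,
// lowering peak device memory during the write.
#include <cudf/copying.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <vector>

void write_to_host_buffer(cudf::table_view tbl, std::vector<char>& host_buffer)
{
  auto sink   = cudf::io::sink_info(&host_buffer);
  auto opts   = cudf::io::chunked_parquet_writer_options::builder(sink).build();
  auto writer = cudf::io::parquet_chunked_writer(opts);

  // Two pieces for illustration; a real generator would pick splits based on size.
  std::vector<cudf::size_type> splits{tbl.num_rows() / 2};
  for (auto const& piece : cudf::split(tbl, splits)) {
    writer.write(piece);
  }
  writer.close();
}
```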

Contributor (GregoryKimball):

Thank you @karthikeyann for studying this. It looks like the chunked PQ writer has a big impact as well - thank you for identifying that.

I'm happy to proceed with the current state of hss + chunked pq + datagen=managed_pool, although hss + chunked pq is acceptable as well.

Contributor Author (karthikeyann):

/merge

@rapids-bot merged commit e7653a7 into rapidsai:branch-24.12 on Oct 23, 2024
102 checks passed
Development

Merging this pull request may close #16987: [FEA] Improve scaling of data generation in NDS-H-cpp benchmarks
5 participants