
Add a memory resource based on cudaMallocAsync #4900

Merged — mzient merged 7 commits into NVIDIA:main on Jun 13, 2023

Conversation

@mzient (Contributor) commented on Jun 7, 2023

Category:

New feature (non-breaking change which adds functionality)

Description:

This change adds a new memory resource that uses cudaMallocAsync under the hood.
It can be enabled with an environment variable DALI_USE_CUDA_MALLOC_ASYNC.
Parsing of memory configuration variables is improved and detects contradictory settings.

Additionally, an unnecessary dependency on core/common.h was removed from some very basic header files and, where needed, replaced with the headers that are actually required.
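For illustration, such a resource essentially forwards allocation and deallocation to `cudaMallocAsync`/`cudaFreeAsync` (available since CUDA 11.2). The sketch below is a minimal approximation, not DALI's actual class; the class name and method signatures are illustrative assumptions:

```cpp
#include <cuda_runtime_api.h>
#include <cstddef>
#include <new>  // std::bad_alloc

// Minimal sketch of a cudaMallocAsync-backed device memory resource.
class cuda_malloc_async_sketch {
 public:
  // Stream-ordered allocation requires a device/driver with memory pool support.
  static bool is_supported(int device_id) {
    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, device_id);
    return supported != 0;
  }

  void *allocate(std::size_t bytes, cudaStream_t stream) {
    void *ptr = nullptr;
    if (cudaMallocAsync(&ptr, bytes, stream) != cudaSuccess)
      throw std::bad_alloc();  // allocation is ordered with respect to `stream`
    return ptr;
  }

  void deallocate(void *ptr, cudaStream_t stream) noexcept {
    cudaFreeAsync(ptr, stream);  // stream-ordered return to the driver's pool
  }
};
```

With the `DALI_USE_CUDA_MALLOC_ASYNC` environment variable set, the default device resource uses an allocator of this kind instead of DALI's internal pool.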

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@dali-automaton (Collaborator): CI MESSAGE: [8557789]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [8557789]: BUILD FAILED

@dali-automaton (Collaborator): CI MESSAGE: [8557915]: BUILD STARTED

Review thread on the updated documentation (quoted diff context):

VMM by setting ``DALI_USE_VMM=0``. This will cause ``cudaMalloc`` to be used as an upstream memory
resource for the internal memory pool.

Using ``cudaMallocAsync`` results in slightly slower execution, but it enables memory pool sharing
Reviewer (Contributor):

Suggested change
Using ``cudaMallocAsync`` results in slightly slower execution, but it enables memory pool sharing
Using ``cudaMallocAsync`` results in execution time dependent on the CUDA toolkit and driver version, but it enables memory pool sharing

I don't think we can claim for sure that it is slower in all cases.

@mzient (Author):

I can add some weasel word like "typically" and call it a day.
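For context on the pool-sharing point: with `cudaMallocAsync`, allocations are served from a driver-managed memory pool that every consumer in the process can draw from, instead of a private per-library pool. A hedged sketch of what that looks like at the CUDA runtime level (the release-threshold tuning is an assumption for illustration, not something this PR sets):

```cpp
#include <cuda_runtime_api.h>
#include <cstdint>

// Sketch: with cudaMallocAsync, all consumers in the process allocate from
// the device's default, driver-managed memory pool, so free memory is shared
// rather than held captive in per-library pools.
int main() {
  int device_id = 0;
  cudaSetDevice(device_id);

  cudaMemPool_t pool;
  cudaDeviceGetDefaultMemPool(&pool, device_id);

  // Optional tuning (assumption, not part of this PR): keep up to 1 GiB
  // cached in the pool instead of releasing it at stream synchronization.
  std::uint64_t threshold = 1ull << 30;
  cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

  cudaStream_t stream = 0;
  void *ptr = nullptr;
  cudaMallocAsync(&ptr, 64 << 20, stream);  // served from the shared pool
  cudaFreeAsync(ptr, stream);               // returned to the pool, stream-ordered
  cudaStreamSynchronize(stream);
  return 0;
}
```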


template <typename Rep, typename Period>
void print_time(std::ostream &os, std::chrono::duration<Rep, Period> time) {
  return format_time(seconds(time));
Reviewer (Contributor):

Return from void function?

@mzient (Author):

Nice catch.

}

template <typename Rep, typename Period>
void print_time(std::ostream &os, std::chrono::duration<Rep, Period> time) {
Reviewer (Contributor):

os is unused.

@mzient (Author):

The entire function is unused. Still, it seems quite useful; I'll move it to some test utils so we don't have to reinvent it.
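For reference, a fixed version of the helper presumably looks like this (a sketch; `format_time` and `seconds` are assumed to be the surrounding file's formatting and unit-conversion helpers):

```cpp
template <typename Rep, typename Period>
void print_time(std::ostream &os, std::chrono::duration<Rep, Period> time) {
  os << format_time(seconds(time));  // write to the stream; no value returned from void
}
```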

std::vector<std::thread> threads;
for (int tid = 0; tid < num_threads; tid++) {
  threads.emplace_back([&, tid]() {
    (void)tid; // Make clang shut up; I prefer to keep this explicitly captured by value, even
Reviewer (Contributor):

Suggested change
(void)tid; // Make clang shut up; I prefer to keep this explicitly captured by value, even
(void)tid; // Silences clang; I prefer to keep this explicitly captured by value, even

@mzient (Author):

~~Kilroy~~ Killjoy was here.

Reviewer (Contributor):

       \|||/
       (o o)
----ooO-(_)-Ooo--------
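As an aside, the pattern under discussion in compilable form (a sketch, not the PR's test code): the explicit by-value capture gives each thread its own `tid`, and the `(void)tid` cast marks it as used, so clang's `-Wunused-lambda-capture` stays quiet when the body doesn't otherwise touch it:

```cpp
#include <thread>
#include <vector>

int main() {
  const int num_threads = 4;
  std::vector<std::thread> threads;
  for (int tid = 0; tid < num_threads; tid++) {
    threads.emplace_back([&, tid]() {
      (void)tid;  // silences clang; keeps the explicit by-value capture
      // ... per-thread work indexed by tid would go here ...
    });
  }
  for (auto &t : threads)
    t.join();
  return 0;
}
```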

dali/core/mm/perf_test.cu — outdated review comment (resolved)
@JanuszL self-assigned this on Jun 7, 2023
@dali-automaton (Collaborator): CI MESSAGE: [8557915]: BUILD FAILED

mzient added 4 commits June 12, 2023 10:26
Signed-off-by: Michal Zientkiewicz <[email protected]>
@mzient force-pushed the cuda_malloc_async_resource branch from be19a11 to a27223d on June 12, 2023 11:00
@dali-automaton (Collaborator): CI MESSAGE: [8610900]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [8610900]: BUILD FAILED

@mzient force-pushed the cuda_malloc_async_resource branch from a27223d to 0dfb84a on June 12, 2023 11:39
@dali-automaton (Collaborator): CI MESSAGE: [8611162]: BUILD STARTED

@mzient force-pushed the cuda_malloc_async_resource branch from 7c1ed20 to c19feaf on June 12, 2023 11:43
dali/test/timing.h — outdated review comment (resolved)
@@ -235,6 +235,9 @@ inline std::shared_ptr<device_async_resource> CreateDefaultDeviceResource() {
  CUDA_CALL(cudaGetDevice(&device_id));
  if (MMEnv::get().use_cuda_malloc_async) {
#if CUDA_VERSION >= 11020
    if (!cuda_malloc_async_memory_resource::is_supported(device_id))
Reviewer (Contributor):

So if the user wants async but it is not supported, we want to fail verbosely, right? No silent fallbacks or the like.

@mzient (Author):

Yes - it's not enabled by default, so I think we can fail if the user specifically requests this method.
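For illustration, the fail-loudly behavior amounts to something like the sketch below (the function name, error type, and message are assumptions, not the PR's actual code):

```cpp
#include <cuda.h>              // CUDA_VERSION
#include <cuda_runtime_api.h>
#include <stdexcept>
#include <string>

// Sketch of the verbose-failure path when the user explicitly opts in.
void validate_cuda_malloc_async(int device_id) {
#if CUDA_VERSION >= 11020
  int supported = 0;
  cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, device_id);
  if (!supported) {
    // The user opted in via DALI_USE_CUDA_MALLOC_ASYNC, so a silent
    // fallback would mask a configuration problem.
    throw std::runtime_error("cudaMallocAsync requested, but device " +
                             std::to_string(device_id) +
                             " does not support stream-ordered allocation.");
  }
#else
  throw std::runtime_error("Built with CUDA < 11.2; cudaMallocAsync is not available.");
#endif
}
```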

@dali-automaton (Collaborator): CI MESSAGE: [8611162]: BUILD PASSED

@dali-automaton (Collaborator): CI MESSAGE: [8614458]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [8614458]: BUILD PASSED

@mzient merged commit a805ced into NVIDIA:main on Jun 13, 2023
@dali-automaton (Collaborator): CI MESSAGE: [8625577]: BUILD STARTED

@JanuszL mentioned this pull request on Sep 6, 2023
JanuszL pushed a commit to JanuszL/DALI that referenced this pull request Oct 13, 2023
* Performance test DALI allocator vs cudaMallocAsync.
* Add cuda_malloc_resource. Remove some unnecessary header dependency.
* Improve environment variable handling in default_resources.cc
* Update docs.

---------

Signed-off-by: Michal Zientkiewicz <[email protected]>