
Update cuDF merge benchmark #867

Merged: 14 commits into rapidsai:branch-0.27 on Aug 10, 2022

Conversation

@pentschev (Member) commented Aug 4, 2022

This adds several bugfixes and improvements:

  • Correctly assign CUDA_VISIBLE_DEVICES;
  • Create CUDA contexts before initializing UCX;
  • Ensure merge uses the "key" column and asserts the expected result size;
  • Allow running the benchmark for multiple iterations;
  • Allow running multiple warmup iterations;
  • Print results for each iteration;
  • Allow enabling garbage collection after each iteration (a rough sketch of the resulting iteration loop follows the list).
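
A rough sketch of the iteration structure these flags enable is below; the flag names (`warmup_iter`, `iter`, `collect_garbage`) and the `run_iteration` callable are hypothetical, not the benchmark's actual API.

```python
# Hypothetical sketch of the warmup/iteration loop described above; the actual
# argument names and helpers in cudf-merge.py may differ.
import gc
import time


def run_benchmark(args, run_iteration):
    # Warmup iterations are executed but their timings are discarded.
    for _ in range(args.warmup_iter):
        run_iteration()

    durations = []
    for i in range(args.iter):
        t0 = time.monotonic()
        run_iteration()
        durations.append(time.monotonic() - t0)
        # Print results for each iteration as they complete.
        print(f"Iteration {i}: {durations[-1]:.4f} s")

        if args.collect_garbage:
            # Optionally force a full GC cycle between iterations.
            gc.collect()

    return durations
```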

@pentschev pentschev marked this pull request as ready for review August 5, 2022 08:57
@pentschev pentschev requested a review from a team as a code owner August 5, 2022 08:57
@pentschev (Member Author) commented:

Since this benchmark is broken due to changes in CUDA context handling, it would be good if this still made 0.27 (22.08), despite the ongoing burndown. As this is only a benchmark, it should not impact the release in any way; it is not executed anywhere as part of that process.

@wence- (Contributor) commented Aug 5, 2022

> Since this benchmark is broken due to changes in CUDA context handling

Can you explain what you mean by this?

@wence- (Contributor) left a review comment:

Minor nitpicks, otherwise looks good.

Review threads (resolved): benchmarks/cudf-merge.py, ucp/utils.py
@pentschev (Member Author) commented:

>> Since this benchmark is broken due to changes in CUDA context handling

> Can you explain what you mean by this?

cuDF creates a CUDA context at import time, which wasn't the case in the past and is the reason we have rapidsai/dask-cuda#379 in Dask-CUDA: if we don't set CUDA_VISIBLE_DEVICES, a context will always be created on device 0. Added to that, we now rely on UCX to establish the IB<->GPU mapping automatically, but that requires the CUDA context to be created before `import ucp`.
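
A minimal sketch of that ordering constraint, assuming Numba is used to create the context; the benchmark's actual setup code may differ, and the device index is illustrative.

```python
import os

# Pin this process to one GPU before any CUDA library initializes; otherwise a
# context may end up on device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# Create the CUDA context first...
from numba import cuda

cuda.current_context()

# ...so UCX can pick the matching IB device when ucp initializes.
import ucp
```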

Change clock to `time.monotonic()`, to prevent issues with the clock going backwards on some systems.
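
For illustration, a minimal timing pattern using a monotonic clock; the workload below is a stand-in, not the benchmark's.

```python
import time

start = time.monotonic()
sum(range(1_000_000))  # stand-in for one benchmark iteration
elapsed = time.monotonic() - start  # never negative, unlike time.time() deltas
print(f"elapsed: {elapsed:.6f} s")
```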
@wence- (Contributor) commented Aug 5, 2022

FWIW, the only niggle I have left here is that the --chunks-per-dev argument is a bit misnamed. The total number of workers started up is chunks-per-dev * num_devices (so when running with all GPUs and chunks-per-dev > 1, we get errors because we try to build 2 (or more) RMM pools per device and things break).

@pentschev (Member Author) commented:

> FWIW, the only niggle I have left here is that the --chunks-per-dev argument is a bit misnamed. The total number of workers started up is chunks-per-dev * num_devices (so when running with all GPUs and chunks-per-dev > 1, we get errors because we try to build 2 (or more) RMM pools per device and things break).

The RMM pool being a problem is expected with the default value; that's when the user has to adjust --rmm-init-pool-size. So --chunks-per-dev still sounds reasonable; it would be misnamed if it were --chunks-per-process or --chunks-per-worker. Do you have a proposal on how to make this better?

@wence- (Contributor) commented Aug 5, 2022

>> FWIW, the only niggle I have left here is that the --chunks-per-dev argument is a bit misnamed. The total number of workers started up is chunks-per-dev * num_devices (so when running with all GPUs and chunks-per-dev > 1, we get errors because we try to build 2 (or more) RMM pools per device and things break).

> The RMM pool being a problem is expected with the default value; that's when the user has to adjust --rmm-init-pool-size. So --chunks-per-dev still sounds reasonable; it would be misnamed if it were --chunks-per-process or --chunks-per-worker. Do you have a proposal on how to make this better?

I had naively thought that we would get num_devices workers and chunks-per-dev would say "how many chunks does each worker have". But actually, each worker only ever has a single chunk, and we always have num_devices * chunks_per_device workers (i.e. we oversubscribe the GPUs).

Since this isn't a substantive change from the current behaviour, let's not pollute this PR with too many additional pieces (sorry, this is my fault for noticing!) and leave as-is for now.
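
As an aside, a toy sketch of the worker-count arithmetic discussed above; the function and variable names here are made up for illustration and are not part of the benchmark.

```python
def total_workers(num_devices: int, chunks_per_dev: int) -> int:
    # Each worker handles exactly one chunk, so any chunks_per_dev > 1 means
    # several worker processes (and RMM pools) share the same GPU.
    return num_devices * chunks_per_dev


# With 8 GPUs and --chunks-per-dev 2, 16 workers are spawned and each GPU must
# fit two RMM pools, hence --rmm-init-pool-size may need to be reduced.
print(total_workers(num_devices=8, chunks_per_dev=2))  # prints 16
```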

pentschev and others added 2 commits August 9, 2022 16:03
@pentschev (Member Author) commented:

I think all changes are in now. Could you check one last time and approve the PR, @wence-?

@wence- (Contributor) commented Aug 10, 2022

@gpucibot merge

@wence- (Contributor) commented Aug 10, 2022

I am happy to merge this, though does it need further approval to go to 0.27 (rather than 0.28)?

@pentschev (Member Author) commented:

> I am happy to merge this, though does it need further approval to go to 0.27 (rather than 0.28)?

UCX-Py doesn't follow exactly the same release conditions as the rest of RAPIDS, and given there are no other required reviewers, we should be good to merge. Also, gpucibot has no power here. :)

@wence- (Contributor) commented Aug 10, 2022

OK, let's wait for tests.

@pentschev (Member Author) commented:

For the very small benchmark in CI, the assertion failed:

$ python cudf-merge.py --chunks-per-dev 4 --chunk-size 10000 --rmm-init-pool-size 2097152
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/rn-220803/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/datasets/pentschev/miniconda3/envs/rn-220803/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/pentschev/miniconda3/envs/rn-220803/lib/python3.8/site-packages/ucp/utils.py", line 131, in _worker_process
    ret = loop.run_until_complete(run())
  File "/datasets/pentschev/miniconda3/envs/rn-220803/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/datasets/pentschev/miniconda3/envs/rn-220803/lib/python3.8/site-packages/ucp/utils.py", line 126, in run
    return await func(rank, eps, args)
  File "/datasets/pentschev/src/ucx-py/benchmarks/cudf-merge.py", line 216, in worker
    assert abs(len(ret) - expected_len) <= expected_len_err

The maximum allowed error was 30, but the actual deviation was 31. Since this is not critical, I increased the tolerance to 2% now.
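
For reference, a hypothetical sketch of the relaxed check; the concrete lengths and how the tolerance is derived in cudf-merge.py are assumptions for illustration.

```python
expected_len = 3000                      # illustrative expected merged length
expected_len_err = expected_len * 0.02   # 2% tolerance on the merged length

ret_len = 3031                           # illustrative actual merged length
assert abs(ret_len - expected_len) <= expected_len_err
```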

@pentschev (Member Author) commented:

And it seems I was wrong. We have to ask for approval.

@msadang (Contributor) left a review comment:

LGTM

@msadang msadang merged commit e9e81f8 into rapidsai:branch-0.27 Aug 10, 2022
@pentschev pentschev deleted the cudf-merge-update branch August 24, 2022 19:56