[FEA] Dynamic spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes #635

Open · JustPlay opened this issue Sep 1, 2020 · 8 comments
Labels: feature request (New feature or request)

JustPlay commented Sep 1, 2020

Is your feature request related to a problem? Please describe.
(A). In the current spark-rapids releases (0.1 and 0.2), spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes are static and apply across the whole query (and may even apply across multiple queries in some benchmark workloads). What's more, it is difficult for an end user to find good values, particularly when running multiple queries with very different runtime patterns.
(B). Aggressive values for these two configs may lead to OOM, but conservative values leave the GPU underutilized.
(C). Different queries, and even different query stages within the same query, have very different runtime patterns, so they need different concurrency and batch-size settings, but these cannot be changed via the --conf command-line options (e.g., when running the TPC-DS benchmark we can only pass one fixed set of --conf values when submitting the job with 103 queries).
(D). So spark-rapids needs the ability to change these two options dynamically at runtime (transparently to users), which I call dynamic spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes.

Describe the solution you'd like
For spark.rapids.sql.batchSizeBytes, I think we can rely on the AQE framework: for each query stage, set a different value of spark.rapids.sql.batchSizeBytes based on the statistics collected from the upstream stage. (I do not have a more detailed design for this yet, but I think AQE is a good starting point; a rough sketch of the kind of heuristic I have in mind follows.)
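As a loose illustration only (none of these names exist in spark-rapids; the stage statistics and the clamping bounds are assumptions of the sketch), the per-stage heuristic might look roughly like this:

```scala
// Minimal sketch of a per-stage batch-size heuristic, assuming AQE can hand us
// the upstream stage's output size. All names here are hypothetical.
object DynamicBatchSize {
  // Hypothetical container for statistics reported by the upstream query stage.
  case class UpstreamStageStats(outputBytes: Long, outputRows: Long)

  /**
   * Pick a batch size for the next stage: aim for `targetBatchesPerTask`
   * batches per task, clamped to [minBatchBytes, maxBatchBytes].
   */
  def chooseBatchSizeBytes(
      stats: UpstreamStageStats,
      numTasks: Int,
      targetBatchesPerTask: Int = 4,
      minBatchBytes: Long = 16L * 1024 * 1024,       // 16 MiB floor
      maxBatchBytes: Long = 2L * 1024 * 1024 * 1024  // 2 GiB ceiling
  ): Long = {
    val bytesPerTask = math.max(1L, stats.outputBytes / math.max(1, numTasks))
    val proposed = bytesPerTask / targetBatchesPerTask
    math.min(maxBatchBytes, math.max(minBatchBytes, proposed))
  }
}
```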

For spark.rapids.sql.concurrentGpuTasks, I think we can do something by extending the GpuSemaphore (a rough sketch follows the list below):

  1. When a task wants to acquire the GPU, dynamically decide whether it should be allowed onto the GPU based on GPU memory pressure (i.e. total memory, free memory, and the number of tasks currently running on the GPU, or some historical information).
  2. If the free GPU memory is judged to be sufficient, put the task on the GPU; if not, put it into a wait list.
  3. When another task releases the GPU or exits, wake up one or more of the tasks waiting in the wait list.
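A minimal sketch of such an admission gate plus wait list, assuming a hypothetical freeGpuMemory() accessor and a fixed per-task memory estimate (the real GpuSemaphore in spark-rapids is per-task and considerably more involved than this):

```scala
import java.util.concurrent.locks.ReentrantLock

// Sketch of a memory-pressure-aware GPU admission gate with a wait list.
// `freeGpuMemory` and `estimatedBytesPerTask` are assumptions for illustration;
// they are not existing spark-rapids APIs.
class PressureAwareGpuGate(
    maxConcurrentTasks: Int,
    estimatedBytesPerTask: Long,
    freeGpuMemory: () => Long) {

  private val lock = new ReentrantLock()
  private val canRun = lock.newCondition()
  private var running = 0

  private def admissible: Boolean =
    running < maxConcurrentTasks && freeGpuMemory() >= estimatedBytesPerTask

  /** Block until memory pressure allows this task onto the GPU. */
  def acquire(): Unit = {
    lock.lock()
    try {
      while (!admissible) {
        canRun.await() // the task effectively sits in the "wait list"
      }
      running += 1
    } finally {
      lock.unlock()
    }
  }

  /** Called when a task releases the GPU or finishes. */
  def release(): Unit = {
    lock.lock()
    try {
      running -= 1
      canRun.signalAll() // wake waiting tasks so they can re-check memory pressure
    } finally {
      lock.unlock()
    }
  }
}
```

Releasing signals all waiters rather than just one, since the memory freed by a finishing task may be enough to admit more than one waiting task.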

By using a dynamic concurrentGpuTasks, the user can set a higher concurrency for better performance on most queries, and for the few corner-case queries that would OOM at that concurrency, spark-rapids will dynamically limit the concurrency to minimize the OOM risk.

Based on my understanding of the spark-rapids and RMM code, three things need to be done (see the interface sketch after this list):

  1. Implement get_mem_info (or an equivalent API) in RMM's pool_memory_resource.
  2. Add some code to the RMM JNI bindings used by spark-rapids.
  3. Extend spark-rapids's GpuSemaphore (add some counters, a memory-pressure-based admission check, and a wait list).
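For steps 1 and 2, the Scala side only needs some interface through which the JNI layer reports pool state. The trait below is purely hypothetical (nothing like it exists in the current RMM Java bindings; exposing it is exactly what steps 1 and 2 would add), but it shows the information the extended GpuSemaphore from step 3 would consume:

```scala
// Hypothetical interface that the RMM JNI layer (steps 1 and 2) would back.
// None of these methods exist today; this only sketches the data the extended
// GpuSemaphore (step 3) would need.
trait GpuPoolMemoryInfo {
  /** Total bytes managed by the RMM pool. */
  def totalPoolBytes: Long
  /** Bytes currently free in the pool (ignoring fragmentation). */
  def freePoolBytes: Long
  /** Number of tasks currently holding the GPU semaphore. */
  def activeGpuTasks: Int
}
```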

I think the overhead this introduces will be negligible, because we can let a task first acquire the GPU semaphore and then do the admission check.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context, code examples, or references to existing implementations about the feature request here.

JustPlay commented Sep 1, 2020

I have already started on this work (just started, not much done yet), but I think it's better to hear from the NVIDIA experts first.

revans2 commented Sep 1, 2020

I do like the idea of allowing more tasks through based off of the amount of free memory available in the system, but I do have a few concerns with the approach.

The first is fragmentation, which we are working hard to find solutions to. Having 5GB free does not mean that all 5GB are usable by another task. There has been discussion about exposing a metric from RMM, something like largest possible allocation, to give us some kind of an idea about how bad fragmentation is. That way we could take that into account too.

Second, we currently use a BufferManager as part of the UCX shuffle work, and we plan on using it more in the future for general processing, so we can hold more data on the GPU and only spill it when we have to. You would have to find a way to take this into account when deciding what can go on the GPU and what cannot. What's more, because we try to fill up the GPU, it can also interfere with our measure of fragmentation, so we might need some statistics from the BufferManager about what is spillable and how big those buffers are.

My final concern is that memory allocation is all or nothing in many cases, and predicting how much operating memory is needed can be very hard. For example, there is a join in the plan. In one case nothing matches and there is no output. In another case everything matches badly and we get a combinatorial explosion. If there is not enough memory free for the latter case, the task fails. If we are constantly trying to push the GPU to the limits of what it can do, we are going to hit these types of situations more and more.

JustPlay commented Sep 2, 2020

rapidsai/rmm#536

jlowe commented Sep 2, 2020

@JustPlay that issue alone does not solve fragmentation problems. It's designed to target the particularly bad fragmentation that can occur when using multiple streams, as without it each stream is essentially fragmenting the space from the perspective of other streams.

We have seen significant fragmentation in some cases with only a single stream (e.g.: cannot allocate 1.8GB when only 1.7GB of total GPU memory is allocated). Resolving that issue will not address those scenarios.

harrism commented Sep 3, 2020

> There has been discussion about exposing a metric from RMM, something like largest possible allocation, to give us some kind of an idea about how bad fragmentation is. That way we could take that into account too.

While I may add tracking of the largest available block to the pool memory resource, knowing the largest possible allocation requires syncing and merging all streams' free lists, which is not what you likely want to do until you fail to allocate.

JustPlay commented Sep 4, 2020

> There has been discussion about exposing a metric from RMM, something like largest possible allocation, to give us some kind of an idea about how bad fragmentation is. That way we could take that into account too.

> While I may add tracking of the largest available block to the pool memory resource, knowing the largest possible allocation requires syncing and merging all streams' free lists, which is not what you likely want to do until you fail to allocate.

No, I think you can track the largest available block(s) in the pool memory resource without doing a stream sync (it would only tell the user that one or more such blocks exist, not guarantee that they are immediately usable), and let the sync happen at actual allocation time (just let the task wait on the stream).

revans2 commented Sep 4, 2020

Also, because memory usage is changing all the time, this number will likely be out of date the instant it is reported. I am a bit skeptical that a point-in-time measurement is going to give us what we want, but at least it would be a starting point. We can iterate on rolling average memory usage and other things if we see some kind of improvement from the point-in-time metrics.

harrism commented Sep 6, 2020

I don't think you got my point. The largest current free block is not the same as largest possible allocation. Knowing the latter requires merging all free lists to coalesce blocks from multiple streams.
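To make the distinction concrete, here is a toy model (not RMM's actual data structures): the largest block in the combined but un-coalesced free lists can be smaller than the largest allocation that becomes possible once blocks from different streams are merged and adjacent ones coalesced.

```scala
// Toy model of free lists to illustrate the point above. Blocks are
// (offset, size); adjacent blocks only coalesce after the lists are merged.
object LargestAllocationToy {
  case class Block(offset: Long, size: Long) { def end: Long = offset + size }

  def largestBlock(blocks: Seq[Block]): Long =
    if (blocks.isEmpty) 0L else blocks.map(_.size).max

  // Merge a sorted view of all blocks and coalesce the adjacent ones.
  def coalesce(blocks: Seq[Block]): Seq[Block] =
    blocks.sortBy(_.offset).foldLeft(Vector.empty[Block]) { (acc, b) =>
      acc.lastOption match {
        case Some(last) if last.end == b.offset =>
          acc.init :+ Block(last.offset, last.size + b.size)
        case _ => acc :+ b
      }
    }

  def main(args: Array[String]): Unit = {
    val streamA = Seq(Block(0L, 100L), Block(400L, 50L)) // stream A's free list
    val streamB = Seq(Block(100L, 200L))                 // stream B's free list
    println(largestBlock(streamA ++ streamB))            // 200: largest single free block
    println(largestBlock(coalesce(streamA ++ streamB)))  // 300: largest after coalescing
  }
}
```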
