[FEA] Dynamic spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes #635

Open · JustPlay opened this issue Sep 1, 2020 · 8 comments
Labels: feature request (New feature or request)

JustPlay commented Sep 1, 2020

Is your feature request related to a problem? Please describe.
(A). In the current spark-rapids releases (0.1 and 0.2), spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes are static and apply across the whole query (and may even apply across multiple queries in some benchmark workloads). What's more, it is difficult for an end user to find good values, particularly when running multiple queries with very different runtime patterns.
(B). Aggressive values for these two configs may lead to OOM, but conservative values leave the GPU underutilized.
(C). Different queries, and even different query stages within the same query, have very different runtime patterns, so they need different concurrency and batch-size settings, but these cannot be changed via the --conf command-line options (e.g., when running the TPC-DS benchmark we can only pass one fixed set of --conf values when submitting the job with 103 queries).
(D). So spark-rapids needs the ability to change these two options dynamically at runtime (transparently to users), which I call dynamic spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes.

Describe the solution you'd like
For spark.rapids.sql.batchSizeBytes, I think we can rely on the AQE framework: for each query stage, set a different value of spark.rapids.sql.batchSizeBytes based on the statistics collected from the upstream stage. (I do not have a more detailed design for this yet, but I think AQE is a good starting point; a rough sketch of the kind of heuristic I have in mind follows.)
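As a loose illustration only (none of these names exist in spark-rapids; the stage statistics and the clamping bounds are assumptions of the sketch), the per-stage heuristic might look roughly like this:

```scala
// Minimal sketch of a per-stage batch-size heuristic, assuming AQE can hand us
// the upstream stage's output size. All names here are hypothetical.
object DynamicBatchSize {
  // Hypothetical container for statistics reported by the upstream query stage.
  case class UpstreamStageStats(outputBytes: Long, outputRows: Long)

  /**
   * Pick a batch size for the next stage: aim for `targetBatchesPerTask`
   * batches per task, clamped to [minBatchBytes, maxBatchBytes].
   */
  def chooseBatchSizeBytes(
      stats: UpstreamStageStats,
      numTasks: Int,
      targetBatchesPerTask: Int = 4,
      minBatchBytes: Long = 16L * 1024 * 1024,       // 16 MiB floor
      maxBatchBytes: Long = 2L * 1024 * 1024 * 1024  // 2 GiB ceiling
  ): Long = {
    val bytesPerTask = math.max(1L, stats.outputBytes / math.max(1, numTasks))
    val proposed = bytesPerTask / targetBatchesPerTask
    math.min(maxBatchBytes, math.max(minBatchBytes, proposed))
  }
}
```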

For spark.rapids.sql.concurrentGpuTasks, I think we can do something by extending the GpuSemaphore (a rough sketch follows the list below):

  1. When a task wants to acquire the GPU, dynamically decide whether it should be allowed onto the GPU based on GPU memory pressure (i.e. total memory, free memory, and the number of tasks currently running on the GPU, or some historical information).
  2. If the free GPU memory is judged to be sufficient, put the task on the GPU; if not, put it into a wait list.
  3. When another task releases the GPU or exits, wake up one or more of the tasks waiting in the wait list.
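A minimal sketch of such an admission gate plus wait list, assuming a hypothetical freeGpuMemory() accessor and a fixed per-task memory estimate (the real GpuSemaphore in spark-rapids is per-task and considerably more involved than this):

```scala
import java.util.concurrent.locks.ReentrantLock

// Sketch of a memory-pressure-aware GPU admission gate with a wait list.
// `freeGpuMemory` and `estimatedBytesPerTask` are assumptions for illustration;
// they are not existing spark-rapids APIs.
class PressureAwareGpuGate(
    maxConcurrentTasks: Int,
    estimatedBytesPerTask: Long,
    freeGpuMemory: () => Long) {

  private val lock = new ReentrantLock()
  private val canRun = lock.newCondition()
  private var running = 0

  private def admissible: Boolean =
    running < maxConcurrentTasks && freeGpuMemory() >= estimatedBytesPerTask

  /** Block until memory pressure allows this task onto the GPU. */
  def acquire(): Unit = {
    lock.lock()
    try {
      while (!admissible) {
        canRun.await() // the task effectively sits in the "wait list"
      }
      running += 1
    } finally {
      lock.unlock()
    }
  }

  /** Called when a task releases the GPU or finishes. */
  def release(): Unit = {
    lock.lock()
    try {
      running -= 1
      canRun.signalAll() // wake waiting tasks so they can re-check memory pressure
    } finally {
      lock.unlock()
    }
  }
}
```

Releasing signals all waiters rather than just one, since the memory freed by a finishing task may be enough to admit more than one waiting task.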

By using a dynamic concurrentGpuTasks, the user can set a higher concurrency for better performance on most queries, and for the few corner-case queries that would OOM at that concurrency, spark-rapids will dynamically limit the concurrency to minimize the OOM risk.

Based on my understanding of the spark-rapids and RMM code, three things need to be done (see the interface sketch after this list):

  1. Implement get_mem_info (or an equivalent API) in RMM's pool_memory_resource.
  2. Add some code to the RMM JNI bindings used by spark-rapids.
  3. Extend spark-rapids's GpuSemaphore (add some counters, a memory-pressure-based admission check, and a wait list).
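For steps 1 and 2, the Scala side only needs some interface through which the JNI layer reports pool state. The trait below is purely hypothetical (nothing like it exists in the current RMM Java bindings; exposing it is exactly what steps 1 and 2 would add), but it shows the information the extended GpuSemaphore from step 3 would consume:

```scala
// Hypothetical interface that the RMM JNI layer (steps 1 and 2) would back.
// None of these methods exist today; this only sketches the data the extended
// GpuSemaphore (step 3) would need.
trait GpuPoolMemoryInfo {
  /** Total bytes managed by the RMM pool. */
  def totalPoolBytes: Long
  /** Bytes currently free in the pool (ignoring fragmentation). */
  def freePoolBytes: Long
  /** Number of tasks currently holding the GPU semaphore. */
  def activeGpuTasks: Int
}
```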

I think the overhead this introduces will be negligible, because we can let a task first acquire the GPU semaphore and then do the admission check.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context, code examples, or references to existing implementations about the feature request here.

JustPlay commented Sep 1, 2020

I have already started on this work (just started, not much done yet), but I think it's better to hear from the NVIDIA experts first.

revans2 commented Sep 1, 2020

I do like the idea of allowing more tasks through based off of the amount of free memory available in the system, but I do have a few concerns with the approach.

The first is fragmentation, which we are working hard to find solutions to. Having 5GB free does not mean that all 5GB are usable by another task. There has been discussion about exposing a metric from RMM, something like largest possible allocation, to give us some kind of an idea about how bad fragmentation is. That way we could take that into account too.

Second, we currently use a BufferManager as part of the UCX shuffle work, and we plan on using it more in the future for general processing, so we can hold more data on the GPU and only spill it when we have to. You would have to find a way to take this into account when deciding what can go on the GPU and what cannot. What's more, because we try to fill up the GPU, it can also interfere with our measure of fragmentation, so we might need some statistics from the BufferManager about what is spillable and how big those buffers are.

My final concern is that memory allocation is all or nothing in many cases, and predicting how much operating memory is needed can be very hard. For example, there is a join in the plan. In one case nothing matches and there is no output. In another case everything matches badly and we get a combinatorial explosion. If there is not enough memory free for the latter case, the task fails. If we are constantly trying to push the GPU to the limits of what it can do, we are going to hit these types of situations more and more.

JustPlay commented Sep 2, 2020

rapidsai/rmm#536

jlowe commented Sep 2, 2020

@JustPlay that issue alone does not solve fragmentation problems. It's designed to target the particularly bad fragmentation that can occur when using multiple streams, as without it each stream is essentially fragmenting the space from the perspective of other streams.

We have seen significant fragmentation in some cases with only a single stream (e.g.: cannot allocate 1.8GB when only 1.7GB of total GPU memory is allocated). Resolving that issue will not address those scenarios.

harrism commented Sep 3, 2020

> There has been discussion about exposing a metric from RMM, something like largest possible allocation, to give us some kind of an idea about how bad fragmentation is. That way we could take that into account too.

While I may add tracking of the largest available block to the pool memory resource, knowing the largest possible allocation requires syncing and merging all streams' free lists, which is not what you likely want to do until you fail to allocate.

JustPlay commented Sep 4, 2020

> There has been discussion about exposing a metric from RMM, something like largest possible allocation, to give us some kind of an idea about how bad fragmentation is. That way we could take that into account too.

> While I may add tracking of the largest available block to the pool memory resource, knowing the largest possible allocation requires syncing and merging all streams' free lists, which is not what you likely want to do until you fail to allocate.

No, I think you can track the largest available block(s) in the pool memory resource without doing a stream sync (it would only tell the user that one or more such blocks exist, not guarantee that they are immediately usable), and let the sync happen at actual allocation time (just let the task wait on the stream).

revans2 commented Sep 4, 2020

Also, because memory usage is changing all the time, this number will likely be out of date the instant it is reported. I am a bit skeptical that a point-in-time measurement is going to give us what we want, but at least it would be a starting point. We can iterate on rolling average memory usage and other things if we see some kind of improvement from the point-in-time metrics.

harrism commented Sep 6, 2020

I don't think you got my point. The largest current free block is not the same as largest possible allocation. Knowing the latter requires merging all free lists to coalesce blocks from multiple streams.
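To make the distinction concrete, here is a toy model (not RMM's actual data structures): the largest block in the combined but un-coalesced free lists can be smaller than the largest allocation that becomes possible once blocks from different streams are merged and adjacent ones coalesced.

```scala
// Toy model of free lists to illustrate the point above. Blocks are
// (offset, size); adjacent blocks only coalesce after the lists are merged.
object LargestAllocationToy {
  case class Block(offset: Long, size: Long) { def end: Long = offset + size }

  def largestBlock(blocks: Seq[Block]): Long =
    if (blocks.isEmpty) 0L else blocks.map(_.size).max

  // Merge a sorted view of all blocks and coalesce the adjacent ones.
  def coalesce(blocks: Seq[Block]): Seq[Block] =
    blocks.sortBy(_.offset).foldLeft(Vector.empty[Block]) { (acc, b) =>
      acc.lastOption match {
        case Some(last) if last.end == b.offset =>
          acc.init :+ Block(last.offset, last.size + b.size)
        case _ => acc :+ b
      }
    }

  def main(args: Array[String]): Unit = {
    val streamA = Seq(Block(0L, 100L), Block(400L, 50L)) // stream A's free list
    val streamB = Seq(Block(100L, 200L))                 // stream B's free list
    println(largestBlock(streamA ++ streamB))            // 200: largest single free block
    println(largestBlock(coalesce(streamA ++ streamB)))  // 300: largest after coalescing
  }
}
```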
