[FEA] Dynamic spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes #635
Comments
I had already started doing this work (just started, not much done yet), but I think it is better to hear from the NVIDIA experts first.
I do like the idea of allowing more tasks through based on the amount of free memory available in the system, but I have a few concerns with the approach.

The first is fragmentation, which we are working hard to find solutions to. Having 5GB free does not mean that all 5GB are usable by another task. There has been discussion about exposing a metric from RMM, something like the largest possible allocation, to give us some idea of how bad fragmentation is. That way we could take it into account too.

Second, we currently use a BufferManager as part of the UCX shuffle work, and we plan to use it more in the future for general processing, so we can hold more data on the GPU but only spill it when we have to. You would have to find a way to take this into account when deciding what can go on the GPU and what cannot. What is more, because we try to fill up the GPU, it can also interfere with our measure of fragmentation, so we might need some statistics from the BufferManager about what is spillable and how big those buffers are.

My final concern is that memory allocation is all or nothing in many cases, and predicting how much working memory is needed can be very hard. For example, consider a join in the plan. In one case nothing matches and there is no output; in another everything matches badly and we get a combinatorial explosion. If there is not enough memory free for the latter case, the task fails. If we are constantly pushing the GPU to the limits of what it can do, we are going to hit these types of situations more and more.
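A back-of-the-envelope sketch of the headroom estimate these concerns imply: raw free bytes overstate what a new task can use (fragmentation), while spillable buffers understate it (they can be evicted on demand). All names here (`usableBytes`, `largestFreeBlock`, `spillableBytes`) are hypothetical illustrations, not spark-rapids or RMM APIs.

```scala
// Hypothetical headroom estimate combining the concerns above.
object GpuHeadroom {
  /** Estimate bytes a new task could realistically allocate. */
  def usableBytes(freeBytes: Long,
                  largestFreeBlock: Long,  // proxy for how fragmented the pool is
                  spillableBytes: Long): Long = {
    // Contiguous allocations are bounded by the largest free block, but
    // spillable buffers could be evicted on demand to extend that bound.
    math.min(freeBytes, largestFreeBlock + spillableBytes)
  }
}
```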
@JustPlay that issue alone does not solve fragmentation problems. It's designed to target the particularly bad fragmentation that can occur when using multiple streams, as without it each stream is essentially fragmenting the space from the perspective of other streams. We have seen significant fragmentation in some cases with only a single stream (e.g.: cannot allocate 1.8GB when only 1.7GB of total GPU memory is allocated). Resolving that issue will not address those scenarios.
While I may add tracking of the largest available block to the pool memory resource, knowing the largest possible allocation requires syncing and merging all streams' free lists, which is not something you likely want to do until you fail to allocate.
No, I think you can.
Also, because the memory usage is changing all the time, this number will likely be out of date the instant it is reported. I am a bit skeptical that a point-in-time measurement is going to give us what we want, but at least it would be a starting point. We can iterate on rolling average memory usage and other things if we see some kind of improvement from the point-in-time metrics.
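If point-in-time readings do prove too noisy, a rolling estimate is cheap to add. A tiny sketch of the exponential-moving-average variant (the class and the `alpha` default are invented for illustration):

```scala
// Smooth point-in-time free-memory samples with an exponential moving
// average so a single instantly-stale reading carries less weight.
class FreeMemoryEma(alpha: Double = 0.2) {
  private var ema: Option[Double] = None

  /** Fold a new point-in-time sample (bytes free) into the rolling estimate. */
  def update(sampleBytes: Long): Double = {
    val next = ema match {
      case Some(prev) => alpha * sampleBytes + (1.0 - alpha) * prev
      case None       => sampleBytes.toDouble
    }
    ema = Some(next)
    next
  }
}
```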
I don't think you got my point. The largest current free block is not the same as the largest possible allocation. Knowing the latter requires merging all free lists to coalesce blocks from multiple streams.
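To make that distinction concrete, here is a toy model of per-stream free lists in Scala (real RMM free lists live in C++; the `Block` layout is invented for illustration). The largest block on any single stream is cheap to track, while the largest possible allocation only emerges after merging and coalescing the free lists across streams, which is why it implies synchronizing them.

```scala
// Toy model: each stream owns a free list of (offset, size) blocks.
case class Block(offset: Long, size: Long)

object FreeListDemo {
  // Largest block currently free on any single stream: cheap to track.
  def largestFreeBlock(freeLists: Seq[Seq[Block]]): Long =
    freeLists.flatten.map(_.size).foldLeft(0L)(math.max)

  // Largest possible allocation: merge all streams' free lists, then
  // coalesce adjacent blocks. In RMM this implies synchronizing streams.
  def largestPossibleAllocation(freeLists: Seq[Seq[Block]]): Long = {
    val merged = freeLists.flatten.sortBy(_.offset)
    val coalesced = merged.foldLeft(List.empty[Block]) {
      case (head :: tail, b) if head.offset + head.size == b.offset =>
        Block(head.offset, head.size + b.size) :: tail
      case (acc, b) => b :: acc
    }
    coalesced.map(_.size).foldLeft(0L)(math.max)
  }
}
```

For example, if two streams each hold a 1GB block that happen to be adjacent, `largestFreeBlock` reports 1GB while `largestPossibleAllocation` reports 2GB.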
Is your feature request related to a problem? Please describe.
(A). In the current spark-rapids (0.1 and 0.2), `spark.rapids.sql.concurrentGpuTasks` and `spark.rapids.sql.batchSizeBytes` are static across the whole query (and may even span multiple queries in some benchmark workloads). What's more, it is difficult for an end user to find good values, particularly when running multiple queries with very different runtime patterns.

(B). Aggressive values for these two configs may lead to OOM, but conservative values leave the GPU under-utilized and inefficient.

(C). Different queries, and even different query stages within the same query, have very different runtime patterns, so they need different concurrency and batch-size settings, but we cannot change these via `--conf` command-line options (e.g., when running the TPC-DS benchmark, we can only give one fixed set of `--conf` options when submitting the job with its 103 queries).

(D). So, rapids needs the ability to dynamically change these two options at runtime (transparently to users), which I call "Dynamic spark.rapids.sql.concurrentGpuTasks and spark.rapids.sql.batchSizeBytes".
Describe the solution you'd like
For `spark.rapids.sql.batchSizeBytes`, I think we can rely on the AQE framework: for each query stage, set a different value of `spark.rapids.sql.batchSizeBytes` based on the statistics from the upstream stage. (I do not have a more detailed design for this yet, but I think AQE is a good starting point; see the sketch below.)
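As a rough illustration of the AQE idea (not an actual spark-rapids API), a hypothetical `chooseBatchSize` could derive a per-stage value from the finished upstream stage's output bytes, clamped between a floor and a ceiling; the constants and the target-batches heuristic are assumptions for the sketch.

```scala
object DynamicBatchSize {
  val MinBatch: Long = 128L * 1024 * 1024        // 128 MiB floor
  val MaxBatch: Long = 2L * 1024 * 1024 * 1024   // 2 GiB ceiling

  /** Pick a per-stage batch size so each task splits its share of the
    * upstream stage's observed output into roughly targetBatchesPerTask
    * batches. */
  def chooseBatchSize(upstreamOutputBytes: Long,
                      numTasks: Int,
                      targetBatchesPerTask: Int = 4): Long = {
    val perTask = upstreamOutputBytes / math.max(1, numTasks)
    val raw = perTask / math.max(1, targetBatchesPerTask)
    math.min(MaxBatch, math.max(MinBatch, raw))
  }
}
```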
For `spark.rapids.sql.concurrentGpuTasks`, I think we can do something by extending the `GpuSemaphore` with a wait-list.
By using a dynamic `concurrentGpuTasks`, the user can set a higher concurrency for better performance on most queries, and for the corner-case queries that would OOM at that concurrency, rapids will dynamically lower it to minimize the OOM risk.

Based on my understanding of the rapids and rmm code, three things need to be done:
I think the introduced overhead will be negligible, because we can let a task first acquire the gpu-semaphore and then do the judgement, as in the sketch below.
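A minimal sketch of the wait-list idea under that "acquire first, then judge" scheme. The class, `freeGpuMemory`, and `perTaskBytes` are hypothetical stand-ins, not the real `GpuSemaphore`; a real integration would consult RMM and the BufferManager's spillable statistics rather than a fixed estimate.

```scala
import java.util.concurrent.locks.ReentrantLock

// Tasks queue on a lock-guarded gate and are admitted only while both a
// concurrency slot and an estimated amount of free GPU memory are available.
class DynamicGpuSemaphore(maxConcurrent: Int, perTaskBytes: Long) {
  private val lock = new ReentrantLock()
  private val available = lock.newCondition()
  private var active = 0

  // Placeholder: a real integration would query RMM headroom (plus whatever
  // the BufferManager could spill), not return a constant.
  private def freeGpuMemory(): Long = 8L * 1024 * 1024 * 1024

  def acquire(): Unit = {
    lock.lock()
    try {
      // Admit only if a slot is free AND the estimated headroom covers the task.
      while (active >= maxConcurrent || freeGpuMemory() < perTaskBytes) {
        available.await()
      }
      active += 1
    } finally lock.unlock()
  }

  def release(): Unit = {
    lock.lock()
    try {
      active -= 1
      available.signalAll()
    } finally lock.unlock()
  }
}
```

One caveat with this shape: waiters are only re-checked on `release()`, so a real implementation would also need to signal the condition whenever the allocator frees memory.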
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context, code examples, or references to existing implementations about the feature request here.