
[FEA] Allow concurrentGpuTasks and possibly other configs to be dynamically set #1399

Closed
tgravescs opened this issue Dec 15, 2020 · 5 comments
Labels
feature request New feature or request

Comments

@tgravescs (Collaborator)

Is your feature request related to a problem? Please describe.
Currently some of our configs can't always be dynamically changed. For instance, spark.rapids.sql.concurrentGpuTasks is read and initialized on executor startup, so generally it has to be set at startup. If you are using dynamic allocation, newly launched executors may pick up changes to it, which could be very confusing. There are other configs like this, such as the GPU memory percentage.

Some environments, like EMR and Databricks, make this hard to change because you have to modify init scripts or restart entire clusters. So we should investigate a way to dynamically change these configs. For instance, re-read the config value periodically and check whether it has changed. We then have to take into account whether it is getting larger or smaller, since making it smaller may require draining in-flight work first.

@tgravescs tgravescs added feature request New feature or request ? - Needs Triage Need team to review and classify labels Dec 15, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 15, 2020
@chenrui17
#635

@jlowe (Contributor)

jlowe commented Jan 4, 2021

@chenrui17 the headlines imply these two are the same, but I think this is a bit different from what is being proposed in #635. The other issue is asking for automatic, dynamic scaling of the concurrent tasks based on GPU memory usage. This issue instead is asking for something that should be much simpler to implement: the ability to manually adjust the concurrent tasks setting at runtime. If #635 can be satisfied by implementing that, then I agree this can be closed as a duplicate.

@revans2 (Collaborator)

revans2 commented Jan 17, 2023

Because of the dupe I had an idea on how to implement this. #7521 (comment)

On paper this looks really hard, because the semaphore that implements this is shared by multiple tasks that may belong to different jobs. If one task asks for a parallelism of 2 and another task asks for a parallelism of 1, how do we make them both play nicely with each other? Looking at the semaphore API, I think we can make this work. If we think of each request as a percentage of the GPU instead of a hard-coded concurrency, then each request can acquire a proportional number of permits instead of just 1.

So for example, if we initially set the number of permits to 100 and a task asks for the semaphore with a concurrency of 1, it would request all 100 permits and return all 100 when it finishes. If a task has a concurrency of 3, it would request 33 of the permits.
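The proportional-permits idea above can be sketched roughly as follows. This is a minimal illustration in Python, not the plugin's actual implementation (spark-rapids is Scala and uses a different internal GpuSemaphore); the class and method names here are hypothetical, and it just shows the arithmetic: each task acquires `total // concurrency` permits, so a concurrency of 1 claims all 100 permits while a concurrency of 3 lets three tasks hold 33 each.

```python
import threading

TOTAL_PERMITS = 100  # the "100%" of the GPU described above

class ProportionalSemaphore:
    """Hypothetical sketch: tasks acquire a share of permits
    proportional to 1 / requested concurrency."""

    def __init__(self, total=TOTAL_PERMITS):
        self._total = total
        self._available = total
        self._cond = threading.Condition()

    def permits_for(self, concurrency):
        # concurrency=1 -> 100 permits, concurrency=3 -> 33 permits
        return self._total // concurrency

    def acquire(self, concurrency):
        """Block until this task's share of permits is available."""
        n = self.permits_for(concurrency)
        with self._cond:
            while self._available < n:
                self._cond.wait()
            self._available -= n
        return n  # caller must release() this many when done

    def release(self, n):
        with self._cond:
            self._available += n
            self._cond.notify_all()

# With concurrency 3, three tasks together hold 99 of 100 permits,
# so a fourth acquire(3) would block until one of them releases.
sem = ProportionalSemaphore()
held = [sem.acquire(3) for _ in range(3)]
```

Note that because 100 // 3 is 33, three tasks leave 1 permit unused; tasks requesting different concurrency levels simply claim different-sized shares of the same pool, which is how the two jobs in the example above coexist.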

The other part of this is that, in practice, we almost never have multiple different jobs trying to share the GPU, so conflicting concurrency requests are unlikely to be a problem.

@revans2 revans2 added the ? - Needs Triage Need team to review and classify label Jan 17, 2023
@revans2 (Collaborator)

revans2 commented Jan 17, 2023

Added Needs Triage so we can look at this again now that we have a customer that really needs this to go into production.

@revans2 revans2 removed the ? - Needs Triage Need team to review and classify label Jan 19, 2023
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
@jlowe (Contributor)

jlowe commented Jan 24, 2024

Fixed by #7527.

@jlowe jlowe closed this as completed Jan 24, 2024