Infinite loop of scaling in and out with HTEX #3696
Comments
That code is in What I'm interested in here, then, is why scale_in gets called. That gets called in several places in the scaling strategy, for example here: Line 256 in 92ab47f
Usually that happens when there isn't enough load (outstanding tasks) to need that number of blocks to exist. So maybe you can have a look at what is happening around there - there are a bunch of debug log messages in |
@benclifford I took a closer look and think I have tracked it down. In the logs I see,
i.e. there are more slots than tasks. The active slot count includes pending blocks, per lines 209 to 212 in 92ab47f.
However, blocks in the PENDING state are also considered for scale in, since `JobStates.PENDING` is not in `TERMINAL_STATES`: parsl/parsl/executors/high_throughput/executor.py Lines 715 to 722 in 92ab47f
So, parsl/parsl/executors/high_throughput/executor.py Lines 724 to 730 in 92ab47f
The pending block is then selected for scale in and cancelled. This is followed by a scale out and causes a loop that isn't resolved unless the block becomes available before the scaling strategy loop occurs again. |
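To illustrate the filtering described above, here is a minimal, self-contained sketch. It is not the Parsl source: the `JobStates` enum, the `TERMINAL_STATES` set, and the `scale_in_candidates` helper below are simplified stand-ins written only for this example, assuming that scale-in candidates are chosen by excluding blocks in a terminal state.

```python
from enum import Enum


class JobStates(Enum):
    # Stand-in enum for illustration; not imported from Parsl.
    PENDING = 1
    RUNNING = 2
    COMPLETED = 3
    FAILED = 4


# Assumed for this sketch: only finished states count as terminal.
TERMINAL_STATES = {JobStates.COMPLETED, JobStates.FAILED}


def scale_in_candidates(block_states):
    """Return the ids of blocks whose state is not terminal (hypothetical helper)."""
    return [
        block_id
        for block_id, state in block_states.items()
        if state not in TERMINAL_STATES
    ]


block_states = {
    "0": JobStates.RUNNING,
    "1": JobStates.PENDING,    # submitted but not yet started
    "2": JobStates.COMPLETED,
}

# The pending block "1" passes the filter alongside the running block "0".
candidates = scale_in_candidates(block_states)
print(candidates)
```

Under these assumptions a pending block is a valid scale-in candidate, which matches the behavior described in this comment.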
@stevenstetzler this should be converging towards having the "correct" number of slots though: if there are more slots than tasks then (modulo some rounding problem?) blocks should be cancelled - whether they are active or pending. The intention is that the number of slots (active or pending) converges towards the number of tasks pending or active. What does this log line say, which is the raw data for the "more slots than tasks" decision? logger.debug(f"Slot ratio calculation: active_slots = {active_slots}, active_tasks = {active_tasks}") Even better, can you send me a complete |
(the question I have is not about why pending blocks are scaled in - it is why any blocks are being scaled in, if there is enough task load to scale them out 5 seconds earlier) |
Here is the You will see
as the block gets scaled in and out. |
ok, that's interesting. looks more like it's oscillating around the convergence point (of 14 slots) rather than converging to a fixed number. let me see if I can reproduce this in a test locally. |
ok, here's what I think is a reproducer https://github.com/Parsl/parsl/tree/benc-3696 - there is some suspicious rounding in the code that chooses how to head towards the target number of blocks. I'll flesh out some more testing and hopefully it is then a simple fix. |
@stevenstetzler can you try out the fix in PR #3721? |
@benclifford I encountered the error in one workflow as a part of a large number of workflows for in-progress data processing. I've already executed the workflow using my hack fix and with my set up, I can't re-run the exact workflow again. Would you like me to test it out in another way? |
@stevenstetzler no worries if you can't easily reproduce it - I'm fairly certain about #3721 fixing some bug similar to what you are seeing. |
Describe the bug
I've encountered an infinite loop of scaling in and out with the high throughput executor. Blocks get scaled out only to be immediately scaled in as idle, followed by blocks getting scaled out again.
I believe the issue is due to the block info in the high throughput executor scale in logic having infinite idle time (per their initialization). This seems related to #3353 where I can imagine un-started blocks may not report an idle time.
The logic is here: https://github.com/Parsl/parsl/blob/2024.11.11/parsl/executors/high_throughput/executor.py#L706-L747
and this line in particular is the culprit: https://github.com/Parsl/parsl/blob/2024.11.11/parsl/executors/high_throughput/executor.py#L744
Note that `new_block_info()` initializes `idle` to `float('inf')`, and if that default value is used in the comparison with `max_idletime`, the comparison is always true and the block is appended to the block ids to kill. A one-line fix has fixed the behavior for me.
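A minimal sketch of the comparison problem described above, assuming a simplified block record. The `BlockInfo` dataclass and the `select_idle_blocks` helper are hypothetical names for this example and are not copied from the executor; only `new_block_info()`, `idle`, `float('inf')`, and `max_idletime` come from the description.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    # Hypothetical record for illustration: idle time in seconds.
    idle: float


def new_block_info() -> BlockInfo:
    # Per the description above, idle starts at infinity.
    return BlockInfo(idle=float("inf"))


def select_idle_blocks(block_info, max_idletime, blocks):
    """Pick up to `blocks` block ids whose idle time exceeds max_idletime."""
    to_kill = []
    for block_id, info in block_info.items():
        # float('inf') > max_idletime is True for any finite max_idletime,
        # so a block that never reported an idle time is always selected.
        if info.idle > max_idletime:
            to_kill.append(block_id)
            if len(to_kill) == blocks:
                break
    return to_kill


block_info = {
    "0": BlockInfo(idle=5.0),   # has reported a short idle time
    "1": new_block_info(),      # never reported; still at the default
}

selected = select_idle_blocks(block_info, max_idletime=120.0, blocks=1)
print(selected)
```

In this sketch the block that has not reported yet is the one chosen for scale in, even though the block that has been idle for 5 seconds is left alone.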
To Reproduce
I'm not sure how to reproduce this consistently, as it only appeared when moving to a new computing environment.
Expected behavior
I expect scaling to behave as normal.
Environment
Distributed Environment