Work Queue using Parsl scaling sometimes tries to scale in an old block and then breaks #3471
testing a fix with @svandenhaute
benclifford added a commit that referenced this issue on Jul 16, 2024:
The intended behaviour of this scale-in code, which is only for scaling in all blocks (for example, at the end of a workflow), makes sense as a default for all BlockProviderExecutors. This PR makes that refactor. This code is buggy (before and after) - see issue #3471. This PR does not attempt to fix that, but moves the code into a better place for bugfixing; a subsequent PR will fix it.
benclifford added a commit that referenced this issue on Jul 24, 2024:
The intended behaviour of this scale-in code, which is only for scaling in all blocks (for example, at the end of a workflow), makes sense as a default for all BlockProviderExecutors. This PR makes that refactor. This code is buggy (before and after) - see issue #3471. This PR does not attempt to fix that, but moves the code into a better place for bugfixing; a subsequent PR will fix it.
benclifford added a commit that referenced this issue on Aug 1, 2024:
In the BlockProviderExecutor, the block ID to job ID mapping structures contain the full historical list of blocks. Prior to this PR, that mapping was used as the source of current jobs that should/could be scaled in. This was incorrect, and resulted in the scaling-in code attempting to:

- scale in blocks that had already finished, because it continued to see those blocks as eligible for scale-in
- not scale in blocks that were active, because rather than choosing an alive block, the code would attempt to scale in a non-alive block

After this PR, the _status structure, which should contain reasonably up-to-date status information, is used instead of the block/job ID mapping structures. (As a more general principle, those block/job ID mapping structures should never be examined as a whole, but only used for mapping.)

Changed Behaviour: Scaling in should work better in executors using the default scale-in that was refactored in PR #3526, which right now is Work Queue and Task Vine.

Fixes #3471
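To make the before/after concrete, here is a small, self-contained sketch of the selection this commit describes (this is not Parsl's actual code; the state names and the _status layout are simplified stand-ins): scale-in candidates are drawn only from blocks whose current status is non-terminal.

```python
# Simplified stand-in for the post-fix selection described above; not Parsl's
# actual implementation. State names and the _status layout are illustrative.
from enum import Enum, auto


class State(Enum):
    PENDING = auto()
    RUNNING = auto()
    COMPLETED = auto()   # terminal
    FAILED = auto()      # terminal


TERMINAL_STATES = {State.COMPLETED, State.FAILED}


def choose_scale_in_candidates(_status: dict, count: int) -> list:
    """Pick up to `count` blocks that are still alive according to _status."""
    alive = [block_id for block_id, state in _status.items()
             if state not in TERMINAL_STATES]
    return alive[:count]


# Blocks 0 and 1 reached a terminal state long ago; only block 2 is running.
_status = {"0": State.COMPLETED, "1": State.FAILED, "2": State.RUNNING}
print(choose_scale_in_candidates(_status, 1))  # ['2']
```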
Describe the bug
Sometimes work queue scale-in will generate a log message like this on repeated strategy runs, and will not scale in blocks it should be scaling in.
@svandenhaute reported this on his cluster and I have recreated it on perlmutter.
It looks like this comes from WorkQueueExecutor.scale_in choosing block IDs from self.blocks_to_job_id, which lists all block IDs that have ever existed, not the active/pending-status blocks that should be used here.
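A small sketch of that failure mode (hypothetical block and job IDs; a paraphrase of the pattern rather than the executor's literal code): because blocks_to_job_id keeps every block that has ever existed, drawing scale-in targets from it can select blocks that are already gone.

```python
# Hypothetical illustration of the bug: the historical mapping never forgets
# old blocks, so a scale-in that draws from it can target a finished block.
blocks_to_job_id = {}   # block_id -> job_id, grows for the life of the executor
alive_blocks = set()    # which blocks are actually still running

# Round 1: two blocks are launched and later finish.
blocks_to_job_id.update({"0": "job-100", "1": "job-101"})
alive_blocks.update({"0", "1"})
alive_blocks.clear()                 # both blocks reach a terminal state

# Round 2: one new block is launched.
blocks_to_job_id["2"] = "job-102"
alive_blocks.add("2")

# Buggy selection: the first entry of the historical mapping is block "0",
# which no longer exists, so cancelling it achieves nothing, while the live
# block "2" is never chosen for scale-in.
victim = next(iter(blocks_to_job_id))
print(victim, victim in alive_blocks)   # 0 False
```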
History
PR #3308 renamed that structure, which makes this bug a bit more obvious to diagnose. Prior to that PR, self.blocks_to_job_id was called self.blocks.
This code was copied from HighThroughputExecutor, but HighThroughputExecutor now uses block IDs from self._status and from the interchange, rather than from self.blocks_to_job_id, so it should not have this same problem. TaskVineExecutor also copies this code and likely also shows this problem, and this issue should not be closed until both the TaskVine and WorkQueue executors are fixed.
To Reproduce
This program on perlmutter will scale up and down in the right ways to demonstrate this bug:
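(The program itself was not captured in this copy of the issue. As a rough stand-in, a script along the following lines exercises the same scale-out/idle/scale-out cycle on a Slurm system; the partition name, block limits, sleep times and task counts are guesses, not the original reproducer.)

```python
# Hypothetical reproducer sketch -- NOT the original program from the report.
# Partition name, block limits, sleep times and task counts are assumptions
# chosen to force the strategy to scale out, scale in, and scale out again.
import time

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import WorkQueueExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        WorkQueueExecutor(
            label="wq",
            provider=SlurmProvider(
                partition="debug",       # assumption
                nodes_per_block=1,
                init_blocks=0,
                min_blocks=0,
                max_blocks=4,
                walltime="00:30:00",
            ),
        )
    ],
)


@python_app
def sleeper(t):
    import time
    time.sleep(t)
    return t


parsl.load(config)
try:
    # First burst: several blocks scale out, then finish and scale in.
    [f.result() for f in [sleeper(60) for _ in range(8)]]
    time.sleep(600)  # idle long enough for the strategy to scale blocks in

    # Second burst: old, finished block IDs are still in blocks_to_job_id,
    # so subsequent scale-in decisions can pick them instead of live blocks.
    [f.result() for f in [sleeper(60) for _ in range(8)]]
finally:
    parsl.dfk().cleanup()
```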
Expected behavior
scaling should work
Environment
perlmutter, configured as above, on a slightly hacked up fork of bf98e50