Work Queue using Parsl scaling sometimes tries to scale in an old block and then breaks #3471

Closed
benclifford opened this issue Jun 5, 2024 · 1 comment · Fixed by #3548

Describe the bug

Sometimes Work Queue scale-in will generate a log message like the one below on repeated strategy runs, and will not scale in blocks that it should be scaling in.


1717584031.221508 2024-06-05 03:40:31 MainProcess-256259 JobStatusPoller-Timer-Thread-140411461217168-140411449100032 parsl.process_loggers:30 wrapped ERROR: Exceptional ending for _general_strategy on thread JobStatusPoller-Timer-Thread-140411461217168
Traceback (most recent call last):
  File "/global/u1/b/bxc/tmp/tmp-202406-slurm/parsl/parsl/process_loggers.py", line 26, in wrapped
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/global/u1/b/bxc/tmp/tmp-202406-slurm/parsl/parsl/jobs/strategy.py", line 256, in _general_strategy
    executor.scale_in_facade(active_blocks - min_blocks)
  File "/global/u1/b/bxc/tmp/tmp-202406-slurm/parsl/parsl/executors/status_handling.py", line 305, in scale_in_facade
    del self._status[block_id]
        ~~~~~~~~~~~~^^^^^^^^^^
KeyError: '0'

@svandenhaute reported this on his cluster and I have recreated it on perlmutter.

It looks like this comes from WorkQueueExecutor.scale_in choosing block IDs from self.blocks_to_job_id, which lists all block IDs that have ever existed, rather than the blocks in active or pending status that should be used here.
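
For illustration only, here is a minimal, self-contained sketch of the failure mode. The class and job names are made up and this is not the actual Parsl source; it just shows how drawing scale-in candidates from the historical block/job ID mapping can pick a block that is no longer in self._status, giving the KeyError above.

# Hypothetical, simplified sketch -- not the actual Parsl source.
# blocks_to_job_id records every block that has ever existed, while
# _status only tracks blocks the executor still considers live.

class SketchExecutor:
    def __init__(self):
        self.blocks_to_job_id = {'0': 'job-100', '1': 'job-101'}  # full history
        self._status = {'1': 'RUNNING'}  # block '0' has already terminated

    def scale_in(self, count):
        # BUG: candidates come from the historical mapping, so a block that
        # finished long ago (here '0') can be chosen again.
        return list(self.blocks_to_job_id.keys())[:count]

    def scale_in_facade(self, count):
        for block_id in self.scale_in(count):
            del self._status[block_id]  # KeyError: '0' for the stale block

SketchExecutor().scale_in_facade(1)  # raises KeyError: '0', as in the traceback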

History

PR #3308 renamed that structure, which makes this bug a bit more obvious to diagnose. Prior to that PR, self.blocks_to_job_id was called self.blocks.

This code was copied from HighThroughputExecutor, but HighThroughputExecutor now uses block IDs from self._status and from the interchange, rather than from self.blocks_to_job_id, so it should not have this same problem.

TaskVineExecutor also copies this code and likely also shows this problem, and this issue should not be closed until both TaskVine and WorkQueue executors are fixed.

To Reproduce

This program on perlmutter will scale up and down in the right ways to demonstrate this bug:


import parsl
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.launchers import SrunLauncher
from parsl.executors import WorkQueueExecutor
from parsl.addresses import address_by_interface

config = Config(
    max_idletime=30,
    executors=[
        WorkQueueExecutor(
            provider=SlurmProvider(
                'debug',  # Partition / QOS
                init_blocks=0,
                min_blocks=0,
                max_blocks=3,
                nodes_per_block=1,
                # string to prepend to #SBATCH blocks in the submit
                scheduler_options='#SBATCH -C cpu',
                # Command to be run before starting a worker
                worker_init='module load python; source activate tmp-202406-slurm',
                # We request all hyperthreads on a node.
                launcher=SrunLauncher(overrides='-c 128'),
                walltime='00:05:00',
                # Slurm scheduler can be slow at times,
                # increase the command timeouts
                cmd_timeout=120,
            ),
        )
    ]
)

with parsl.load(config):


  @parsl.python_app
  def foo():
    import time
    time.sleep(1)


  print("first job")
  foo().result()
 
  print("now idling long enough to scale in")
  import time
  time.sleep(70)

  print("second job")
  foo().result()
 
  print("now idling long enough to scale in")
  import time
  time.sleep(70)


  print("now running another task")
  foo().result()
  print("done")

Expected behavior
scaling should work

Environment

perlmutter, configured as above, on a slightly hacked up fork of bf98e50

benclifford added the bug label Jun 5, 2024
@benclifford (Collaborator, Author)

testing a fix with @svandenhaute

benclifford added a commit that referenced this issue Jul 16, 2024
The intended behaviour of this scale-in code, which is only for scaling in all blocks (for example, at the end of a workflow), makes sense as a default for all BlockProviderExecutors. This PR makes that refactor.

This code is buggy (before and after) - see issue #3471. This PR does not
attempt to fix that, but moves code into a better place for bugfixing, and
a subsequent PR will fix it.
benclifford added a commit that referenced this issue Aug 1, 2024
In the BlockProviderExecutor, the block ID to job ID mapping structures contain the full historical list of blocks. Prior to this PR, the mapping was used as source of current jobs that should/could be scaled in.

This was incorrect, and resulted in the scale-in code attempting to:

- scale in blocks that had already finished, because it continued to see those blocks as eligible for scale-in

- not scale in blocks that were still active, because rather than choosing an alive block, the code would attempt to scale in a non-alive block

After this PR, the _status structure, which should contain reasonably up-to-date status information, is used instead of the block/job ID mapping structures. (As a more general principle, those block/job ID mapping structures should never be examined as a whole, only used for mapping.)

Changed Behaviour:
Scaling in should work better in executors that use the default scale-in code refactored in PR #3526, which right now means Work Queue and Task Vine.

Fixes #3471
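
As a rough sketch of the idea described in that commit message, choosing scale-in candidates from the current status table rather than from the historical block/job ID mapping, the following is illustrative only; TERMINAL_STATES and scale_in_candidates are placeholder names, not the real Parsl API.

# Hypothetical sketch of the corrected candidate selection -- not the actual
# Parsl code. TERMINAL_STATES stands in for whatever set of job states
# counts as "already finished".

TERMINAL_STATES = {'COMPLETED', 'FAILED', 'CANCELLED'}

def scale_in_candidates(status, count):
    """Choose up to `count` blocks to scale in, considering only blocks
    whose current status is not terminal."""
    live = [block_id for block_id, state in status.items()
            if state not in TERMINAL_STATES]
    return live[:count]

# Block '0' is long gone; only block '1' is still eligible for scale-in.
status = {'0': 'COMPLETED', '1': 'RUNNING'}
print(scale_in_candidates(status, 1))  # prints ['1'] -- never the stale block '0'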