Work Queue using Parsl scaling sometimes tries to scale in an old block and then breaks #3471

Closed
benclifford opened this issue Jun 5, 2024 · 1 comment · Fixed by #3548

Describe the bug

Sometimes Work Queue scale-in will generate a log message like the one below on repeated strategy runs, and will not scale in blocks that it should be scaling in.


1717584031.221508 2024-06-05 03:40:31 MainProcess-256259 JobStatusPoller-Timer-Thread-140411461217168-140411449100032 parsl.process_loggers:30 wrapped ERROR: Exceptional ending for _general_strategy on thread JobStatusPoller-Timer-Thread-140411461217168
Traceback (most recent call last):
  File "/global/u1/b/bxc/tmp/tmp-202406-slurm/parsl/parsl/process_loggers.py", line 26, in wrapped
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/global/u1/b/bxc/tmp/tmp-202406-slurm/parsl/parsl/jobs/strategy.py", line 256, in _general_strategy
    executor.scale_in_facade(active_blocks - min_blocks)
  File "/global/u1/b/bxc/tmp/tmp-202406-slurm/parsl/parsl/executors/status_handling.py", line 305, in scale_in_facade
    del self._status[block_id]
        ~~~~~~~~~~~~^^^^^^^^^^
KeyError: '0'

@svandenhaute reported this on his cluster and I have recreated it on perlmutter.

It looks like this comes from WorkQueueExecutor.scale_in choosing block IDs from self.blocks_to_job_id, which lists all block IDs that have ever existed, rather than the blocks in active or pending status that should be used here.
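
For illustration only, here is a minimal, self-contained sketch of the failure mode. The class and job names are made up and this is not the actual Parsl source; it just shows how drawing scale-in candidates from the historical block/job ID mapping can pick a block that is no longer in self._status, giving the KeyError above.

# Hypothetical, simplified sketch -- not the actual Parsl source.
# blocks_to_job_id records every block that has ever existed, while
# _status only tracks blocks the executor still considers live.

class SketchExecutor:
    def __init__(self):
        self.blocks_to_job_id = {'0': 'job-100', '1': 'job-101'}  # full history
        self._status = {'1': 'RUNNING'}  # block '0' has already terminated

    def scale_in(self, count):
        # BUG: candidates come from the historical mapping, so a block that
        # finished long ago (here '0') can be chosen again.
        return list(self.blocks_to_job_id.keys())[:count]

    def scale_in_facade(self, count):
        for block_id in self.scale_in(count):
            del self._status[block_id]  # KeyError: '0' for the stale block

SketchExecutor().scale_in_facade(1)  # raises KeyError: '0', as in the traceback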

History

PR #3308 renamed that structure, which makes this bug a bit more obvious to diagnose. Prior to that PR, self.blocks_to_job_id was called self.blocks.

This code was copied from HighThroughputExecutor, but HighThroughputExecutor now uses block IDs from self._status and from the interchange, rather than from self.blocks_to_job_id, so it should not have this same problem.

TaskVineExecutor also copies this code and likely also shows this problem, and this issue should not be closed until both TaskVine and WorkQueue executors are fixed.

To Reproduce

This program on perlmutter will scale up and down in the right ways to demonstrate this bug:


import parsl
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.launchers import SrunLauncher
from parsl.executors import WorkQueueExecutor
from parsl.addresses import address_by_interface

config = Config(
    max_idletime=30,
    executors=[
        WorkQueueExecutor(
            provider=SlurmProvider(
                'debug',  # Partition / QOS
                init_blocks=0,
                min_blocks=0,
                max_blocks=3,
                nodes_per_block=1,
                # string to prepend to #SBATCH blocks in the submit
                scheduler_options='#SBATCH -C cpu',
                # Command to be run before starting a worker
                worker_init='module load python; source activate tmp-202406-slurm',
                # We request all hyperthreads on a node.
                launcher=SrunLauncher(overrides='-c 128'),
                walltime='00:05:00',
                # Slurm scheduler can be slow at times,
                # increase the command timeouts
                cmd_timeout=120,
            ),
        )
    ]
)

with parsl.load(config):


  @parsl.python_app
  def foo():
    import time
    time.sleep(1)


  print("first job")
  foo().result()
 
  print("now idling long enough to scale in")
  import time
  time.sleep(70)

  print("second job")
  foo().result()
 
  print("now idling long enough to scale in")
  import time
  time.sleep(70)


  print("now running another task")
  foo().result()
  print("done")

Expected behavior
scaling should work

Environment

perlmutter, configured as above, on a slightly hacked up fork of bf98e50

benclifford added the bug label Jun 5, 2024
@benclifford (Collaborator, Author)

testing a fix with @svandenhaute

benclifford added a commit that referenced this issue Jul 16, 2024
The intended behaviour of this scale-in code, which is only for scaling in all blocks (for example, at the end of a workflow), makes sense as a default for all BlockProviderExecutors. This PR makes that refactor.

This code is buggy (before and after) - see issue #3471. This PR does not
attempt to fix that, but moves code into a better place for bugfixing, and
a subsequent PR will fix it.
benclifford added a commit that referenced this issue Aug 1, 2024
In the BlockProviderExecutor, the block ID to job ID mapping structures contain the full historical list of blocks. Prior to this PR, the mapping was used as source of current jobs that should/could be scaled in.

This was incorrect, and resulted in the scale-in code attempting to:

- scale in blocks that had already finished, because it continued to see those blocks as eligible for scale-in

- not scale in blocks that were still active, because rather than choosing an alive block, the code would attempt to scale in a non-alive block

After this PR, the _status structure, which should contain reasonably up-to-date status information, is used instead of the block/job ID mapping structures. (As a more general principle, those block/job ID mapping structures should never be examined as a whole, only used for mapping.)

Changed Behaviour:
Scaling in should work better in executors that use the default scale-in code refactored in PR #3526, which right now means Work Queue and Task Vine.

Fixes #3471
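
As a rough sketch of the idea described in that commit message, choosing scale-in candidates from the current status table rather than from the historical block/job ID mapping, the following is illustrative only; TERMINAL_STATES and scale_in_candidates are placeholder names, not the real Parsl API.

# Hypothetical sketch of the corrected candidate selection -- not the actual
# Parsl code. TERMINAL_STATES stands in for whatever set of job states
# counts as "already finished".

TERMINAL_STATES = {'COMPLETED', 'FAILED', 'CANCELLED'}

def scale_in_candidates(status, count):
    """Choose up to `count` blocks to scale in, considering only blocks
    whose current status is not terminal."""
    live = [block_id for block_id, state in status.items()
            if state not in TERMINAL_STATES]
    return live[:count]

# Block '0' is long gone; only block '1' is still eligible for scale-in.
status = {'0': 'COMPLETED', '1': 'RUNNING'}
print(scale_in_candidates(status, 1))  # prints ['1'] -- never the stale block '0'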