Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop of scaling in and out with HTEX #3696

Open
stevenstetzler opened this issue Nov 14, 2024 · 10 comments · May be fixed by #3721
Open

Infinite loop of scaling in and out with HTEX #3696

stevenstetzler opened this issue Nov 14, 2024 · 10 comments · May be fixed by #3721
Labels

Comments

@stevenstetzler
Copy link

Describe the bug
I've encountered an infinite loop of scaling in and out with the high throughput executor. Blocks get scaled out only to be immediately scaled in as idle, followed by blocks getting scaled out again.

I believe the issue is due to the block info in the high throughput executor scale in logic having infinite idle time (per their initialization). This seems related to #3353 where I can imagine un-started blocks may not report an idle time.

The logic is here: https://github.com/Parsl/parsl/blob/2024.11.11/parsl/executors/high_throughput/executor.py#L706-L747

        @dataclass
        class BlockInfo:
            tasks: int  # sum of tasks in this block
            idle: float  # shortest idle time of any manager in this block

        # block_info will be populated from two sources:
        # the Job Status Poller mutable block list, and the list of blocks
        # which have connected to the interchange.

        def new_block_info():
            return BlockInfo(tasks=0, idle=float('inf'))

        block_info: Dict[str, BlockInfo] = defaultdict(new_block_info)

        for block_id, job_status in self._status.items():
            if job_status.state not in TERMINAL_STATES:
                block_info[block_id] = new_block_info()

        managers = self.connected_managers()
        for manager in managers:
            if not manager['active']:
                continue
            b_id = manager['block_id']
            block_info[b_id].tasks += manager['tasks']
            block_info[b_id].idle = min(block_info[b_id].idle, manager['idle_duration'])

        # The scaling policy is that longest idle blocks should be scaled down
        # in preference to least idle (most recently used) blocks.
        # Other policies could be implemented here.

        sorted_blocks = sorted(block_info.items(), key=lambda item: (-item[1].idle, item[1].tasks))

        logger.debug(f"Scale in selecting from {len(sorted_blocks)} blocks")
        if max_idletime is None:
            block_ids_to_kill = [x[0] for x in sorted_blocks[:blocks]]
        else:
            block_ids_to_kill = []
            for x in sorted_blocks:
                if x[1].idle > max_idletime and x[1].tasks == 0:
                    block_ids_to_kill.append(x[0])
                    if len(block_ids_to_kill) == blocks:
                        break

and this line in particular is the culprit: https://github.com/Parsl/parsl/blob/2024.11.11/parsl/executors/high_throughput/executor.py#L744

                if x[1].idle > max_idletime and x[1].tasks == 0:

Note that new_block_info() initializes idle to float('inf') and if that default value is used in comparison with max_idletime it will always return true and be appended to the block ids to kill.

A one-line fix to

                if x[1].idle != float('inf') and x[1].idle > max_idletime and x[1].tasks == 0:

has fixed the behavior for me.

To Reproduce
I'm not sure how to reproduce this consistently, as it only appeared when moving to a new computing environment.

Expected behavior
I expect scaling to behave as normal.

Environment

  • OS: Rocky Linux 9
  • Python version: 3.11.8
  • Parsl version: 1.3.0-dev

Distributed Environment

  • Compute node
  • Compute nodes
@benclifford
Copy link
Collaborator

That code is in scale_in which should only be called by the scaling strategy code when it has decided that there are too many blocks, and in that case the newest unused (infinitely-idle) blocks should be scaled in (deliberately in preference to any that are already running - although that choice is separately debatable).

What I'm interested in here, then, is why scale_in gets called. That gets called in several places in the scaling strategy, for example here:

executor.scale_in_facade(active_blocks - min_blocks)

Usually that happens when there isn't enough load (outstanding tasks) to need that number of blocks to exist.

So maybe you can have a look at what is happening around there - there are a bunch of debug log messages in strategy.py that are intended to help understand why parsl is making scaling decisions.

@stevenstetzler
Copy link
Author

@benclifford I took a closer look and think I have tracked it down. In the logs I see,

DEBUG: Strategy case 4b: more slots than tasks

i.e. there are more slots than tasks. Active slots includes pending blocks per

running = sum([1 for x in status.values() if x.state == JobState.RUNNING])
pending = sum([1 for x in status.values() if x.state == JobState.PENDING])
active_blocks = running + pending
active_slots = active_blocks * tasks_per_node * nodes_per_block

However PENDING JobStates are included in blocks to be considered for scale in since JobStates.PENDING is not in TERMINAL_STATES

def new_block_info():
return BlockInfo(tasks=0, idle=float('inf'))
block_info: Dict[str, BlockInfo] = defaultdict(new_block_info)
for block_id, job_status in self._status.items():
if job_status.state not in TERMINAL_STATES:
block_info[block_id] = new_block_info()

So, scale_in is called since active_slots > active_tasks. The only active slots come from a pending block. During scale in, a BlockInfo is added with inf idle time. Since this block is pending it isn't included in connected_managers() (in which case the inf would be cleared with the true block idle time)

managers = self.connected_managers()
for manager in managers:
if not manager['active']:
continue
b_id = manager['block_id']
block_info[b_id].tasks += manager['tasks']
block_info[b_id].idle = min(block_info[b_id].idle, manager['idle_duration'])

The pending block is then selected for scale in and cancelled.

This is followed by a scale out and causes a loop that isn't resolved unless the block becomes available before the scaling strategy loop occurs again.

@benclifford
Copy link
Collaborator

@stevenstetzler this should be converging towards having the "correct" number of slots though: if there are more slots than tasks then (modulo some rounding problem?) blocks should be cancelled - whether they are active or pending.

The intention is that the number of slots (active or pending) converges towards the number of tasks pending or active.

What does this log line say, which is the raw data for the "more slots than tasks" decision?

logger.debug(f"Slot ratio calculation: active_slots = {active_slots}, active_tasks = {active_tasks}")

Even better, can you send me a complete parsl.log for one of these runs?

@benclifford
Copy link
Collaborator

(the question I have is not about why pending blocks are scaled in - it is why any blocks are being scaled in, if there is enough task load to scale them out 5 seconds earlier)

@stevenstetzler
Copy link
Author

Here is the parsl.log
parsl.log

You will see

1733887247.182073 2024-12-10 21:20:47 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 48, active_tasks = 14
1733887252.266406 2024-12-10 21:20:52 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 0, active_tasks = 14
1733887261.421944 2024-12-10 21:21:01 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 48, active_tasks = 14
1733887266.854923 2024-12-10 21:21:06 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 0, active_tasks = 14
1733887272.831117 2024-12-10 21:21:12 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 48, active_tasks = 14
1733887278.464345 2024-12-10 21:21:18 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 0, active_tasks = 14

as the block gets scaled in and out.

@benclifford
Copy link
Collaborator

ok, that's interesting. looks more like it's oscillating around the convergence point (of 14 slots) rather than converging to a fixed number. let me see if I can reproduce this in a test locally.

@benclifford
Copy link
Collaborator

ok, here's what I think is a reproducer https://github.com/Parsl/parsl/tree/benc-3696 - there is some suspicious rounding in the code that chooses how to head towards the target number of blocks. I'll flesh out some more testing and hopefully it is then a simple fix.

@benclifford benclifford linked a pull request Dec 11, 2024 that will close this issue
@benclifford
Copy link
Collaborator

@stevenstetzler can you try out the fix in PR #3721?

@stevenstetzler
Copy link
Author

@benclifford I encountered the error in one workflow as a part of a large number of workflows for in-progress data processing. I've already executed the workflow using my hack fix and with my set up, I can't re-run the exact workflow again. Would you like me to test it out in another way?

@benclifford
Copy link
Collaborator

@stevenstetzler no worries if you can't easily reproduce it - I'm fairly certain about #3721 fixing some bug similar to what you are seeing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants