Fix rounding error in htex block scale in #3721

Merged
merged 8 commits into from
Jan 6, 2025
benclifford (Collaborator) commented Dec 11, 2024

Description

PR #2196 calculates a number of blocks to scale in, in the htex strategy, rather than scaling in one block per strategy iteration. However, it rounds the wrong way: it scales in a rounded up, rather than rounded down, number of blocks.

Issue #3696 shows that this results in oscillating behaviour: with 14 tasks and 48 workers per block, alternating strategy runs will either scale up to the rounded-up number of needed blocks (14/48 => 1) or scale down to the rounded-down number of needed blocks (14/48 => 0).

This PR changes the rounding introduced in #2196 to be consistent: rounding up the number of blocks to scale up, and rounding down the number of blocks to scale down.
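
The arithmetic behind the oscillation can be illustrated with a minimal sketch. This is not the Parsl strategy code itself: the function names `blocks_needed` and `excess_blocks`, the `slots_per_block` parameter (standing in for `tasks_per_node * nodes_per_block`) and the `rounding` parameter are assumptions introduced here for illustration.

```python
import math

def blocks_needed(active_tasks, slots_per_block, parallelism=1.0):
    # Scale-out side: round up so that every task can get a slot.
    return math.ceil(active_tasks * parallelism / slots_per_block)

def excess_blocks(active_blocks, active_tasks, slots_per_block,
                  parallelism=1.0, rounding=math.floor):
    # Scale-in side: how many whole blocks could be removed.
    # `rounding` is a hypothetical knob to compare ceil (old) with floor (new).
    active_slots = active_blocks * slots_per_block
    excess_slots = rounding(active_slots - active_tasks * parallelism)
    return rounding(excess_slots / slots_per_block)

# 14 tasks, 48 workers per block, 1 block currently running.
old = excess_blocks(1, 14, 48, rounding=math.ceil)   # rounds 34/48 up
new = excess_blocks(1, 14, 48, rounding=math.floor)  # rounds 34/48 down
print(blocks_needed(14, 48), old, new)
```

With rounding up, the single running block is counted as excess and scaled in, after which one block is needed again on the next run. With rounding down, no block is counted as excess and the block count stays put.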

Changed Behaviour

HTEX scale down should oscillate less

Fixes

Fixes #3696

Type of change

  • Bug fix

benclifford (Collaborator, Author) commented:

@jrueb not sure if you're still interested, but this PR affects code you introduced so you might want to look at it

@benclifford benclifford marked this pull request as ready for review January 6, 2025 16:26
@benclifford benclifford requested a review from khk-globus January 6, 2025 16:26
Comment on lines -301 to 303
- excess_slots = math.ceil(active_slots - (active_tasks * parallelism))
- excess_blocks = math.ceil(float(excess_slots) / (tasks_per_node * nodes_per_block))
+ excess_slots = math.floor(active_slots - (active_tasks * parallelism))
+ excess_blocks = math.floor(float(excess_slots) / (tasks_per_node * nodes_per_block))
  excess_blocks = min(excess_blocks, active_blocks - min_blocks)
Collaborator commented:

I'm sure it's accounted for prior to this if-branch (and it exists prior to this hunk), but the first thing that comes to mind is: "is it possible for excess_blocks to become negative?"

The if-condition guarantees that active_blocks - min_blocks is at least 1, but I'm not clear on the guarantees of active_slots - (active_tasks * parallelism).

Collaborator, Author commented:

Probably many of these values are assumed to be "sensible", and this code will go wrong if they are not. See, for example, the recent issue #3726.
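
The question raised in this thread can be made concrete with a small sketch. It is not part of this PR and does not reproduce the Parsl code: the function names, the `slots_per_block` parameter and the `max(0, ...)` clamp are assumptions for illustration, and the input `parallelism=2.0` is chosen deliberately as a "non-sensible" value to show what happens when `active_tasks * parallelism` exceeds `active_slots`.

```python
import math

def raw_excess_blocks(active_slots, active_tasks, parallelism,
                      slots_per_block, active_blocks, min_blocks):
    # Same shape as the floor-based calculation in the diff above.
    excess_slots = math.floor(active_slots - (active_tasks * parallelism))
    excess = math.floor(float(excess_slots) / slots_per_block)
    # min() only caps the value from above; a negative value passes through.
    return min(excess, active_blocks - min_blocks)

def clamped_excess_blocks(*args):
    # Hypothetical defensive variant: never report a negative block count.
    return max(0, raw_excess_blocks(*args))

# 48 slots, 40 tasks, parallelism 2.0 (illustrative only), 1 block, min_blocks 0.
args = (48, 40, 2.0, 48, 1, 0)
print(raw_excess_blocks(*args), clamped_excess_blocks(*args))
```

Under these assumed inputs the unclamped calculation yields a negative number of blocks, which is the case the reviewer asks about. Whether the surrounding if-branch already rules such inputs out is not settled in this thread.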

@benclifford benclifford added this pull request to the merge queue Jan 6, 2025
Merged via the queue into master with commit ed80dad Jan 6, 2025
7 checks passed
@benclifford benclifford deleted the benc-3696 branch January 6, 2025 17:18
Development

Successfully merging this pull request may close these issues.

Infinite loop of scaling in and out with HTEX
2 participants