Describe the bug
I have noticed this happening at endpoint startup: one part of the code scales in an initially launched block, but another part of the code does not realise it is gone until several minutes later, when timeouts happen.
In the period between those two events, no new block is launched to run submitted tasks, and instead they sit delayed until the later realisation that the block is gone.
Here are some logs I added:
1658486467.576714 2022-07-22 12:41:07 INFO Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:930 start Got container switch count: {b'431fcad26ccc': 0}
1658486468.053865 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1139 scale_in Scale in BENC
1658486468.054336 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1168 scale_in BENC: scale in by count of 1 blocks
1658486468.054443 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1174 scale_in BENC: sending hold block to block 1
1658486468.054546 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:564 hold_manager BENC: hold_manager that doesn't actually hold a manager
1658486468.054690 2022-07-22 12:41:08 WARNING Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1181 scale_in BENC: provider cancel 3 - forcibly killing block

1658486587.891700 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:998 start Too many heartbeats missed for manager b'431fcad26ccc'
1658486587.892121 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:1015 start Sent 0 failure reports, unregistering manager b'431fcad26ccc'
Note the two-minute delay (12:41:08 to 12:43:07), which I have indicated with new lines.
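For reference, the size of that gap can be read straight off the epoch timestamps at the start of the two WARNING lines (the "forcibly killing block" line and the "Too many heartbeats missed" line). A small sketch of the arithmetic; the variable names are mine, only the two numbers come from the logs:

```python
# Epoch timestamps copied from the log lines above.
block_killed_at = 1658486468.054690      # scale_in: "forcibly killing block"
heartbeats_missed_at = 1658486587.891700  # start: "Too many heartbeats missed"

# Time during which the block was gone but its manager was still registered.
delay_seconds = heartbeats_missed_at - block_killed_at

print(f"gap between scale-in and manager unregistration: {delay_seconds:.3f} s")
```

That comes out at just under 120 seconds, which looks like a heartbeat timeout expiring rather than anything reacting to the scale-in itself.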
To Reproduce
Launch an endpoint, let the initial block be shut down, and then immediately send a task to that endpoint. You should see a delay of several minutes before a new block is launched and the task is run.
Expected behavior
Scaling up to run the submitted task should happen immediately.
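To make the difference between the observed and the expected behaviour concrete, here is a toy model of what I think is going on. This is not funcx_endpoint code: `ToyInterchange`, its methods, the `eager_unregister` flag and the 120-second timeout are all made up for illustration (the timeout is only guessed from the roughly two-minute gap in the logs). The assumption is that the strategy only launches a new block when no managers are registered, and that a scaled-in manager stays registered until its heartbeats time out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInterchange:
    """Toy model only; names and the timeout value are hypothetical."""

    heartbeat_timeout: float = 120.0  # assumed, not taken from funcx config
    eager_unregister: bool = False    # True models the expected behaviour
    managers: dict = field(default_factory=dict)  # manager id -> last heartbeat
    blocks_launched: int = 0

    def launch_block(self, manager_id, now):
        self.blocks_launched += 1
        self.managers[manager_id] = now

    def scale_in(self, manager_id):
        # The block is killed here. In the reported behaviour the manager
        # record is left in place; only the eager variant drops it right away.
        if self.eager_unregister:
            self.managers.pop(manager_id, None)

    def expire_managers(self, now):
        # Drop managers whose last heartbeat is older than the timeout.
        for manager_id, last_seen in list(self.managers.items()):
            if now - last_seen > self.heartbeat_timeout:
                del self.managers[manager_id]

    def strategy_step(self, pending_tasks, now):
        # Scale out only if there is work and no manager appears to be alive.
        self.expire_managers(now)
        if pending_tasks > 0 and not self.managers:
            self.launch_block(f"manager-{self.blocks_launched + 1}", now)
            return True
        return False


def seconds_until_scale_out(eager_unregister, step=1.0):
    """Launch a block, scale it in at t=0, submit one task, wait for scale-out."""
    ix = ToyInterchange(eager_unregister=eager_unregister)
    ix.launch_block("manager-1", now=0.0)
    ix.scale_in("manager-1")
    now = 0.0
    while not ix.strategy_step(pending_tasks=1, now=now):
        now += step
    return now


print("observed (stale manager record):", seconds_until_scale_out(False))
print("expected (unregister on scale-in):", seconds_until_scale_out(True))
```

In this model the task waits for the whole heartbeat timeout in the first case and is picked up on the first strategy step in the second. Whether the real fix should be unregistering on scale-in or something else is a separate question.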
Environment
Distributed Environment
my dev environment, hacked main a9d70f1
benclifford changed the title from "two parts of scaling in happen at different times, leading to delays" to "two parts of scaling in happen at different times, leading to delays in task execution" on Jul 22, 2022