two parts of scaling in happen at different times, leading to delays in task execution #855

@benclifford

Describe the bug

I have noticed this happening at endpoint startup: one part of the code scales in an initially launched block, but another part of the code does not realise the block is gone until several minutes later, when heartbeat timeouts happen.

In the period between those two events, no new block is launched to run submitted tasks, and instead they sit delayed until the later realisation that the block is gone.

Here are some logs I added:


1658486467.576714 2022-07-22 12:41:07 INFO Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:930 start Got container switch count: {b'431fcad26ccc': 0}
1658486468.053865 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1139 scale_in Scale in BENC
1658486468.054336 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1168 scale_in BENC: scale in by count of 1 blocks
1658486468.054443 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1174 scale_in BENC: sending hold block to block 1
1658486468.054546 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:564 hold_manager BENC: hold_manager that doesn't actually hold a manager
1658486468.054690 2022-07-22 12:41:08 WARNING Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1181 scale_in BENC: provider cancel 3 - forcibly killing block



1658486587.891700 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:998 start Too many heartbeats missed for manager b'431fcad26ccc'
1658486587.892121 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:1015 start Sent 0 failure reports, unregistering manager b'431fcad26ccc'

Note the two-minute delay, which I have indicated with blank lines.
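For reference, the size of the gap can be checked from the epoch timestamps in the first field of the log lines. This is only arithmetic on the two values copied from the logs above; the variable names are my own.

```python
# Epoch timestamps copied from the first field of the log lines above.
forced_kill_ts = 1658486468.054690        # "provider cancel 3 - forcibly killing block"
heartbeat_timeout_ts = 1658486587.891700  # "Too many heartbeats missed for manager"

# Time between the block being killed and the manager being declared lost.
gap_seconds = heartbeat_timeout_ts - forced_kill_ts
print(f"gap: {gap_seconds:.1f} s")  # prints "gap: 119.8 s"
```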

To Reproduce
Launch an endpoint, let the initial block be shut down, and then immediately send a task to that endpoint. You should see a delay of several minutes before a new block is launched and the task is run.

Expected behavior
Scaling up to run the submitted task should happen immediately.
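
To make the difference between the observed and expected behaviour concrete, here is a toy model. It is not funcx_endpoint code: `ToyInterchange`, its fields, and the 120 second `HEARTBEAT_TIMEOUT` are all hypothetical, and it assumes that the scale-out decision is based on the set of registered managers, which I have not confirmed in the real strategy code. It only illustrates how removing the block without unregistering its manager could delay scale-out until the heartbeat timeout.

```python
from dataclasses import dataclass, field

# Assumed value, chosen to roughly match the gap seen in the logs.
HEARTBEAT_TIMEOUT = 120  # seconds


@dataclass
class ToyInterchange:
    """Hypothetical stand-in for the interchange; not the real class."""
    managers: dict = field(default_factory=dict)     # manager id -> last heartbeat time
    provider_blocks: set = field(default_factory=set)
    pending_tasks: int = 0

    def scale_in(self, block_id, manager_id, unregister):
        # Part 1: the block is cancelled at the provider straight away.
        self.provider_blocks.discard(block_id)
        # Part 2: the manager record is only dropped here if asked to.
        if unregister:
            self.managers.pop(manager_id, None)

    def expire_managers(self, now):
        # Managers that have missed heartbeats for too long are unregistered.
        for manager_id, last_seen in list(self.managers.items()):
            if now - last_seen > HEARTBEAT_TIMEOUT:
                del self.managers[manager_id]

    def needs_scale_out(self):
        # Assumption: scale out only when tasks wait and no manager is known.
        return self.pending_tasks > 0 and not self.managers


def seconds_until_scale_out(unregister_on_scale_in):
    ix = ToyInterchange(managers={"m1": 0}, provider_blocks={"block-1"})
    ix.scale_in("block-1", "m1", unregister=unregister_on_scale_in)
    ix.pending_tasks = 1  # a task is submitted right after scale-in
    for now in range(600):
        ix.expire_managers(now)
        if ix.needs_scale_out():
            return now
    return None


print(seconds_until_scale_out(unregister_on_scale_in=False))  # observed: waits for timeout
print(seconds_until_scale_out(unregister_on_scale_in=True))   # expected: immediate
```

In this model the first call returns 121 and the second returns 0. Whether the right fix is to unregister the manager during scale-in, or to have the strategy count provider blocks instead, is a separate question that this sketch does not settle.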

Environment
Distributed Environment
my dev environment, hacked main a9d70f1
