Using legacy and V3 Partition Manager, some control queues seem to stop processing messages (from DurableTask.Core) #3248

@ericleigh007

Description

Copied from the DurableTask.Core issue, with updates from our latest findings.

The problem is difficult to describe, but it results in uneven use of partitions: a few control queues back up with hundreds of times more messages than the others and remain slow to process. See details below.
Whatever triggers the problem seems to start before the queue backup makes it evident, but so far we have not been able to trace it.
As a result, hapless orchestrators serviced by the affected partition or partitions run extremely slowly, while the same orchestrators that happen to use other partitions run very quickly.

Expected behavior

The expectation, based on the hashing algorithm and all of the logic meant to keep partitions in balance, is that load should be well balanced and that one or a few partitions should not take on all of it.
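One way to rule out skew in the instance IDs themselves is to hash a sample of them offline and count how many land on each partition. The sketch below assumes, for illustration only, that an instance is assigned to a control queue by a 32-bit FNV-1a hash of its UTF-8 instance ID modulo the partition count; the exact algorithm should be confirmed against the DurableTask.AzureStorage source. The function names and the sample IDs are hypothetical.

```python
from collections import Counter

FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Plain 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def partition_for(instance_id: str, partition_count: int = 16) -> int:
    """ASSUMPTION: partition index = hash(instance ID) mod partition count."""
    return fnv1a_32(instance_id.encode("utf-8")) % partition_count


def partition_histogram(instance_ids, partition_count: int = 16):
    """Count how many instance IDs map to each partition index."""
    counts = Counter(partition_for(i, partition_count) for i in instance_ids)
    return [counts.get(p, 0) for p in range(partition_count)]


if __name__ == "__main__":
    # Hypothetical IDs; substitute a sample of real instance IDs.
    sample_ids = [f"order-{n:06d}" for n in range(10_000)]
    for p, c in enumerate(partition_histogram(sample_ids)):
        print(f"control-{p:02d}: {c}")
```

If real instance IDs spread roughly evenly under this check, the backlog is less likely to be explained by the IDs alone.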

Actual behavior

As above, a few partition queues get backed up and process extremely slowly compared to the same code executed on other partitions.

Relevant source code snippets

A small reproduction was attempted, but it did not reproduce the problem. In previous problem reports, I was able to submit a description on GitHub and the team engaged almost immediately.
I have been spending my time tracking down the problem rather than reproducing it.

Known workarounds

If the queues are not extremely backed up, one can cut off input to the function; however, the queues appear to remain slow in this case. Turning off the inputs merely lets the queues finally drain.

Restarting the function app appears to clear the problem. Unfortunately, in a banking system such as ours, with millions of pounds on the line, losing a transaction by restarting the function app willy-nilly is not an option.

App Details

Minimum scale out of 3
Maximum of 10
Instance size: EP3
Functions V4, .NET Core 8, in-process
Durable Task extension 3.4.1, DurableTask.Core 3.3.0, DurableTask.AzureStorage 2.4.0
AzureFunctionsJobHost__Extensions__DurableTask__StorageProvider__PartitionCount: 16
AzureFunctionsJobHost__Extensions__DurableTask__MaxConcurrentActivityFunctions: 100
AzureFunctionsJobHost__Extensions__DurableTask__MaxConcurrentOrchestratorFunctions: 70

Screenshots

A look at the first indication that the queue is going to be a problem. These counts are 10x higher than the counts on other queues, but they recover. There are numerous such "pulses" before things fall badly behind.

  2025-11-04T16:50:54.9866255+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 0 (-95) [trend: -19] [19]
  2025-11-04T16:51:15.1658897+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 0 (0) [trend: 0] [0]
  2025-11-04T16:51:29.4721540+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3 (3) [trend: 3] [0]
  2025-11-04T16:51:44.2147253+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 63 (63) [trend: 62] [0.6]
  2025-11-04T16:51:58.7595251+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 394 (394) [trend: 381] [13.2]
  2025-11-04T16:52:13.0058312+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 292 (292) [trend: 200] [92]
  2025-11-04T16:52:33.9949498+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 286 (286) [trend: 136] [150.4]
  2025-11-04T16:52:48.3422946+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 336 (333) [trend: 128] [207.6]
  2025-11-04T16:53:02.6904862+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 47 (-16) [trend: -227] [274.2]
  2025-11-04T16:53:17.5634027+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 68 (-326) [trend: -203] [271]
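The bracketed columns are not labelled, but the samples above are consistent with the following reading, which is inferred from the numbers rather than taken from the monitoring script: the value in parentheses is the current count minus the count five samples earlier, the last bracket is the mean of the previous five counts, and `trend` is the current count minus that mean, rounded. A short Python sketch that recomputes the columns from raw counts (the function name is hypothetical):

```python
WINDOW = 5  # assumed window size, inferred from the logged values


def derive_columns(counts, window=WINDOW):
    """Return (count, delta, trend, average) for each sample that has a full window."""
    rows = []
    for i in range(window, len(counts)):
        previous = counts[i - window:i]          # the five samples before this one
        average = sum(previous) / window
        delta = counts[i] - counts[i - window]   # change versus five samples ago
        trend = round(counts[i] - average)       # distance above the recent mean
        rows.append((counts[i], delta, trend, round(average, 1)))
    return rows


if __name__ == "__main__":
    # Raw counts for control-06 from 16:50:54 to 16:53:17.
    counts = [0, 0, 3, 63, 394, 292, 286, 336, 47, 68]
    for row in derive_columns(counts):
        print(row)
```

For these ten counts, the five derived rows match the logged lines from 16:52:13 to 16:53:17, which supports the inferred reading without confirming it.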

Later, after a few more of these pulses, things get totally out of control:

  2025-11-04T17:20:28.4478026+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1 (-222) [trend: -242] [243.4]
  2025-11-04T17:20:43.5914407+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 169 (-147) [trend: -30] [199]
  2025-11-04T17:20:58.0324964+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 241 (17) [trend: 71] [169.6]
  2025-11-04T17:21:13.0489883+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 60 (-258) [trend: -113] [173]
  2025-11-04T17:21:35.7022923+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 268 (132) [trend: 147] [121.4]
  2025-11-04T17:21:50.6051107+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 564 (563) [trend: 416] [147.8]
  2025-11-04T17:22:05.0147844+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 684 (515) [trend: 424] [260.4]
  2025-11-04T17:22:19.4276217+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 781 (540) [trend: 418] [363.4]
  2025-11-04T17:22:35.3845189+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 697 (637) [trend: 226] [471.4]
  2025-11-04T17:22:55.6343059+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 757 (489) [trend: 158] [598.8]
  2025-11-04T17:23:09.7646853+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 874 (310) [trend: 177] [696.6]
  2025-11-04T17:23:24.3721123+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1107 (423) [trend: 348] [758.6]
  2025-11-04T17:23:38.6299500+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1366 (585) [trend: 523] [843.2]
  2025-11-04T17:23:52.8382366+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1526 (829) [trend: 566] [960.2]
  2025-11-04T17:24:16.3346551+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1739 (982) [trend: 613] [1126]
  2025-11-04T17:24:30.9831278+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1886 (1012) [trend: 564] [1322.4]
  2025-11-04T17:24:45.4114713+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2062 (955) [trend: 537] [1524.8]
  2025-11-04T17:24:59.4933736+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2503 (1137) [trend: 787] [1715.8]
  2025-11-04T17:25:13.8471385+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2750 (1224) [trend: 807] [1943.2]
  2025-11-04T17:25:34.4050782+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3041 (1302) [trend: 853] [2188]
  2025-11-04T17:25:48.8857546+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2942 (1056) [trend: 494] [2448.4]
  2025-11-04T17:26:03.3328249+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3099 (1037) [trend: 439] [2659.6]
  2025-11-04T17:26:18.2463570+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3108 (605) [trend: 241] [2867]
  2025-11-04T17:26:32.9698856+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3013 (263) [trend: 25] [2988]
  2025-11-04T17:26:55.2890811+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3001 (-40) [trend: -40] [3040.6]
  2025-11-04T17:27:09.8435478+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2856 (-86) [trend: -177] [3032.6]
  2025-11-04T17:27:24.4257139+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2770 (-329) [trend: -245] [3015.4]
  2025-11-04T17:27:38.7124357+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3006 (-102) [trend: 56] [2949.6]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3036 (23) [trend: 107] [2929.2]
  2025-11-04T17:28:14.4188827+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3080 (79) [trend: 146] [2933.8]
  2025-11-04T17:28:28.6993321+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2881 (25) [trend: -69] [2949.6]
  2025-11-04T17:28:43.1389865+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2793 (23) [trend: -162] [2954.6]
  2025-11-04T17:28:57.7000683+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2775 (-231) [trend: -184] [2959.2]
  2025-11-04T17:29:12.1443330+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2721 (-315) [trend: -192] [2913]
  2025-11-04T17:29:31.5420423+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2849 (-231) [trend: -1] [2850]
  2025-11-04T17:29:45.7862360+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3006 (125) [trend: 202] [2803.8]
  2025-11-04T17:30:00.3470387+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2941 (148) [trend: 112] [2828.8]
  2025-11-04T17:30:15.1445325+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2817 (42) [trend: -41] [2858.4]
  2025-11-04T17:30:29.7134456+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 2937 (216) [trend: 70] [2866.8]
  2025-11-04T17:30:53.1430550+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3152 (303) [trend: 242] [2910]
  2025-11-04T17:31:07.5993195+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3376 (370) [trend: 405] [2970.6]
  2025-11-04T17:31:21.8409653+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3510 (569) [trend: 465] [3044.6]
  2025-11-04T17:31:36.7588732+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3577 (760) [trend: 419] [3158.4]
  2025-11-04T17:31:51.8153305+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3901 (964) [trend: 591] [3310.4]
  2025-11-04T17:32:13.2115982+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3918 (766) [trend: 415] [3503.2]
  2025-11-04T17:32:27.8946625+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3600 (224) [trend: -56] [3656.4]
  2025-11-04T17:32:42.5489253+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3492 (-18) [trend: -209] [3701.2]
  2025-11-04T17:32:57.3934579+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3489 (-88) [trend: -209] [3697.6]
  2025-11-04T17:33:13.6410195+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3473 (-428) [trend: -207] [3680]
  2025-11-04T17:33:37.2693835+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3313 (-605) [trend: -281] [3594.4]
  2025-11-04T17:33:51.6407884+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3300 (-300) [trend: -173] [3473.4]
  2025-11-04T17:34:06.1764904+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3484 (-8) [trend: 71] [3413.4]
  2025-11-04T17:34:21.1285992+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3520 (31) [trend: 108] [3411.8]
  2025-11-04T17:34:35.5243652+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3390 (-83) [trend: -28] [3418]
  2025-11-04T17:34:56.5523488+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3482 (169) [trend: 81] [3401.4]
  2025-11-04T17:35:18.2981395+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3189 (-111) [trend: -246] [3435.2]
  2025-11-04T17:35:32.7046332+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3197 (-287) [trend: -216] [3413]
  2025-11-04T17:35:47.3937944+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3042 (-478) [trend: -314] [3355.6]
  2025-11-04T17:36:01.9714981+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3184 (-206) [trend: -76] [3260]
  2025-11-04T17:36:22.6497414+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3191 (-291) [trend: -28] [3218.8]
  2025-11-04T17:36:37.1302955+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3204 (15) [trend: 43] [3160.6]
  2025-11-04T17:36:51.6053969+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3390 (193) [trend: 226] [3163.6]
  2025-11-04T17:37:06.1605615+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3554 (512) [trend: 352] [3202.2]
  2025-11-04T17:37:20.8274151+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3749 (565) [trend: 444] [3304.6]
  2025-11-04T17:37:41.4454873+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3773 (582) [trend: 355] [3417.6]
  2025-11-04T17:37:55.7924656+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3712 (508) [trend: 178] [3534]
  2025-11-04T17:38:10.2553273+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3979 (589) [trend: 343] [3635.6]
  2025-11-04T17:38:24.6693853+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4013 (459) [trend: 260] [3753.4]
  2025-11-04T17:38:38.9517035+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4178 (429) [trend: 333] [3845.2]
  2025-11-04T17:38:59.3939213+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4299 (526) [trend: 368] [3931]
  2025-11-04T17:39:14.1200216+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4528 (816) [trend: 492] [4036.2]
  2025-11-04T17:39:28.9770008+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4577 (598) [trend: 378] [4199.4]
  2025-11-04T17:39:43.4935571+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4442 (429) [trend: 123] [4319]
  2025-11-04T17:39:57.6671210+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4583 (405) [trend: 178] [4404.8]
  2025-11-04T17:40:18.7971872+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4621 (322) [trend: 135] [4485.8]
  2025-11-04T17:40:32.8684828+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4540 (12) [trend: -10] [4550.2]
  2025-11-04T17:40:47.2452232+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4468 (-109) [trend: -85] [4552.6]
  2025-11-04T17:41:01.6039552+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4378 (-64) [trend: -153] [4530.8]
  2025-11-04T17:41:15.8853989+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4550 (-33) [trend: 32] [4518]
  2025-11-04T17:41:36.2585886+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4539 (-82) [trend: 28] [4511.4]
  2025-11-04T17:41:51.0645335+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4637 (97) [trend: 142] [4495]
  2025-11-04T17:42:05.7615788+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4759 (291) [trend: 245] [4514.4]
  2025-11-04T17:42:19.8832755+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4712 (334) [trend: 139] [4572.6]
  2025-11-04T17:42:33.7024657+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 4840 (290) [trend: 201] [4639.4]
  2025-11-04T17:42:55.4670283+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 5197 (658) [trend: 500] [4697.4]
  2025-11-04T17:43:09.6124037+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 5331 (694) [trend: 502] [4829]
  2025-11-04T17:43:23.6914113+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 5227 (468) [trend: 259] [4967.8]
  2025-11-04T17:43:37.8148720+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 5125 (413) [trend: 64] [5061.4]
  2025-11-04T17:43:51.8806799+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 5209 (369) [trend: 65] [5144]

The following shows a major imbalance in message counts. Note that the other control-xx queues do sometimes fill up, but they do not continue on the upward trend until ridiculous 5-digit numbers are registered. Also, the workitems queue can sometimes get behind, but it recovers nicely and has never run away.

  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-workitems' approximate message count: 2132 (-1711) [trend: 539] [1592.6]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-00' approximate message count: 73 (2) [trend: 17] [56.4]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-01' approximate message count: 53 (-45) [trend: -34] [86.8]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-02' approximate message count: 9 (4) [trend: -7] [16]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-03' approximate message count: 13 (-8) [trend: -4] [16.6]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-04' approximate message count: 3 (-13) [trend: -5] [7.8]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-05' approximate message count: 3 (-13) [trend: -2] [5.4]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 3036 (23) [trend: 107] [2929.2]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-07' approximate message count: 3296 (3) [trend: 107] [3188.8]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-08' approximate message count: 4438 (198) [trend: 133] [4304.8]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-09' approximate message count: 2 (-8) [trend: -2] [3.8]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-10' approximate message count: 3870 (15) [trend: 62] [3807.8]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-11' approximate message count: 5 (-14) [trend: -4] [9.2]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-12' approximate message count: 5 (-7) [trend: 0] [4.6]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-13' approximate message count: 2 (-8) [trend: -3] [4.6]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-14' approximate message count: 17 (10) [trend: 9] [8.2]
  2025-11-04T17:27:53.0778462+00:00 ** Queue 'dftaskhub20251030-control-15' approximate message count: 0 (0) [trend: -4] [4]

... and an extreme example:

  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-workitems' approximate message count: 162 (-1477) [trend: -467] [629]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-00' approximate message count: 71 (-617) [trend: -316] [386.8]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-01' approximate message count: 0 (-556) [trend: -289] [289]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-02' approximate message count: 2 (-249) [trend: -89] [91.4]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-03' approximate message count: 0 (-425) [trend: -167] [167.4]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-04' approximate message count: 1 (-4) [trend: -2] [3.4]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-05' approximate message count: 0 (-1) [trend: -2] [2]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 9780 (167) [trend: 149] [9631]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-07' approximate message count: 10206 (-169) [trend: -73] [10279]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-08' approximate message count: 12063 (-106) [trend: -29] [12092.2]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-09' approximate message count: 0 (-7) [trend: -6] [6]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-10' approximate message count: 11339 (-166) [trend: 3] [11336]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-11' approximate message count: 0 (-6) [trend: -4] [3.8]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-12' approximate message count: 2 (-3) [trend: -7] [8.6]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-13' approximate message count: 0 (-6) [trend: -2] [2.4]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-14' approximate message count: 1 (1) [trend: -1] [2]
  2025-11-04T18:05:57.4055059+00:00 ** Queue 'dftaskhub20251030-control-15' approximate message count: 1 (-2) [trend: -2] [2.8]
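The imbalance in a snapshot like the two above can be summarised mechanically by comparing each control queue with the median of all control queues. The sketch below is illustrative only; the `factor` and `floor` cut-offs are arbitrary assumptions, not values taken from this report, and the function name is hypothetical.

```python
from statistics import median


def backed_up_queues(counts, factor=100, floor=1000):
    """Return control queues whose count exceeds max(floor, factor * median)."""
    control = {q: c for q, c in counts.items() if "-control-" in q}
    threshold = max(floor, factor * median(control.values()))
    return sorted(q for q, c in control.items() if c > threshold)


if __name__ == "__main__":
    # Control-queue counts from the 2025-11-04T18:05:57 snapshot.
    values = [71, 0, 2, 0, 1, 0, 9780, 10206, 12063, 0, 11339, 0, 2, 0, 1, 1]
    snapshot = {f"dftaskhub20251030-control-{i:02d}": v for i, v in enumerate(values)}
    snapshot["dftaskhub20251030-workitems"] = 162  # ignored by the filter
    print(backed_up_queues(snapshot))
```

With these cut-offs, both snapshots flag the same four queues: control-06, control-07, control-08 and control-10.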

As indicated before, the queue can "recover" if we stop feeding inputs, but once it gets this bad, the recovery, even with no inputs, can take up to an hour. Obviously, turning off the inputs for an hour is not an option.

Orchestrator totals

It appears we do get some "throttling" in the Dispatchers due to being over the Orchestrator limit on some workers, as shown below.

Image

We wonder if the orchestrator totals have something to do with what we're seeing. Though we don't believe that throttling of this type should slow down "just a few" partitions while the others work fine, one idea I had was that the slower imbalanced queues might be tied to the throttling behavior.

In a later test, we increased the max concurrent orchestrators from the previous 70 to 110. Though this should have staved off the limiting shown, without the full debug logs I don't believe we have a way to verify that it did.

We know that increasing the maximum has not prevented the imbalanced queues problem from occurring.

If deployed to Azure

Available through the MS Ticket (Priority "A") #2510240050002919. Due to banking regs, not available here, even though this is all test data (sorry about that.... we all have our limitations)

Full Debug Logs

The timeframe for a particularly good duplication of the problem, when we had full Debug level logs for DurableTask.Core and DurableTask.AzureStorage, took place as follows. The "threshold" we use here has been found to allow for around a 5 minute recovery, if inputs are shut off immediately. Also note that during the time these queues were backed up, the others worked just fine with about 20-30 queue items being the max we see:

16:36:  Started a 100% load test
17:23:  Queues passed the expected 1500 size with >400 upward trend over the last 5 samples
// 2025-11-04T17:23:52.8382366+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1526 (829) [trend: 566] [960.2]
// 2025-11-04T17:23:52.8382366+00:00 ** Queue 'dftaskhub20251030-control-07' approximate message count: 1574 (826) [trend: 583] [990.8]
// 2025-11-04T17:23:24.3721123+00:00 ** Queue 'dftaskhub20251030-control-08' approximate message count: 1803 (616) [trend: 438] [1365.2]
// 2025-11-04T17:15:10.0191473+00:00 ** Queue 'dftaskhub20251030-control-10' approximate message count: 1586 (654) [trend: 447] [1139.2]
17:59:  Eric woke up and saw the problem occurring, called Test personnel
18:17:  Files stopped dropping, after a delay in reaching test personnel after hours
19:17:  Recovery was achieved
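The threshold used in the timeline above (a count over 1500 with an upward trend over 400) can be checked against the monitoring output with a short script. This is a sketch only: the regular expression is written against the log lines shown in this issue, and the function names are hypothetical.

```python
import re

# Matches lines such as:
#   2025-11-04T17:23:52.8382366+00:00 ** Queue '...-control-06'
#   approximate message count: 1526 (829) [trend: 566] [960.2]
LINE_RE = re.compile(
    r"(?P<ts>\S+) \*\* Queue '(?P<queue>[^']+)' approximate message count: "
    r"(?P<count>-?\d+) \((?P<delta>-?\d+)\) \[trend: (?P<trend>-?\d+)\]"
)


def parse_line(line):
    """Parse one monitoring line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "ts": m["ts"],  # kept as text; the timestamps have 7 fractional digits
        "queue": m["queue"],
        "count": int(m["count"]),
        "trend": int(m["trend"]),
    }


def first_breaches(lines, size=1500, trend=400):
    """Map each queue to the timestamp of its first sample over both limits."""
    breaches = {}
    for line in lines:
        sample = parse_line(line)
        if sample and sample["count"] > size and sample["trend"] > trend:
            breaches.setdefault(sample["queue"], sample["ts"])
    return breaches


if __name__ == "__main__":
    log = [
        "  2025-11-04T17:23:38.6299500+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1366 (585) [trend: 523] [843.2]",
        "  2025-11-04T17:23:52.8382366+00:00 ** Queue 'dftaskhub20251030-control-06' approximate message count: 1526 (829) [trend: 566] [960.2]",
    ]
    print(first_breaches(log))
```

Run over the control-06 samples listed earlier, this reports the 17:23:52 sample as the first breach, in line with the 17:23 entry in the timeline.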

It should be noted, unfortunately, we only have full logs for this one date, as the cost of these logs is around $3000 per 10 hour run.

Things we've tried

Old partitioning

We ran a test with the old lease-based partitioning turned on, in the hope that it would behave differently, since we did not think we had seen this imbalanced behavior before. The test ran for several hours, but then the problem reproduced with the same symptoms.

Creating a new Task Hub before the test run does not help

We've done a few tests where we recreated the task hub prior to the test. This was in hopes that creating a nice new taskhub would stave off the inevitable. Unfortunately, though anecdotal evidence seemed to indicate this helped, we've now seen several tests where the problem has duplicated, even after creating a new hub.

Future Efforts

We are still perusing the logs from 2025-11-04 but hope that, due to the priority, urgency and escalation of the call noted above, the MS team will have a chance to peruse their logs and with luck, discover something not working as intended.

I'll attach further data to this issue in order to clarify the problem and any of our findings which may be helpful to the team.

Labels

P1 (Priority 1)
