Support bundle stuck in collecting status (should use multi-part upload to sled agent) #8556

Open
@askfongjojo

I created a new bundle on rack3 (the first ever in this environment) and its state did not progress beyond collecting, even after many hours.

oxide --profile colo bundle list
[
  {
    "id": "c6b507df-cb67-47c4-8887-ba9fc0fc0034",
    "reason_for_creation": "Created by external API",
    "state": "collecting",
    "time_created": "2025-07-09T01:01:12.246530Z"
  }
]

I looked up the bundle's dataset info in the database:

root@[fd00:1122:3344:116::3]:32221/omicron> select * from support_bundle;
                   id                  |         time_created         |   reason_for_creation   | reason_for_failure |   state    |               zpool_id               |              dataset_id              |            assigned_nexus
---------------------------------------+------------------------------+-------------------------+--------------------+------------+--------------------------------------+--------------------------------------+---------------------------------------
  c6b507df-cb67-47c4-8887-ba9fc0fc0034 | 2025-07-09 01:01:12.24653+00 | Created by external API | NULL               | collecting | de682b18-afaf-4d53-b62e-934f6bd4a1f8 | 003d27e0-57e4-4d55-963e-af47e4e526f1 | 95ebe94d-0e68-421d-9260-c30bd7fe4bd6
(1 row)

The dataset that was supposed to receive the bundle remained empty:

BRM42220015 # ls -l /pool/ext/de682b18-afaf-4d53-b62e-934f6bd4a1f8/crypt/debug/c6b507df-cb67-47c4-8887-ba9fc0fc0034/
total 0

The log of the assigned Nexus showed that the collector background task was doing the work:

angela@castle /staff/angela $ grep c6b507df oxide-nexus.log.1752023701 | head -30 | looker
01:01:18.330Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): SupportBundleCollector: Found bundle to collect
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    bundles_in_queue = 1
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:364
01:01:18.330Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Collecting bundle as local file
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:562
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/local/all-sp-ids
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/local/all-sp-ids", status: 200, headers: {"content-type": "application/json", "x-request-id": "46a2fcee-08ee-49a9-8f72-4329dd215192", "content-length": "929", "date": "Wed, 09 Jul 2025 01:01:17 GMT"} })
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/7/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/24/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/switch/0/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/28/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/20/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/switch/1/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/29/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/22/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/0/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/5/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/4/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/6/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/9/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/power/0/task-dump
01:01:18.368Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/31/task-dump
01:01:18.368Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/13/task-dump
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/sled/7/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "bf0a7286-96f8-41aa-9ee7-3dc58ee4f194", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/power/0/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "b8704e84-5aeb-43e8-8772-63ecfec50588", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/3/task-dump
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    background_task = support_bundle_collector
    body = None
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/2/task-dump
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/sled/28/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "4c05f0a6-8541-40c1-b34d-0366f22fa767", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/sled/5/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "838690ad-3c40-4bd7-90a8-5f416ecc19ad", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/sled/6/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "3d38a033-ce0f-4587-b977-a16ca9e86162", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/sled/29/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "1ab05012-1d88-4098-92b4-6132508078e9", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/sled/0/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "78360e15-080d-4f1a-abeb-a847c66cd10c", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/switch/0/task-dump", status: 200, headers: {"content-type": "application/json", "x-request-id": "c92d82c0-63c2-4500-be40-98fcd18a0016", "content-length": "1", "date": "Wed, 09 Jul 2025 01:01:18 GMT"} })
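To see which task-dump requests never got a successful response, the request and response entries above can be paired by URL. The following is only a rough sketch: it assumes the looker-formatted layout shown in the excerpt (`uri = ...` under a `client request` entry, `result = Ok(Response { url: "...", status: ... })` under a `client response` entry). `SAMPLE` is an abbreviated copy of a few entries, and `unanswered_requests` is a hypothetical helper, not part of Nexus or looker.

```python
import re

# Abbreviated copy of a few entries from the excerpt above (headers trimmed).
SAMPLE = '''\
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/7/task-dump
01:01:18.367Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client request
    method = GET
    uri = http://[fd00:1122:3344:11f::2]:12225/sp/sled/24/task-dump
01:01:18.373Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): client response
    result = Ok(Response { url: "http://[fd00:1122:3344:11f::2]:12225/sp/sled/7/task-dump", status: 200, headers: {"content-length": "1"} })
'''

# Field line logged under a "client request" entry.
REQUEST_URI = re.compile(r"^\s+uri = (\S+)$")
# URL embedded in the Ok(Response { ... }) logged under a "client response" entry.
RESPONSE_URL = re.compile(r'Response \{ url: "([^"]+)", status: \d+')


def unanswered_requests(log_text):
    """Return requested URIs, in log order, that have no Ok(Response) entry."""
    requested = []
    answered = set()
    for line in log_text.splitlines():
        match = REQUEST_URI.match(line)
        if match:
            requested.append(match.group(1))
            continue
        match = RESPONSE_URL.search(line)
        if match:
            answered.add(match.group(1))
    return [uri for uri in requested if uri not in answered]


if __name__ == "__main__":
    for uri in unanswered_requests(SAMPLE):
        print(uri)
```

Only `Ok(Response ...)` entries count as answers here, so a request that failed or timed out would be listed as unanswered. Whether failed requests are logged in some other form is not visible in this excerpt.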

In between, the collector hit some errors:

01:01:33.380Z ERRO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed to capture task dumps
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    error = SP power 1: failed to get task dump count from SP: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "424eb413-81c0-4c33-82c0-0a029020114d", "content-length": "198", "date": "Wed, 09 Jul 2025 01:01:20 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Power, slot: 1 }: no SP discovered", request_id: "424eb413-81c0-4c33-82c0-0a029020114d" }
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:1031
01:01:33.380Z ERRO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed to capture task dumps
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    error = SP sled 24: failed to get task dump count from SP: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "be60285f-baf5-4b8c-b419-6ac16e4efc58", "content-length": "198", "date": "Wed, 09 Jul 2025 01:01:20 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 24 }: no SP discovered", request_id: "be60285f-baf5-4b8c-b419-6ac16e4efc58" }
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:1031
01:01:33.381Z ERRO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed to capture task dumps
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    error = SP sled 9: failed to get task dump count from SP: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "d6f417ea-c5ec-419d-bedf-e006360a9ebf", "content-length": "197", "date": "Wed, 09 Jul 2025 01:01:20 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 9 }: no SP discovered", request_id: "d6f417ea-c5ec-419d-bedf-e006360a9ebf" }
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:1031
01:01:33.381Z ERRO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed to capture task dumps
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    error = SP sled 22: failed to get task dump count from SP: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "0b44e793-3d95-4774-bbf4-f2628d3cae40", "content-length": "224", "date": "Wed, 09 Jul 2025 01:01:31 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 22 }: RPC call failed (gave up after 5 attempts)", request_id: "0b44e793-3d95-4774-bbf4-f2628d3cae40" }
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:1031
01:01:33.381Z ERRO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed to capture task dumps
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    error = SP sled 4: failed to get task dump count from SP: Communication Error: error sending request for url (http://[fd00:1122:3344:11f::2]:12225/sp/sled/4/task-dump): error sending request for url (http://[fd00:1122:3344:11f::2]:12225/sp/sled/4/task-dump): operation timed out
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:1031
01:01:33.381Z ERRO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed to capture task dumps
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    error = SP sled 18: failed to get task dump count from SP: Communication Error: error sending request for url (http://[fd00:1122:3344:11f::2]:12225/sp/sled/18/task-dump): error sending request for url (http://[fd00:1122:3344:11f::2]:12225/sp/sled/18/task-dump): operation timed out
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:1031
01:01:33.381Z ERRO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed to capture task dumps
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    error = SP sled 30: failed to get task dump count from SP: Communication Error: error sending request for url (http://[fd00:1122:3344:11f::2]:12225/sp/sled/30/task-dump): error sending request for url (http://[fd00:1122:3344:11f::2]:12225/sp/sled/30/task-dump): operation timed out
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:1031
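The errors above fall into three kinds: "no SP discovered", "RPC call failed (gave up after 5 attempts)", and "operation timed out". A small script can group the affected SPs by kind. This is a sketch under assumptions: it relies on the `error = SP <type> <slot>: failed to get task dump count from SP: ...` wording shown above, `SAMPLE` is an abbreviated copy of four of the error lines (headers and request IDs elided), and `group_failures` and the category labels are made up for illustration.

```python
import re

# Abbreviated copies of four error lines from the excerpt above.
SAMPLE = '''\
    error = SP power 1: failed to get task dump count from SP: Error Response: status: 503 Service Unavailable; message: "error communicating with SP SpIdentifier { typ: Power, slot: 1 }: no SP discovered"
    error = SP sled 24: failed to get task dump count from SP: Error Response: status: 503 Service Unavailable; message: "error communicating with SP SpIdentifier { typ: Sled, slot: 24 }: no SP discovered"
    error = SP sled 22: failed to get task dump count from SP: Error Response: status: 503 Service Unavailable; message: "error communicating with SP SpIdentifier { typ: Sled, slot: 22 }: RPC call failed (gave up after 5 attempts)"
    error = SP sled 4: failed to get task dump count from SP: Communication Error: error sending request for url (http://[fd00:1122:3344:11f::2]:12225/sp/sled/4/task-dump): operation timed out
'''

ERROR_LINE = re.compile(
    r"error = SP (\w+) (\d+): failed to get task dump count from SP: (.*)"
)


def classify(detail):
    """Map the error detail text to a coarse, illustrative category label."""
    if "no SP discovered" in detail:
        return "no SP discovered"
    if "gave up after" in detail:
        return "RPC retries exhausted"
    if "operation timed out" in detail:
        return "request timed out"
    return "other"


def group_failures(log_text):
    """Return {category: ["<type> <slot>", ...]} for each matching error line."""
    groups = {}
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            sp_type, slot, detail = match.groups()
            groups.setdefault(classify(detail), []).append(f"{sp_type} {slot}")
    return groups


if __name__ == "__main__":
    for cause, sps in group_failures(SAMPLE).items():
        print(f"{cause}: {', '.join(sps)}")
```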

Collection was then reported as completed a little over 6 minutes after it started:

01:07:36.752Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Bundle Collection completed
    background_task = support_bundle_collector
    bundle = c6b507df-cb67-47c4-8887-ba9fc0fc0034
    file = nexus/src/app/background/tasks/support_bundle_collector.rs:485
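For reference, the elapsed time between the "Found bundle to collect" entry (01:01:18.330Z) and the "Bundle Collection completed" entry (01:07:36.752Z) can be computed directly from the two timestamps. This sketch assumes both entries fall on the same day, since the looker output shows only the time of day. `elapsed_seconds` is a hypothetical helper.

```python
from datetime import datetime

# Time-of-day format used in the looker output, e.g. "01:01:18.330Z".
LOG_TIME_FORMAT = "%H:%M:%S.%fZ"


def elapsed_seconds(start, end):
    """Seconds between two same-day log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    finished = datetime.strptime(end, LOG_TIME_FORMAT)
    return (finished - started).total_seconds()


if __name__ == "__main__":
    seconds = elapsed_seconds("01:01:18.330Z", "01:07:36.752Z")
    minutes, remainder = divmod(seconds, 60)
    print(f"{int(minutes)} min {remainder:.3f} s")
```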

It's unclear whether these errors prevented the bundle from being persisted, or whether the collector hit other errors that contributed to the stuck status.
