[BUG] [v1.10.0-rc3] V2 volume stuck in volume attachment (V2 interrupt mode) #11816

@mcerveny

Description

Describe the Bug

Attaching a V2 volume never finishes when the V2 data engine runs in interrupt mode (poll mode works). The disks are native NVMe.

[longhorn-instance-manager] time="2025-09-18T13:57:29.642404886Z" level=info msg="Creating instance" func="instance.(*Server).InstanceCreate" file="instance.go:116" dataEngine=DATA_ENGINE_V2 name=v2test36-r-a9f41bfc type=replica upgradeRequired=false
[2025-09-18 13:57:29.662629] bdev.c:8723:bdev_open_ext: *NOTICE*: Currently unable to find bdev with name: d83833ab97eacecdeb0a188f234799cfn1/v2test36-r-a9f41bfc
[longhorn-instance-manager] time="2025-09-18T13:57:29.674625164Z" level=info msg="Replica created a new head lvol" func="log.(*SafeLogger).Info" file="log.go:66" lvsName=d83833ab97eacecdeb0a188f234799cfn1 lvsUUID=fb68f02b-b1f2-4ac9-8b35-18621a8e7f93 replicaName=v2test36-r-a9f41bfc
[2025-09-18 13:57:29.718621] tcp.c: 759:nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
[2025-09-18 13:57:29.742770] tcp.c:1103:nvmf_tcp_listen: *NOTICE*: *** NVMe/TCP Target Listening on 10.33.200.8 port 20001 ***
[longhorn-instance-manager] time="2025-09-18T13:57:29.751752709Z" level=info msg="Created replica" func="log.(*SafeLogger).Info" file="log.go:66" lvsName=d83833ab97eacecdeb0a188f234799cfn1 lvsUUID=fb68f02b-b1f2-4ac9-8b35-18621a8e7f93 replicaName=v2test36-r-a9f41bfc
[longhorn-instance-manager] time="2025-09-18T13:57:30.83629071Z" level=info msg="Creating instance" func="instance.(*Server).InstanceCreate" file="instance.go:116" dataEngine=DATA_ENGINE_V2 name=v2test36-e-0 type=engine upgradeRequired=false
[longhorn-instance-manager] time="2025-09-18T13:57:30.837017197Z" level=info msg="Creating engine" func="spdk.(*Engine).Create" file="engine.go:203" engineName=v2test36-e-0 frontend=spdk-tcp-blockdev initiatorAddress=10.33.200.8 portCount=1 replicaAddressMap="map[v2test36-r-23f3f887:10.33.200.4:20001 v2test36-r-a28dabf0:10.33.200.3:20001 v2test36-r-a9f41bfc:10.33.200.8:20001]" salvageRequested=false targetAddress=10.33.200.8 volumeName=v2test36
[longhorn-instance-manager] time="2025-09-18T13:57:30.840589422Z" level=info msg="Creating both initiator and target instances" func="log.(*SafeLogger).Info" file="log.go:66" engineName=v2test36-e-0 frontend=spdk-tcp-blockdev volumeName=v2test36
[2025-09-18 13:57:30.842613] bdev.c:8723:bdev_open_ext: *NOTICE*: Currently unable to find bdev with name: v2test36-e-0
[2025-09-18 13:57:30.850597] bdev_nvme.c:7088:spdk_bdev_nvme_delete: *ERROR*: Failed to find NVMe bdev controller
[2025-09-18 13:57:30.858604] bdev_nvme.c:6762:spdk_bdev_nvme_create: *NOTICE*: Updating global NVMe transport type (g_nvme_trtype) from PCIe to TCP (base-name: v2test36-r-a9f41bfc)
[2025-09-18 13:57:30.917166] nvme_transport.c: 580:nvme_qpair_connect_completion_cb: *NOTICE*: NVMe qpair 0x3522e00 connected successfully.

Expected next log line, which builds the raid1 (taken from a V2 volume in poll mode):

[longhorn-instance-manager] time="2025-09-18T12:44:08.448979351Z" level=info msg="Connecting all available replicas map[v2test33-r-007d8fdc:0xc001183a10 v2test33-r-67ddf076:0xc001183410 v2test33-r-b48a5efb:0xc001302bd0], then launching raid during engine creation" func="log.(*SafeLogger).Infof" file="log.go:73" engineName=v2test33-e-0 frontend=spdk-tcp-blockdev initiatorIP=10.33.200.4 replicaStatusMap="map[v2test33-r-007d8fdc:0xc001183a10 v2test33-r-67ddf076:0xc001183410 v2test33-r-b48a5efb:0xc001302bd0]" targetIP=10.33.200.4 volumeName=v2test33
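
To check a saved instance-manager log for the same symptom, the comparison above can be automated. The following is a minimal sketch, not part of Longhorn: it assumes the two message strings appear exactly as in the logs in this report ("Creating engine" and "then launching raid during engine creation") and that both lines carry an `engineName=` field. The function name `engines_missing_raid_launch` is hypothetical.

```python
import re
import sys

# Message fragments copied from the logs in this report (assumed stable).
ENGINE_START = 'msg="Creating engine"'
RAID_LAUNCH = "then launching raid during engine creation"
ENGINE_NAME = re.compile(r"\bengineName=(\S+)")


def engines_missing_raid_launch(lines):
    """Return engine names that logged "Creating engine" but no raid launch line."""
    pending = []
    for line in lines:
        match = ENGINE_NAME.search(line)
        if match is None:
            continue  # SPDK lines and other entries carry no engineName field
        name = match.group(1)
        if ENGINE_START in line and name not in pending:
            pending.append(name)
        elif RAID_LAUNCH in line and name in pending:
            pending.remove(name)
    return pending


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_raid_launch.py instance-manager.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for engine in engines_missing_raid_launch(log):
            print(engine)
```

For the interrupt-mode log above, this would be expected to report v2test36-e-0, since its engine creation is never followed by the raid launch message. An engine that is still attaching when the log is captured is reported as well, so a hit is only a hint.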

To Reproduce

Expected Behavior

The volume attaches successfully.

Support Bundle for Troubleshooting

The support bundle contains many attempts. The last one is volume "v2test36": created at "13:56:*", attach attempted at "13:57:*", delete attempted at "14:09:*", V2 instance-managers restarted at "14:11:*", orphaned V2 volumes deleted at "14:13:*".

supportbundle_09afb238-c96d-4010-9f95-f6de59c721df_2025-09-18T14-20-06Z.zip

Environment

  • Longhorn version: v1.10.0-rc3
  • Impacted volume (PV): v2test36
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: harvester v1.6.0 -> rke2
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLE Micro 5.5 / Harvester v1.6.0
    • Kernel version: 5.14.21-150500.55.116-default
    • CPU per node: 8C/16T
    • Memory per node: >=64GB
    • Disk type (e.g. SSD/NVMe/HDD): 2xNVMe
    • Network bandwidth between the nodes (Gbps): LACP-2x2.5Gb/s
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal NVMe
  • Number of Longhorn volumes in the cluster: many V1, none V2

Additional context

No response

Workaround and Mitigation

No response

Metadata

Assignees

Labels

  • area/spdk: SPDK upstream/downstream
  • area/v2-data-engine: v2 data engine (SPDK)
  • kind/bug
  • priority/0: Must be implemented or fixed in this release (managed by PO)
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • require/qa-review-coverage: Require QA to review coverage

Type

Projects

Status

Resolved

Status

Implement

Milestone

Relationships

None yet

Development

No branches or pull requests
