
ceph-csi for rbd cannot mount image after upgrade k8s to v1.31 #5066

@hasonhai

Description


Describe the bug

After we upgraded the K8S cluster from 1.30.4 to 1.31.4, csi-rbdplugin cannot mount the image anymore. It still works fine on nodes that still run kubelet v1.30.4.
We were initially on ceph-csi v3.12.1 when the error occurred, so we tried upgrading to v3.13.0 to see whether that would fix the issue, but the behavior is the same.

Environment details

  • Image/version of Ceph CSI driver : v3.13.0
  • Helm chart version :
  • Kernel version : RHEL9 5.14.0-503.19.1.el9_5.x86_64
  • Mounter used for mounting PVC (for CephFS it's fuse or kernel; for rbd it's krbd or rbd-nbd) : krbd
  • Kubernetes cluster version : v1.31.4
  • Ceph cluster version : v18.2.4

Steps to reproduce

Steps to reproduce the behavior:

  1. Setup details
    Storage class:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2021-04-01T13:37:08Z"
  name: dynamic-ceph-storage
  resourceVersion: "177756013"
  uid: a533c5dc-402c-4ad4-9a81-c543accbd954
mountOptions:
- nodelalloc
parameters:
  clusterID: --masked--
  csi.storage.k8s.io/controller-expand-secret-name: ceph-user-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: access-control
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/node-stage-secret-name: ceph-user-secret
  csi.storage.k8s.io/node-stage-secret-namespace: access-control
  csi.storage.k8s.io/provisioner-secret-name: ceph-user-secret
  csi.storage.k8s.io/provisioner-secret-namespace: access-control
  imageFeatures: layering
  pool: k8s-sharedpool
provisioner: rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

User permission:

[client.kube]
        key = --masked--
        caps mon = "allow r"
        caps osd = "allow class-read object_prefix rbd_children, allow rwx pool=k8s-sharedpool"

We also tried the new capabilities from the docs, but it did not help (see the sketch after the steps below for how they could be applied):

[client.newkube]
        key = --masked--
        caps mgr = "profile rbd pool=k8s-sharedpool"
        caps mon = "profile rbd"
        caps osd = "profile rbd pool=k8s-sharedpool"
  2. Deployment to trigger the issue '....'
  3. See error
    The pod is stuck in the Init stage and reports these events:
 Normal   Scheduled               95s                default-scheduler        Successfully assigned logging-system/aap-es-data-1 to defr4app510
  Warning  FailedAttachVolume      95s                attachdetach-controller  Multi-Attach error for volume "pvc-0091ed72-b8d3-4642-9c65-cb45ddfc328e" Volume is already exclusively attached to one node
  Normal   SuccessfulAttachVolume  85s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-0091ed72-b8d3-4642-9c65-cb45ddfc328e"
  Warning  FailedMount             18s (x8 over 84s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-0091ed72-b8d3-4642-9c65-cb45ddfc328e" : rpc error: code = Internal desc = exi
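
A minimal sketch of how the profile-based capabilities shown in the setup above could be applied to the existing CSI user, assuming client.kube is the user referenced by ceph-user-secret (the entity name and pool here are taken from this report; adjust for your environment):

  # Hypothetical example: switch client.kube to the rbd profiles from the ceph-csi docs
  ceph auth caps client.kube mon 'profile rbd' osd 'profile rbd pool=k8s-sharedpool' mgr 'profile rbd pool=k8s-sharedpool'
  # Verify the resulting capabilities
  ceph auth get client.kube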

Actual results

The node can map the block device but cannot mount it. From the logs, I think the driver tries to read the block device's filesystem information with the blkid command, but the command fails with "Operation not permitted". Everything works fine when we have kubelet v1.30.
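
One rough way to check whether the restriction comes from the node itself would be to map the image manually on the v1.31 node and run the same blkid probe the driver attempts. This is only a sketch: the image name is copied from the logs below, and the --id/--keyring values are assumptions that must match the user in ceph-user-secret.

  # Map the image on the affected node with the same Ceph user (hypothetical credentials path)
  rbd map k8s-sharedpool/csi-vol-fe3ca7ee-580e-11ec-b976-a289cdd026fa --id kube --keyring /etc/ceph/ceph.client.kube.keyring
  # Probe the device exactly as the driver does
  blkid -p -s TYPE -s PTTYPE -o export /dev/rbd0
  # Look for kernel or SELinux denials if the probe fails (ausearch requires auditd)
  dmesg | tail
  ausearch -m avc -ts recent
  # Clean up
  rbd unmap /dev/rbd0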

Expected behavior

The node can map and mount the block device and provide it to the pods.

Logs

If the issue is in PVC mounting please attach complete logs of below containers.

  • csi-rbdplugin/csi-cephfsplugin and driver-registrar container logs from
    plugin pod from the node where the mount is failing.
I0109 09:44:29.224808  941792 nodeserver.go:422] ID: 327 Req-ID: 0001-0024-4d3a09c7-d8d2-4927-91cd-08ca6601d0b2-0000000000000007-fe3ca7ee-580e-11ec-b976-a289cdd026fa rbd image: k8s-sharedpool/csi-vol-fe3ca7ee-580e-11ec-b976-a289cdd026fa was successfully mapped at /dev/rbd0
I0109 09:44:29.224926  941792 mount_linux.go:577] Attempting to determine if disk "/dev/rbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/rbd0])
I0109 09:44:29.227079  941792 mount_linux.go:580] Output: "blkid: error: /dev/rbd0: Operation not permitted\n"
E0109 09:44:29.229984  941792 nodeserver.go:825] ID: 327 Req-ID: 0001-0024-4d3a09c7-d8d2-4927-91cd-08ca6601d0b2-0000000000000007-fe3ca7ee-580e-11ec-b976-a289cdd026fa failed to run mkfs.ext4 ([-m0 -Enodiscard,lazy_itable_init=1,lazy_journal_init=1 /dev/rbd0]) error: exit status 1, output: mke2fs 1.46.5 (30-Dec-2021)
mkfs.ext4: Operation not permitted while trying to determine filesystem size
I0109 09:44:29.311555  941792 cephcmds.go:105] ID: 327 Req-ID: 0001-0024-4d3a09c7-d8d2-4927-91cd-08ca6601d0b2-0000000000000007-fe3ca7ee-580e-11ec-b976-a289cdd026fa command succeeded: rbd [unmap /dev/rbd0 --device-type krbd --options noudev]
E0109 09:44:29.311786  941792 utils.go:245] ID: 327 Req-ID: 0001-0024-4d3a09c7-d8d2-4927-91cd-08ca6601d0b2-0000000000000007-fe3ca7ee-580e-11ec-b976-a289cdd026fa GRPC error: rpc error: code = Internal desc = exit status 1
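
For comparison, the same probe could be run from inside the csi-rbdplugin container on the affected node (where the failing blkid call is executed) while a device is still mapped. The namespace and pod name below are placeholders, and the container name may differ depending on how the plugin is deployed:

  kubectl -n <csi-namespace> exec <csi-rbdplugin-pod-on-node> -c csi-rbdplugin -- blkid -p -s TYPE -s PTTYPE -o export /dev/rbd0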

Note: If it's an rbd issue please provide only rbd-related logs; if it's a cephFS issue please provide cephFS logs.
