MicroK8s HA Cluster on Raspberry Pi 5 Fails to Automatically Recover Pods with VolumeClaim on MicroCeph #5131

@royolsen

Description

Summary:

A three-node MicroK8s high-availability cluster running on Raspberry Pi 5s with Ubuntu 24.04 LTS and using MicroCeph for storage does not correctly self-heal stateful applications after a node failure. While the failover process initiates correctly, the rescheduled pod becomes indefinitely stuck in the ContainerCreating state on the new node. The issue appears to be related to the final step of mounting the Ceph RBD volume. This behavior is consistent and reproducible.

Environment:

Hardware: 3x Raspberry Pi 5 (8GB Model)

Storage: NVMe SSD on each node

Operating System: Ubuntu Server 24.04.2 LTS (on all nodes)

MicroK8s Version: 1.32.3

MicroCeph Version: 19.2.0-lts

Networking: Calico CNI (default with MicroK8s)

Steps to Reproduce:

Set up a three-node cluster (see the sketch after these steps):

Install and update Ubuntu Server 24.04.2 on three Raspberry Pi 5 nodes.

Install the microk8s and microceph snaps on all three nodes.

Initialize a three-node HA MicroK8s cluster using microk8s add-node.

Initialize a three-node MicroCeph cluster by adding a dedicated partition from each node using microceph disk add.
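
A rough sketch of that bring-up; the hostnames (node-1 to node-3) and the partition path (/dev/nvme0n1p3) are placeholders, not the exact values used here:

```bash
# On every node:
sudo snap install microk8s --classic
sudo snap install microceph

# MicroK8s HA: on node-1, print a join command for each additional node ...
sudo microk8s add-node
# ... then run the printed "microk8s join <ip>:25000/<token>" on node-2 and node-3.

# MicroCeph: bootstrap on node-1, then register and join the other nodes.
sudo microceph cluster bootstrap                  # node-1
sudo microceph cluster add node-2                 # node-1, prints a join token
sudo microceph cluster join <token-from-node-1>   # node-2 (repeat for node-3)

# Hand each node's dedicated NVMe partition to Ceph as an OSD.
sudo microceph disk add /dev/nvme0n1p3            # on every node
```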

Integrate MicroK8s with MicroCeph:

Run sudo microk8s enable rook-ceph.

Run sudo microk8s connect-external-ceph (see the sketch below).
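
The two commands plus a quick sanity check; the verification step is my own addition, assuming the ceph-rbd StorageClass should exist afterwards:

```bash
# Enable the Rook operator and point it at the external MicroCeph cluster.
sudo microk8s enable rook-ceph
sudo microk8s connect-external-ceph

# Sanity check: the ceph-rbd StorageClass should now be listed.
microk8s kubectl get storageclass
```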

Deploy a Stateful Application:

Create a PersistentVolumeClaim (PVC) using the ceph-rbd StorageClass.

Create a Deployment with replicas: 1 that mounts the PVC. Use a simple image like busybox that writes a file to the volume and then sleeps (an example manifest is sketched below).
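
A minimal example of the kind of workload used; the object names (test-pvc, test-app) and the 1Gi size are illustrative, only the ceph-rbd StorageClass name comes from the setup above:

```bash
cat <<'EOF' | microk8s kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-rbd
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: busybox
        image: busybox
        # Write to the volume, then sleep so the pod keeps running.
        command: ["sh", "-c", "date >> /data/heartbeat.txt && sleep 3600000"]
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: test-pvc
EOF
```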

Simulate Node Failure:

Identify which node the pod is running on (e.g., node-3) using microk8s kubectl get pod -o wide.

Power off that node (node-3), as sketched below.
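
For example (how the node is powered off does not matter; shutting it down over SSH is just one way to simulate the failure):

```bash
# Find the node hosting the pod (NODE column), then power that node off.
microk8s kubectl get pod -o wide
ssh node-3 sudo poweroff        # or simply cut power to the Pi
```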

Observe the Failover Process:

On a healthy node (e.g., node-1), watch the pods with watch microk8s kubectl get pod -o wide.

The pod on the failed node (node-3) correctly enters a Terminating state after the ~5-minute eviction grace period.

A new replacement pod is correctly scheduled onto a healthy node (e.g., node-2).

The new pod enters the ContainerCreating state and, as expected at this stage, shows a Multi-Attach error in its events for ~5-6 minutes (inspection commands sketched below).
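
Commands used to observe the failover; the VolumeAttachment and events checks are extra inspection steps, not part of the reproduction itself:

```bash
# Watch the failover from a healthy node.
watch microk8s kubectl get pod -o wide

# Inspect the replacement pod and any attachment still bound to the dead node.
microk8s kubectl describe pod <new-pod-name>
microk8s kubectl get volumeattachment
microk8s kubectl get events --sort-by=.lastTimestamp
```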

Actual Behavior (The Bug):

After the Multi-Attach error clears from the pod's events, the pod remains indefinitely stuck in the ContainerCreating state.

Running microk8s kubectl describe pod on the new pod shows no further events, indicating a silent failure.

The kubelet logs on the target node (node-2) contain no entries or errors related to the pod startup.

Crucially, deploying a similar Deployment (with its own pod and PersistentVolumeClaim) onto the same target node (node-2) works perfectly, proving that the node's container runtime and CNI networking are fundamentally healthy.

The issue appears to be specific to the final step of mounting the persistent Ceph RBD volume into the rescheduled stateful pod after a node failure event.
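
Where the failure was (or could be) looked for; note that on MicroK8s the kubelet runs inside the kubelite daemon, and the exact namespaces/names of the Ceph CSI components may differ per installation:

```bash
# Kubelet logs on the target node (kubelet is bundled into the kubelite daemon):
journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep -i rbd

# Ceph CSI components installed by the rook-ceph addon, and the attachment state:
microk8s kubectl get pods -A | grep -E 'rook|csi'
microk8s kubectl get volumeattachment
```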

Workaround:

There is no real workaround, although bringing the failed node back online (i.e., in the case of this simulated failure, turning the power back on) clears the issue.
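
For completeness, a sketch of the kind of manual intervention commonly suggested for stale attachments after a node loss; this has not been verified against this setup, and the object names are placeholders:

```bash
# Force-remove the pod stuck in Terminating on the dead node ...
microk8s kubectl delete pod <old-pod-on-node-3> --force --grace-period=0

# ... and, if one is still bound to the dead node, the stale VolumeAttachment.
microk8s kubectl get volumeattachment
microk8s kubectl delete volumeattachment <attachment-on-node-3>
```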

Expected Behavior:

The rescheduled stateful pod should automatically start on a healthy node after the storage detachment timeout without requiring any manual intervention.

Conclusion:

This reproducible issue indicates a critical flaw in the high-availability and self-healing capabilities of a standard MicroK8s + MicroCeph deployment on this ARM64 hardware/software stack. The failover process for stateful workloads is not fully automatic and requires manual intervention, which defeats the purpose of an HA cluster.

Inspection report:

Created while the cluster was in the failed state, with one node down and the pod stuck in ContainerCreating:

inspection-report-20250703_162617.tar.gz
