This Ansible-based project provisions Erhhung's high-availability Kubernetes cluster at home, named homelab, and deploys services for monitoring various IoT appliances as well as for hosting other personal projects, including self-hosted LLMs and RAG pipelines that enable multi-source hybrid search and agentic automation over a local knowledge base containing vast amounts of personal and sensor data.
The approach taken for all service deployments is to treat the clusters as a production environment (to the extent possible with the limited resources and scaling capacity of a few mini PCs). That means TLS everywhere, authenticated user access, metrics scraping, and configured dashboards and alerts.
The top-level Ansible playbook `main.yml`, run by `play.sh`, will provision 7 VM hosts (`rancher` and `k8s1`..`k8s6`) in the existing XCP-ng Homelab pool, all running Ubuntu Server 24.04 Minimal with no customizations besides basic networking and an authorized SSH key for user `erhhung`.
A single-node K3s Kubernetes cluster will be installed on host `rancher`, with Rancher Server running on that cluster, and a 6-node RKE2 Kubernetes cluster with a high-availability control plane behind a virtual IP will be installed on hosts `k8s1`..`k8s6`. MetalLB will be installed and configured in BGP mode on the 6-node cluster to load-balance external traffic across cluster nodes using ECMP routing provided by pfSense and FRR.
Longhorn and NFS storage provisioners will be installed in each cluster to manage a pool of LVM logical volumes on each node, and to expand the overall storage capacity onto the QNAP NAS. MinIO will also be installed, serving as S3-compatible object storage backed by NFS volumes on QNAP.
All cluster services will be provisioned with TLS certificates from Erhhung's private CA server at pki.fourteeners.local (or its faster mirror at cosmos.fourteeners.local) with the help of cert-manager and Step CA.
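Outside the clusters, a certificate can also be requested directly from the private CA with the `step` CLI. A minimal sketch, assuming the CA has already been set up (the fingerprint, subject, file names, and lifetime below are illustrative placeholders):

```bash
# one-time: trust the private Step CA (fingerprint is a placeholder)
step ca bootstrap --ca-url https://pki.fourteeners.local --fingerprint <root-ca-fingerprint>

# request a leaf certificate from the CA
step ca certificate "myservice.fourteeners.local" myservice.crt myservice.key --not-after 720h
```

Inside the clusters, cert-manager issues and renews these certificates automatically.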
-  K3s Kubernetes Cluster — lightweight Kubernetes distro for resource-constrained environments
    - Install on the `rancher` host using the official install script
-  Rancher Cluster Manager — provision (or import), manage, and monitor Kubernetes clusters
    - Install on K3s cluster using the `rancher` Helm chart
-  RKE2 Kubernetes Cluster — Kubernetes distribution with focus on security and compliance
    - Install on hosts `k8s1`..`k8s6` using the RKE2 Ansible Role with HA mode enabled
-  MetalLB Load Balancer — network load-balancer for "bare metal" Kubernetes clusters
    - Install on main RKE cluster using Bitnami's `metallb` Helm chart
-  Certificate Manager — X.509 certificate management for Kubernetes
    - Install on K3s and RKE clusters using the `cert-manager` Helm chart (see the example install after this list)
    - Connect to Step CA `pki.fourteeners.local` using the `step-issuer` Helm chart
    - Connect to Step CA `pki.fourteeners.local` as an ACME `ClusterIssuer`
-  Node Feature Discovery — label nodes with available hardware features, like GPUs
    - Install on K3s and RKE clusters using the `node-feature-discovery` Helm chart
    - Install Intel Device Plugins using the `intel-device-plugins-operator` Helm chart
    - Install NVIDIA GPU Operator on RKE cluster ... when I procure an NVIDIA card :(
-  Wave Config Monitoring — ensure pods run with up-to-date `ConfigMaps` and `Secrets`
    - Install on K3s and RKE clusters using the `wave` Helm chart
-  Longhorn Block Storage — distributed block storage for Kubernetes
    - Install on main RKE cluster using the `longhorn` Helm chart
-  NFS Dynamic Provisioner — create persistent volumes on NFS shares
    - Install on K3s and RKE clusters using the `nfs-subdir-external-provisioner` Helm chart
-  MinIO Object Storage — S3-compatible object storage with console
    - Install on main RKE cluster using the MinIO Operator and MinIO Tenant Helm charts
 
-  Velero Backup & Restore — back up and restore persistent volumes
    - Install on main RKE cluster using the `velero` Helm chart
    - Install Velero Dashboard using the `velero-ui` Helm chart
-  Harbor Container Registry — private OCI container and Helm chart registry
    - Install on K3s cluster using the `harbor` Helm chart
-  OpenSearch Logging Stack — aggregate and filter logs using OpenSearch and Fluent Bit
    - Install on main RKE cluster using the `opensearch` and `opensearch-dashboards` Helm charts
    - Install Fluent Bit using the `fluent-operator` Helm chart and `FluentBit` CR
-  PostgreSQL Database — SQL database used by Keycloak and other applications
    - Install on main RKE cluster using Bitnami's `postgresql-ha` Helm chart
-  Keycloak IAM & OIDC Provider — identity and access management and OpenID Connect provider
    - Install on main RKE cluster using the `keycloakx` Helm chart
-  Valkey Key/Value Store — Redis-compatible key/value store
    - Install on main RKE cluster using the `valkey-cluster` Helm chart
-  Prometheus Monitoring Stack — Prometheus (via Operator), Thanos sidecar, and Grafana
    - Install on main RKE cluster using the `kube-prometheus-stack` Helm chart
    - Add authentication to Prometheus and Alertmanager UIs using `oauth2-proxy` sidecar
    - Install other Thanos components using Bitnami's `thanos` Helm chart for global querying
    - Enable the OTLP receiver endpoint for metrics (when needed)
-  Istio Service Mesh with Kiali Console — secure, observe, trace, and route traffic between workloads
    - Install on main RKE cluster using the `istioctl` CLI
    - Install Kiali using the `kiali-operator` Helm chart and `Kiali` CR
-  HashiCorp Vault and External Secrets Operator — secure secrets management and synchronization
    - Install on main RKE cluster using the `vault` and `external-secrets` Helm charts
-  GitLab CI/CD Platform — run CI/CD pipelines for local deployments
    - Install on main RKE cluster using the `gitlab` Helm chart
    - Install GitLab CI Pipelines Exporter using the `gitlab-ci-pipelines-exporter` Helm chart
-  Argo CD Declarative GitOps — manage deployment of personal projects
    - Install on main RKE cluster using the `argo-cd` Helm chart
-  Meshery Visual GitOps Platform — manage infrastructure visually and collaboratively
    - Install on K3s cluster using the `meshery` Helm chart, along with `meshery-istio` and `meshery-nighthawk` adapters
    - Connect to main RKE cluster, along with Prometheus and Grafana
-  Kubernetes Metacontroller — enable easy creation of custom controllers
    - Install on main RKE cluster using the `metacontroller` Helm chart
-  Ollama LLM Server with Ollama CLI — run LLMs on Kubernetes cluster
    - Install on an Intel GPU node using the `ollama` Helm chart and IPEX-LLM Ollama portable zip
-  Open WebUI AI Platform — extensible AI platform with Ollama integration and local RAG support
    - Install on main RKE cluster using the `open-webui` Helm chart
    - Replace the default Chroma vector DB with Qdrant — install using the `qdrant` Helm chart
-  Flowise Agentic Workflows — build AI agents using visual workflows
    - Install on main RKE cluster using the `flowise` Helm chart
-  OpenTelemetry Collector with Jaeger UI — telemetry collector agent and distributed tracing backend
    - Install on main RKE cluster using the OpenTelemetry Collector Helm chart
    - Install Jaeger using the Jaeger Helm chart
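The playbooks drive all of these installs through Ansible, but each one boils down to an ordinary Helm release. As a point of reference, a manual equivalent of the cert-manager install above might look like the following sketch (the repo URL is the upstream default; the actual values used by the playbooks will differ):

```bash
# add the upstream chart repo and install cert-manager into its own namespace
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set crds.enabled=true
```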
 
- Backstage Developer Portal — software catalog and developer portal
- NATS — high performance message queues (Kafka alternative) with JetStream for persistence
-  Migrate manually provisioned certificates and secrets to ones issued by `cert-manager` with auto-rotation
- Identify and upload additional sources of personal documents into Open WebUI knowledge base collections
- Automate creation of DNS records in pfSense via custom Ansible module that invokes pfSense REST APIs
The Ansible Vault password is stored in the macOS Keychain under item "Home-K8s" for account "ansible-vault".
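A small vault password client can read that Keychain item so playbook runs don't prompt for the password. A minimal sketch (the script name and how `ansible.cfg` references it are assumptions, not part of this repo's documented layout):

```bash
#!/usr/bin/env bash
# vault-pass.sh — print the Ansible Vault password stored in the macOS
# Keychain item "Home-K8s" for account "ansible-vault"
security find-generic-password -a ansible-vault -s Home-K8s -w
```

Pointing `vault_password_file` in `ansible.cfg` (or the `ANSIBLE_VAULT_PASSWORD_FILE` environment variable) at such a script avoids interactive vault prompts.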
```bash
export ANSIBLE_CONFIG="./ansible.cfg"
VAULTFILE="group_vars/all/vault.yml"

ansible-vault create $VAULTFILE
ansible-vault edit   $VAULTFILE
ansible-vault view   $VAULTFILE
```

Some variables stored in Ansible Vault (there are many more):
| Infrastructure Secrets | User Passwords | 
|---|---|
| sudo_pass.* | rancher_admin_pass | 
| icloud_smtp.* | minio_root_pass | 
| docker_access_token | minio_admin_pass | 
| github_access_token | velero_admin_pass | 
| age_secret_key | harbor_admin_pass | 
| metallb_secret | opensearch_admin_pass | 
| step_ca_provisioner_pass | keycloak_admin_pass | 
| minio_client_pass | thanos_admin_pass | 
| velero_repo_pass | grafana_admin_pass | 
| velero_passphrase | vault_admin_pass | 
| harbor_secret | gitlab_root_pass | 
| dashboards_os_pass | gitlab_user_pass | 
| fluent_os_pass | argocd_admin_pass | 
| valkey_pass | openwebui_admin_pass | 
| postgresql_pass | flowise_admin_pass | 
| keycloak_db_pass | |
| monitoring_pass | |
| monitoring_oidc_client_secret.* | |
| slack_webhook_url.* | |
| oauth2_proxy_cookie_secret | |
| kiali_oidc_client_secret | |
| gitlab_secrets_data.* | |
| gitlab_omniauth.* | |
| argocd_signing_key | |
| hass_access_token | |
| qdrant_api_key.* | |
| openwebui_secret_key | |
| pipelines_api_key | |
| flowise_encryption_key | |
| anthropic_api_key | |
| openai_api_key | |
| groq_api_key | |
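To spot-check a single vaulted value without opening the whole file, an ad-hoc `debug` call like this works (the variable name is one from the table above):

```bash
# decrypt vault.yml on the fly and print one variable
ansible localhost -m ansible.builtin.debug -a "var=rancher_admin_pass" \
  -e @group_vars/all/vault.yml --ask-vault-pass
```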
All managed hosts are running Ubuntu 24.04 with SSH key from https://github.com/erhhung.keys already authorized.
Ansible will authenticate as user erhhung using private key "~/.ssh/erhhung.pem";
however, all privileged operations using sudo will require the password stored in Vault.
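Before running the playbooks, connectivity and privilege escalation can be sanity-checked with ad-hoc commands (the `k8s_all` group name is taken from the ad-hoc examples later in this README):

```bash
# confirm SSH connectivity as erhhung on every managed host
ansible k8s_all -m ansible.builtin.ping

# confirm sudo works; -K prompts for the become password instead of reading it from Vault
ansible k8s_all -b -K -m ansible.builtin.command -a "id"
```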
1. Install required packages

   1.1. Tools — `lsof`, `jq`, `yq`, `git`, `helm`, etc.
   1.2. Drivers — NFS and Intel client GPU drivers
   1.3. Python — Ansible packages in virtual env
   1.4. Helm — plugins like `helm-diff`, `helm-git`
   1.5. Debugging — tools like `tcpdump`, `tshark`

   `./play.sh packages`
2. Configure system settings

   2.1. Host — host name, time zone, and locale
   2.2. Kernel — `sysctl` params and `pam_limits`
   2.3. Network — DNS servers and search domains
   2.4. Login — customize login MOTD messages
   2.5. Certs — add CA certificates to trust store

   `./play.sh basics`
3. Set up admin user's home directory

   3.1. Dot files: `.bash_aliases`, etc.
   3.2. Config files: `htop`, `fastfetch`

   `./play.sh files`
4. Install Rancher Server on single-node K3s cluster

   `./play.sh rancher`
5. Provision Kubernetes cluster with RKE on 6 nodes

   Install RKE2 with a single control plane node and 5 worker nodes, all permitting workloads,
   or RKE2 in HA mode with 3 control plane nodes and 3 worker nodes, all permitting workloads.
   The cluster will be accessible via a virtual IP address provisioned by `kube-vip` in HA mode.

   5.1. Deploy another NGINX ingress controller for SSL passthrough

   `./play.sh cluster`
6. Install MetalLB network load-balancer in BGP mode

   6.1. Create `BGPPeer`, `IPAddressPool`, and `BGPAdvertisement` CRs
        to complement FRR BGP configuration on pfSense, the local router

   `./play.sh metallb`
7. Install `cert-manager` to automate certificate issuing

   7.1. Connect to Step CA `pki.fourteeners.local` as a `StepClusterIssuer`

   `./play.sh certmanager`
8. Install Node Feature Discovery to identify GPU nodes

   8.1. Install Intel Device Plugins and `GpuDevicePlugin`

   `./play.sh nodefeatures`
9. Install Wave to monitor `ConfigMaps` and `Secrets`

   `./play.sh wave`
10. Install Longhorn dynamic PV provisioner
    Install MinIO object storage in HA mode
    Install Velero backup and restore tools

    10.1. Create a pool of LVM logical volumes
    10.2. Install Longhorn storage components
    10.3. Install NFS dynamic PV provisioner
    10.4. Install MinIO tenant using NFS PVs
    10.5. Create MinIO buckets, users, groups
    10.6. Install Velero using MinIO as target
    10.7. Install Velero Dashboard

    `./play.sh storage minio velero`
11. Create resources from manifest files

    IMPORTANT: Resource manifests must specify the namespaces they wish to be installed
    into because the playbook simply applies each one without targeting a specific namespace.

    `./play.sh manifests`
12. Install Harbor OCI & Helm registry

    12.1. Automatically populate Harbor with select images from external registries

    `./play.sh harbor`
13. Install OpenSearch cluster in HA mode

    13.1. Configure the OpenSearch security plugin (users and roles) for downstream applications
    13.2. Install OpenSearch Dashboards UI

    `./play.sh opensearch`
14. Install Fluent Bit to ingest logs into OpenSearch

    `./play.sh logging`
15. Install PostgreSQL database in HA mode

    15.1. Run initialization SQL script to create roles and databases for downstream applications
    15.2. Create users in both PostgreSQL and Pgpool

    `./play.sh postgresql`
16. Install Keycloak IAM & OIDC provider

    16.1. Bootstrap PostgreSQL database with realm `homelab`, user `erhhung`, and OIDC clients

    `./play.sh keycloak`
17. Install Valkey key-value store in HA mode

    17.1. Deploy 6 nodes in total: 3 primaries and 3 replicas

    `./play.sh valkey`
18. Install Prometheus, Thanos, and Grafana in HA mode

    18.1. Expose Prometheus & Alertmanager UIs via `oauth2-proxy` integration with Keycloak
    18.2. Connect Thanos sidecars to MinIO to store scraped metrics in the `metrics` bucket
    18.3. Deploy and integrate other Thanos components with Prometheus and Alertmanager

    `./play.sh monitoring thanos`
19. Install Istio service mesh in ambient mode

    `./play.sh istio`
20. Install HashiCorp Vault in HA mode
    Install External Secrets Operator

    20.1. Initialize Vault cluster and unseal cluster pods
    20.2. Create policies, `Userpass` accounts, k8s roles
    20.3. Create `KV` mounts and populate secrets data
    20.4. Create ESO's `ClusterSecretStore` for Vault

    `./play.sh vault externalsecrets`
21. Install GitLab EE CI/CD Platform to build local images

    21.1. Import Erhhung's SSH and GPG public keys, and create the `Homelab` group
    21.2. Configure Harbor and Slack integrations, and access GitHub using OmniAuth
    21.3. Configure and deploy Kubernetes runner for building images using `buildah`
    21.4. Use `al2023-devops` as the build container and load common pre-build script
    21.5. Deploy CI Pipelines Exporter to export metrics and visualize them in Grafana

    `./play.sh gitlab`
22. Install Argo CD GitOps delivery in HA mode

    22.1. Configure Argo CD to use Valkey for caching
    22.2. Configure GitLab as an allowed SCM provider

    `./play.sh argocd`
23. Install Metacontroller to create Operators

    `./play.sh metacontroller`
24. Install Qdrant vector database in HA mode

    `./play.sh qdrant`
25. Install Ollama LLM server with common models
    Install Open WebUI AI platform with Pipelines

    25.1. Create `Accounts` knowledge base, and then `Accounts` custom model that embeds that KB
    25.2. NOTE: Populate the `Accounts` KB by running `./play.sh openwebui -t knowledge` separately

    `./play.sh ollama openwebui`
26. Install Flowise AI platform and integrations

    Current deployment uses local images in Harbor registry that were built by GitLab CI.

    `./play.sh flowise`
27. Create virtual Kubernetes clusters in RKE

    `./play.sh vclusters`
Alternatively, run all playbooks automatically in order:
```bash
# pass options like -v and --step
./play.sh [ansible-playbook-opts]

# run all playbooks starting from "storage"
# ("storage" is a playbook tag in main.yml)
./play.sh storage-

# run all playbooks up to "wave" (inclusive)
./play.sh -wave
```

Output from play.sh will be logged in "ansible.log".
The default Bash shell for the VS Code integrated terminal has been configured to load a custom `.bash_profile` containing aliases for common Ansible-related commands, as well as functions `play` and `debug` with completions for the tags in playbooks `main.yml` and `debug.yml`, respectively.
Due to the dependency chain of the Prometheus monitoring stack (Keycloak and Valkey), the monitoring.yml playbook must be run after most other playbooks. At the same time, those dependent services also want to create ServiceMonitor resources that require the Prometheus Operator CRDs. Therefore, a second pass through all playbooks, starting with certmanager.yml, is required to enable metrics collection on those services.
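In practice, that second pass can reuse the same range syntax shown above, for example:

```bash
# second pass: re-run all playbooks from "certmanager" onward so the dependent
# services can create their ServiceMonitor resources
./play.sh certmanager-
```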
1. Shut down all/specific VMs

   `ansible-playbook shutdownvms.yml [-e targets={group|host|,...}]`

2. Create/revert/delete VM snapshots (see the example after this list)

   2.1. Create new snapshots

   ```bash
   ansible-playbook snapshotvms.yml [-e targets={group|host|,...}] \
     -e '{"desc":"text description"}'
   ```

   2.2. Revert to snapshots

   ```bash
   ansible-playbook snapshotvms.yml -e do=revert \
     [-e targets={group|host|,...}] \
     -e '{"desc":"text to search"}' \
     [-e '{"date":"YYYY-mm-dd prefix"}']
   ```

   2.3. Delete old snapshots

   ```bash
   ansible-playbook snapshotvms.yml -e do=delete \
     [-e targets={group|host|,...}] \
     -e '{"desc":"text to search"}' \
     -e '{"date":"YYYY-mm-dd prefix"}'
   ```

3. Start all/specific VMs

   `ansible-playbook startvms.yml [-e targets={group|host|,...}]`
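For example, snapshotting the whole RKE cluster before a risky upgrade might look like this (the `cluster` group name matches the ad-hoc examples later in this README):

```bash
# snapshot all RKE cluster nodes with a searchable description
ansible-playbook snapshotvms.yml -e targets=cluster \
  -e '{"desc":"before RKE2 upgrade"}'
```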
To expand the VM disk on a cluster node, the VM must be shut down first (attempting to resize the disk from Xen Orchestra while the VM is running will fail with error "VDI in use").
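The node can be shut down with the `shutdownvms.yml` playbook described above; for example (the host name is illustrative):

```bash
# shut down a single cluster node before expanding its disk in Xen Orchestra
ansible-playbook shutdownvms.yml -e targets=k8s1
```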
Once the VM disk has been expanded, restart the VM and SSH into the node to resize the partition and LV.
```bash
$ sudo su

# verify new size
$ lsblk /dev/xvda

# resize partition
$ parted /dev/xvda
) print
Warning: Not all of the space available to /dev/xvda appears to be used...
Fix/Ignore? Fix
) resizepart 3 100%
# confirm new size
) print
) quit

# sync with kernel
$ partprobe

# confirm new size
$ lsblk /dev/xvda3

# resize VG volume
$ pvresize /dev/xvda3
Physical volume "/dev/xvda3" changed
1 physical volume(s) resized...

# confirm new size
$ pvdisplay
# show LV volumes
$ lvdisplay

# set exact LV size (G=GiB)
$ lvextend -vrL 50G /dev/ubuntu-vg/ubuntu-lv
# or grow LV by percentage
$ lvextend -vrl +90%FREE /dev/ubuntu-vg/ubuntu-lv
Extending logical volume ubuntu-vg/ubuntu-lv to up to...
fsadm: Executing resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
The filesystem on /dev/mapper/ubuntu--vg-ubuntu--lv is now...
```

After expanding all desired disks, run `./diskfree.sh` to confirm available disk space on all cluster nodes.
```
rancher
-------
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   44G   16G   27G  38% /

k8s1
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   53G   22G   29G  44% /
/dev/mapper/ubuntu--vg-data--lv     60G  1.2G   59G   2% /data

k8s2
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   53G   18G   33G  36% /
/dev/mapper/ubuntu--vg-data--lv     60G  1.2G   59G   2% /data

k8s3
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   53G   23G   28G  46% /
/dev/mapper/ubuntu--vg-data--lv     60G  1.2G   59G   2% /data

k8s4
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   53G   20G   31G  39% /
/dev/mapper/ubuntu--vg-data--lv     60G  1.2G   59G   2% /data

k8s5
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   53G   16G   35G  32% /
/dev/mapper/ubuntu--vg-data--lv     60G  1.2G   59G   2% /data

k8s6
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   53G   23G   28G  46% /
/dev/mapper/ubuntu--vg-data--lv     60G  1.2G   59G   2% /data
```

Ansible's ad-hoc commands are useful in these scenarios.
- Restart Kubernetes cluster services on all nodes

  ```bash
  ansible rancher          -m ansible.builtin.service -b -a "name=k3s state=restarted"
  ansible control_plane_ha -m ansible.builtin.service -b -a "name=rke2-server state=restarted"
  ansible workers_ha       -m ansible.builtin.service -b -a "name=rke2-agent state=restarted"
  ```

  NOTE: remove the `_ha` suffix from the target hosts if the RKE cluster was deployed in non-HA mode.
- All kube-proxy static pods in continuous CrashLoopBackOff

  This turns out to be a Linux kernel bug in `linux-image-6.8.0-56-generic` and above
  (discovered on upgrade to `linux-image-6.8.0-57-generic`), causing this error in the
  container logs:

  ```
  ip6tables-restore v1.8.9 (nf_tables): unknown option "--xor-mark"
  ```

  Current workaround is to downgrade to an earlier kernel.

  ```bash
  # list installed kernel images
  ansible -v k8s_all -a 'bash -c "dpkg -l | grep linux-image"'

  # install working kernel image
  ansible -v k8s_all -b -a 'apt-get install -y linux-image-6.8.0-55-generic'

  # GRUB: use working kernel image
  ansible -v rancher -m ansible.builtin.shell -b -a '
    kernel="6.8.0-55-generic"
    dvuuid=$(blkid -s UUID -o value /dev/xvda2)
    menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
    sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
    grep GRUB_DEFAULT /etc/default/grub
  '
  ansible -v cluster -m ansible.builtin.shell -b -a '
    kernel="6.8.0-55-generic"
    dvuuid=$(blkid -s UUID -o value /dev/mapper/ubuntu--vg-ubuntu--lv)
    menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
    sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
    grep GRUB_DEFAULT /etc/default/grub
  '

  # update /boot/grub/grub.cfg
  ansible -v k8s_all -b -a 'update-grub'

  # reboot nodes, one at a time
  ansible -v k8s_all -m ansible.builtin.reboot -b -a "post_reboot_delay=120" -f 1

  # confirm working kernel image
  ansible -v k8s_all -a 'uname -r'

  # remove old backup kernels only
  # (keep latest non-working kernel
  # so upgrade won't install again)
  ansible -v k8s_all -b -a 'apt-get autoremove -y --purge'
  ```
- StatefulSet pod stuck on ContainerCreating due to MountDevice failed

  Pod lifecycle events show an error like:

  ```
  MountVolume.MountDevice failed for volume "pvc-4151d201-437b-4ceb-bbf6-c227ea49e285":
  kubernetes.io/csi: attacher.MountDevice failed to create dir "/var/lib/kubelet/plugins/kubernetes.io/
  csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount":
  mkdir /var/lib/kubelet/plugins/kubernetes.io/.../globalmount: file exists
  ```

  Problem is described by this GitHub issue, which may be caused by restarting the node while a
  Longhorn volume backup is in progress. An effective workaround is to unmount that volume.
  ```bash
  $ ssh k8s1
  $ mount | grep pvc-4151d201-437b-4ceb-bbf6-c227ea49e285
  /dev/longhorn/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount type xfs (rw,relatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)
  /dev/longhorn/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285 on /var/lib/kubelet/pods/06fc67d7-833f-4ecd-810f-77787fd703e6/volumes/kubernetes.io~csi/pvc-4151d201-437b-4ceb-bbf6-c227ea49e285/mount type xfs (rw,relatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)

  $ sudo umount /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/0bb8a8bc36ca16f14a425e5eaf35ed51af6096bf0302129a05394ce51393cecd/globalmount
  ```

  Or if pod events show an error like:

  ```
  Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/
  1508f1bfa1a751aaa24514b7576847e7f7ac042c6d8295a6d07417fb4e0068f1/globalmount:
  mount system call failed: Structure needs cleaning.
  ```

  Problem is likely caused by an abrupt node shutdown where the file system was not unmounted cleanly.
  An effective solution, albeit possibly with some data loss, is to repair that XFS volume.
  ```bash
  $ ssh k8s4
  $ mount | grep 1508f1bfa1a751aaa24514b7576847e7f7ac042c6d8295a6d07417fb4e0068f1
  /dev/longhorn/pvc-7bc42f2c-4bb6-42f4-ad31-a9fa27185103 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/1508f1bfa1a751aaa24514b7576847e7f7ac042c6d8295a6d07417fb4e0068f1/globalmount type xfs (rw,relatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota)

  $ sudo xfs_repair -L /dev/longhorn/pvc-7bc42f2c-4bb6-42f4-ad31-a9fa27185103
  Phase 1 - find and verify superblock...
  Phase 2 - using internal log
          - zero log...
  ALERT: The filesystem has valuable metadata changes in a log which is being
  destroyed because the -L option was used.
          - scan filesystem freespace and inode maps...
  clearing needsrepair flag and regenerating metadata
  sb_fdblocks 1709737, counted 1762490
          - found root inode chunk
  Phase 3 - for each AG...
          - scan and clear agi unlinked lists...
          - process known inodes and perform inode discovery...
          - agno = 0
          - agno = 1
          - agno = 2
          - agno = 3
          - process newly discovered inodes...
  Phase 4 - check for duplicate blocks...
          - setting up duplicate extent list...
  unknown block state, ag 1, blocks 555-1031
          - check for inodes claiming duplicate blocks...
          - agno = 1
          - agno = 2
          - agno = 0
  entry "thanos.shipper.json" in shortform directory 131 references free inode 137
  junking entry "thanos.shipper.json" in directory inode 131
          - agno = 3
  Phase 5 - rebuild AG headers and trees...
          - reset superblock...
  Phase 6 - check inode connectivity...
          - resetting contents of realtime bitmap and summary inodes
          - traversing filesystem ...
          - traversal finished ...
          - moving disconnected inodes to lost+found ...
  disconnected inode 134, moving to lost+found
  Phase 7 - verify and correct link counts...
  Maximum metadata LSN (6:55208) is ahead of log (1:8).
  Format log to cycle 9.
  done
  ```

  Then restart the pod (as sketched below), and it should run successfully.
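Restarting the pod simply means deleting it so its StatefulSet controller recreates it; the namespace and pod name below are placeholders:

```bash
# delete the stuck pod; the StatefulSet recreates it and the volume should now mount cleanly
kubectl -n <namespace> delete pod <pod-name>
kubectl -n <namespace> get pods -w
```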