[enterprise-4.20] cherry pick OADP#6294: Mod-work for the OADP Troubleshooting user story #96053

Merged
4 changes: 2 additions & 2 deletions _topic_maps/_topic_map.yml
@@ -3751,8 +3751,8 @@ Topics:
File: velero-cli-tool
- Name: Pods crash or restart due to lack of memory or CPU
File: pods-crash-or-restart-due-to-lack-of-memory-or-cpu
- Name: Issues with Velero and admission webhooks
File: issues-with-velero-and-admission-webhooks
- Name: Restoring workarounds for Velero backups that use admission webhooks
File: restoring-workarounds-for-velero-backups-that-use-admission-webhooks
- Name: OADP installation issues
File: oadp-installation-issues
- Name: OADP Operator issues
@@ -13,7 +13,6 @@ The default plugins enable Velero to integrate with certain cloud providers and
include::modules/oadp-features.adoc[leveloffset=+1]
include::modules/oadp-plugins.adoc[leveloffset=+1]
include::modules/oadp-configuring-velero-plugins.adoc[leveloffset=+1]
include::modules/oadp-plugins-receiving-eof-message.adoc[leveloffset=+2]
ifndef::openshift-rosa,openshift-rosa-hcp[]
include::modules/oadp-supported-architecture.adoc[leveloffset=+1]
endif::openshift-rosa,openshift-rosa-hcp[]
@@ -34,8 +33,9 @@ include::modules/oadp-ibm-z-test-support.adoc[leveloffset=+2]
include::modules/oadp-ibm-power-and-z-known-issues.adoc[leveloffset=+3]
endif::openshift-rosa,openshift-rosa-hcp[]

include::modules/oadp-features-plugins-known-issues.adoc[leveloffset=+1]

include::modules/oadp-fips.adoc[leveloffset=+1]

include::modules/avoiding-the-velero-plugin-panic-error.adoc[leveloffset=+1]
include::modules/workaround-for-openshift-adp-controller-segmentation-fault.adoc[leveloffset=+1]

:!oadp-features-plugins:
@@ -9,89 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[]

toc::[]

You might encounter these common issues with `Backup` and `Restore` custom resources (CRs).
You might encounter the following common issues with `Backup` and `Restore` custom resources (CRs):

[id="backup-cannot-retrieve-volume_{context}"]
== Backup CR cannot retrieve volume
* Backup CR cannot retrieve volume
* Backup CR status remains in progress
* Backup CR status remains in the `PartiallyFailed` phase

The `Backup` CR displays the following error message: `InvalidVolume.NotFound: The volume ‘vol-xxxx’ does not exist`.
include::modules/troubleshooting-backup-cr-cannot-retrieve-volume-issue.adoc[leveloffset=+1]

.Cause
include::modules/troubleshooting-backup-cr-status-remains-in-progress-issue.adoc[leveloffset=+1]

The persistent volume (PV) and the snapshot locations are in different regions.

.Solution

. Edit the value of the `spec.snapshotLocations.velero.config.region` key in the `DataProtectionApplication` manifest so that the snapshot location is in the same region as the PV, as shown in the example after this procedure.
. Create a new `Backup` CR.
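
For example, a minimal sketch of the relevant `snapshotLocations` section follows. The `aws` provider and `us-east-1` region are illustrative placeholders; set the region to the one that contains the PV:

[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  snapshotLocations:
    - velero:
        provider: aws
        config:
          region: us-east-1
# ...
----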

[id="backup-cr-remains-in-progress_{context}"]
== Backup CR status remains in progress

The status of a `Backup` CR remains in the `InProgress` phase and does not complete.

.Cause

If a backup is interrupted, it cannot be resumed.

.Solution

. Retrieve the details of the `Backup` CR by running the following command:
+
[source,terminal]
----
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
backup describe <backup>
----

. Delete the `Backup` CR by running the following command:
+
[source,terminal]
----
$ oc delete backups.velero.io <backup> -n openshift-adp
----
+
You do not need to clean up the backup location because an in-progress `Backup` CR has not uploaded files to object storage.

. Create a new `Backup` CR.

. View the Velero backup details by running the following command:
+
[source,terminal, subs="+quotes"]
----
$ velero backup describe _<backup-name>_ --details
----

[id="backup-cr-remains-partiallyfailed_{context}"]
== Backup CR status remains in PartiallyFailed

The status of a `Backup` CR that does not use Restic remains in the `PartiallyFailed` phase and does not complete. A snapshot of the associated PVC is not created.

.Cause

If the backup is created by using a CSI snapshot class that is missing the `velero.io/csi-volumesnapshot-class` label, the CSI snapshot plugin fails to create a snapshot. As a result, the `Velero` pod logs an error similar to the following message:

[source,text]
----
time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
----

.Solution

. Delete the `Backup` CR by running the following command:
+
[source,terminal]
----
$ oc delete backups.velero.io <backup> -n openshift-adp
----

. If required, clean up the stored data on the `BackupStorageLocation` to free up space.

. Apply the label `velero.io/csi-volumesnapshot-class=true` to the `VolumeSnapshotClass` object by running the following command:
+
[source,terminal]
----
$ oc label volumesnapshotclass/<snapclass_name> velero.io/csi-volumesnapshot-class=true
----

. Create a new `Backup` CR.
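
For reference, a `VolumeSnapshotClass` object that carries the required label might look like the following sketch. The class name, driver, and deletion policy are illustrative:

[source,yaml]
----
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: <snapclass_name>
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: openshift-storage.rbd.csi.ceph.com
deletionPolicy: Retain
----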
include::modules/troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue.adoc[leveloffset=+1]
@@ -9,41 +9,7 @@ include::_attributes/attributes-openshift-dedicated.adoc[]

toc::[]

You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application.
You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application (DPA).

[id="oadp-backup-location-contains-invalid-directories_{context}"]
== Backup storage contains invalid directories

The `Velero` pod log displays the following error message: `Backup storage contains invalid top-level directories`.

.Cause

The object storage contains top-level directories that are not Velero directories.

.Solution

If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the `spec.backupLocations.velero.objectStorage.prefix` parameter in the `DataProtectionApplication` manifest.
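
For example, a minimal sketch of the relevant backup location section follows. The `aws` provider and the `velero` prefix value are illustrative placeholders:

[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  backupLocations:
    - velero:
        provider: aws
        objectStorage:
          bucket: <bucket_name>
          prefix: velero
# ...
----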

[id="oadp-incorrect-aws-credentials_{context}"]
== Incorrect AWS credentials

The `oadp-aws-registry` pod log displays the following error message: `InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.`

The `Velero` pod log displays the following error message: `NoCredentialProviders: no valid providers in chain`.

.Cause

The `credentials-velero` file used to create the `Secret` object is incorrectly formatted.

.Solution

Ensure that the `credentials-velero` file is correctly formatted, as in the following example:

.Example `credentials-velero` file
----
[default] <1>
aws_access_key_id=AKIAIOSFODNN7EXAMPLE <2>
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
----
<1> AWS default profile.
<2> Do not enclose the values with quotation marks (`"`, `'`).
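
After you correct the `credentials-velero` file, recreate the `Secret` object from it. The following sketch assumes the default `cloud-credentials` secret name and the `openshift-adp` namespace:

[source,terminal]
----
$ oc delete secret cloud-credentials -n openshift-adp
$ oc create secret generic cloud-credentials -n openshift-adp \
    --from-file cloud=credentials-velero
----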
include::modules/resolving-backup-storage-contains-invalid-directories-issue.adoc[leveloffset=+1]
include::modules/resolving-incorrect-aws-credentials-issue.adoc[leveloffset=+1]
@@ -11,83 +11,4 @@ toc::[]

The {oadp-first} Operator might encounter issues caused by problems it is not able to resolve.

[id="oadp-operator-fails-silently_{context}"]
== OADP Operator fails silently

The S3 buckets of an OADP Operator might be empty, but when you run the command `oc get po -n <oadp_operator_namespace>`, you see that the Operator has a status of `Running`. In such a case, the Operator is said to have _failed silently_ because it incorrectly reports that it is running.

.Cause

This problem occurs when the cloud credentials have insufficient permissions.

.Solution

Retrieve a list of backup storage locations (BSLs) and check the manifest of each BSL for credential issues.

.Procedure

. Retrieve a list of BSLs by using either the OpenShift or Velero command-line interface (CLI):
.. Retrieve a list of BSLs by using the OpenShift CLI (`oc`):
+
[source,terminal]
----
$ oc get backupstoragelocations.velero.io -A
----
.. Retrieve a list of BSLs by using the `velero` CLI:
+
[source,terminal]
----
$ velero backup-location get -n <oadp_operator_namespace>
----

. Use the list of BSLs from the previous step and run the following command to examine the manifest of each BSL for an error:
+
[source,terminal]
----
$ oc get backupstoragelocations.velero.io -n <namespace> -o yaml
----
+
.Example result
[source,yaml]
----
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: BackupStorageLocation
  metadata:
    creationTimestamp: "2023-11-03T19:49:04Z"
    generation: 9703
    name: example-dpa-1
    namespace: openshift-adp-operator
    ownerReferences:
    - apiVersion: oadp.openshift.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: DataProtectionApplication
      name: example-dpa
      uid: 0beeeaff-0287-4f32-bcb1-2e3c921b6e82
    resourceVersion: "24273698"
    uid: ba37cd15-cf17-4f7d-bf03-8af8655cea83
  spec:
    config:
      enableSharedConfig: "true"
      region: us-west-2
    credential:
      key: credentials
      name: cloud-credentials
    default: true
    objectStorage:
      bucket: example-oadp-operator
      prefix: example
    provider: aws
  status:
    lastValidationTime: "2023-11-10T22:06:46Z"
    message: "BackupStorageLocation \"example-dpa-1\" is unavailable: rpc
      error: code = Unknown desc = WebIdentityErr: failed to retrieve credentials\ncaused
      by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus
      code: 403, request id: d3f2e099-70a0-467b-997e-ff62345e3b54"
    phase: Unavailable
kind: List
metadata:
  resourceVersion: ""
----
include::modules/resolving-oadp-operator-fails-silently-issue.adoc[leveloffset=+1]
@@ -11,9 +11,8 @@ include::_attributes/attributes-openshift-dedicated.adoc[]

toc::[]

If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources.
If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources. The values for the resource request fields must follow the same format as Kubernetes resource requirements.

The values for the resource request fields must follow the same format as Kubernetes resource requirements.
If you do not specify `configuration.velero.podConfig.resourceAllocations` or `configuration.restic.podConfig.resourceAllocations`, see the following default `resources` specification configuration for a Velero or Restic pod:

[source,yaml]
Expand Down
@@ -9,82 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[]

toc::[]

You might encounter these issues when you back up applications with Restic.
You might encounter the following issues when you back up applications with Restic:

[id="restic-permission-error-nfs-root-squash-enabled_{context}"]
== Restic permission error for NFS data volumes with root_squash enabled
* Restic permission error for NFS data volumes with `root_squash` enabled
* Restic `Backup` CR cannot be recreated after bucket is emptied
* Restic restore partially failing on {product-title} 4.14 due to changed pod security admission (PSA) policy

The `Restic` pod log displays the following error message: `controller=pod-volume-backup error="fork/exec /usr/bin/restic: permission denied"`.
include::modules/restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled.adoc[leveloffset=+1]

.Cause

If your NFS data volumes have `root_squash` enabled, `Restic` maps to `nfsnobody` and does not have permission to create backups.

.Solution

You can resolve this issue by creating a supplemental group for `Restic` and adding the group ID to the `DataProtectionApplication` manifest:

. Create a supplemental group for `Restic` on the NFS data volume.
. Set the `setgid` bit on the NFS directories so that group ownership is inherited (see the sketch after this procedure).
. Add the `spec.configuration.nodeAgent.supplementalGroups` parameter and the group ID to the `DataProtectionApplication` manifest, as shown in the following example:
+
[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: restic
      supplementalGroups:
      - <group_id> <1>
# ...
----
<1> Specify the supplemental group ID.

. Wait for the `Restic` pods to restart so that the changes are applied.
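
The following is a minimal sketch of the NFS-side preparation described in the first two steps, assuming the export is mounted at `/exports/velero` and that `6789` is the supplemental group ID chosen for illustration:

[source,terminal]
----
# groupadd -g 6789 velero-backup   <1>
# chgrp -R 6789 /exports/velero    <2>
# chmod -R g+rwX /exports/velero   <3>
# chmod g+s /exports/velero        <4>
----
<1> Create the supplemental group on the NFS server.
<2> Give the group ownership of the exported directory.
<3> Grant the group read and write access.
<4> Set the `setgid` bit so that new files and directories inherit the group.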

[id="restic-backup-cannot-be-recreated-after-s3-bucket-emptied_{context}"]
== Restic Backup CR cannot be recreated after bucket is emptied

If you create a Restic `Backup` CR for a namespace, empty the object storage bucket, and then recreate the `Backup` CR for the same namespace, the recreated `Backup` CR fails.

The `velero` pod log displays the following error message: `stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?`.

.Cause

Velero does not recreate or update the Restic repository from the `ResticRepository` manifest if the Restic directories are deleted from object storage. See link:https://github.com/vmware-tanzu/velero/issues/4421[Velero issue 4421] for more information.

.Solution

* Remove the related Restic repository from the namespace by running the following command:
+
[source,terminal]
----
$ oc delete resticrepository <name_of_the_restic_repository> -n openshift-adp
----
+
In the following error log, `mysql-persistent` is the problematic Restic repository. The name of the repository appears in italics for clarity.
+
[source,text,options="nowrap",subs="+quotes,verbatim"]
----
time="2021-12-29T18:29:14Z" level=info msg="1 errors
encountered backup up item" backup=velero/backup65
logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
backup=velero/backup65 error="pod volume backup failed: error running
restic backup, stderr=Fatal: unable to open config file: Stat: The
specified key does not exist.\nIs there a repository at the following
location?\ns3:http://minio-minio.apps.mayap-oadp-
veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
restic/_mysql-persistent_\n: exit status 1" error.file="/remote-source/
src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
error.function="github.com/vmware-tanzu/velero/
pkg/restic.(*backupper).BackupPodVolumes"
logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds
----
include::modules/restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied.adoc[leveloffset=+1]

include::modules/oadp-restic-restore-failing-psa-policy.adoc[leveloffset=+1]