
Commit 656dc79

apurvabhide17 authored and anarnold97 committed
OADP-6294: Mod-work for the OADP Troubleshooting user story
1 parent baea5be commit 656dc79


43 files changed: 583 additions, 543 deletions


_topic_maps/_topic_map.yml

Lines changed: 2 additions & 2 deletions
@@ -3751,8 +3751,8 @@ Topics:
     File: velero-cli-tool
   - Name: Pods crash or restart due to lack of memory or CPU
     File: pods-crash-or-restart-due-to-lack-of-memory-or-cpu
-  - Name: Issues with Velero and admission webhooks
-    File: issues-with-velero-and-admission-webhooks
+  - Name: Restoring workarounds for Velero backups that use admission webhooks
+    File: restoring-workarounds-for-velero-backups-that-use-admission-webhooks
   - Name: OADP installation issues
     File: oadp-installation-issues
   - Name: OADP Operator issues

backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc

Lines changed: 3 additions & 3 deletions
@@ -13,7 +13,6 @@ The default plugins enable Velero to integrate with certain cloud providers and
 include::modules/oadp-features.adoc[leveloffset=+1]
 include::modules/oadp-plugins.adoc[leveloffset=+1]
 include::modules/oadp-configuring-velero-plugins.adoc[leveloffset=+1]
-include::modules/oadp-plugins-receiving-eof-message.adoc[leveloffset=+2]
 ifndef::openshift-rosa,openshift-rosa-hcp[]
 include::modules/oadp-supported-architecture.adoc[leveloffset=+1]
 endif::openshift-rosa,openshift-rosa-hcp[]
@@ -34,8 +33,9 @@ include::modules/oadp-ibm-z-test-support.adoc[leveloffset=+2]
 include::modules/oadp-ibm-power-and-z-known-issues.adoc[leveloffset=+3]
 endif::openshift-rosa,openshift-rosa-hcp[]
 
-include::modules/oadp-features-plugins-known-issues.adoc[leveloffset=+1]
-
 include::modules/oadp-fips.adoc[leveloffset=+1]
 
+include::modules/avoiding-the-velero-plugin-panic-error.adoc[leveloffset=+1]
+
+include::modules/workaround-for-openshift-adp-controller-segmentation-fault.adoc[leveloffset=+1]
+
 :!oadp-features-plugins:

backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc

Lines changed: 7 additions & 82 deletions
@@ -9,89 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[]
 
 toc::[]
 
-You might encounter these common issues with `Backup` and `Restore` custom resources (CRs).
+You might encounter the following common issues with `Backup` and `Restore` custom resources (CRs):
 
-[id="backup-cannot-retrieve-volume_{context}"]
-== Backup CR cannot retrieve volume
+* Backup CR cannot retrieve volume
+* Backup CR status remains in progress
+* Backup CR status remains in the `PartiallyFailed` phase
 
-The `Backup` CR displays the following error message: `InvalidVolume.NotFound: The volume ‘vol-xxxx’ does not exist`.
+include::modules/troubleshooting-backup-cr-cannot-retrieve-volume-issue.adoc[leveloffset=+1]
 
-.Cause
+include::modules/troubleshooting-backup-cr-status-remains-in-progress-issue.adoc[leveloffset=+1]
 
-The persistent volume (PV) and the snapshot locations are in different regions.
-
-.Solution
-
-. Edit the value of the `spec.snapshotLocations.velero.config.region` key in the `DataProtectionApplication` manifest so that the snapshot location is in the same region as the PV.
-. Create a new `Backup` CR.
-
-[id="backup-cr-remains-in-progress_{context}"]
-== Backup CR status remains in progress
-
-The status of a `Backup` CR remains in the `InProgress` phase and does not complete.
-
-.Cause
-
-If a backup is interrupted, it cannot be resumed.
-
-.Solution
-
-. Retrieve the details of the `Backup` CR by running the following command:
-+
-[source,terminal]
-----
-$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
-backup describe <backup>
-----
-
-. Delete the `Backup` CR by running the following command:
-+
-[source,terminal]
-----
-$ oc delete backups.velero.io <backup> -n openshift-adp
-----
-+
-You do not need to clean up the backup location because an in-progress `Backup` CR has not uploaded files to object storage.
-
-. Create a new `Backup` CR.
-
-. View the Velero backup details by running the following command:
-+
-[source,terminal,subs="+quotes"]
-----
-$ velero backup describe _<backup-name>_ --details
-----
-
-[id="backup-cr-remains-partiallyfailed_{context}"]
-== Backup CR status remains in PartiallyFailed
-
-The status of a `Backup` CR without Restic in use remains in the `PartiallyFailed` phase and is not completed. A snapshot of the affiliated PVC is not created.
-
-.Cause
-
-If the backup created based on the CSI snapshot class is missing a label, the CSI snapshot plugin fails to create a snapshot. As a result, the `Velero` pod logs an error similar to the following message:
-
-[source,text]
-----
-time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
-----
-
-.Solution
-
-. Delete the `Backup` CR by running the following command:
-+
-[source,terminal]
-----
-$ oc delete backups.velero.io <backup> -n openshift-adp
-----
-
-. If required, clean up the stored data on the `BackupStorageLocation` to free up space.
-
-. Apply the label `velero.io/csi-volumesnapshot-class=true` to the `VolumeSnapshotClass` object by running the following command:
-+
-[source,terminal]
-----
-$ oc label volumesnapshotclass/<snapclass_name> velero.io/csi-volumesnapshot-class=true
-----
-
-. Create a new `Backup` CR.
+include::modules/troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue.adoc[leveloffset=+1]
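
For reference, the region fix described in the removed section above is made under the `spec.snapshotLocations.velero.config.region` key of the `DataProtectionApplication` manifest. The following is a minimal sketch, assuming the AWS provider; the values are illustrative and not part of this commit:

[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  snapshotLocations:
  - velero:
      provider: aws
      config:
        profile: default
        region: us-east-1 # must match the region of the PVs being backed up
# ...
----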

backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc

Lines changed: 3 additions & 37 deletions
@@ -9,41 +9,7 @@ include::_attributes/attributes-openshift-dedicated.adoc[]
 
 toc::[]
 
-You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application.
+You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application (DPA).
 
-[id="oadp-backup-location-contains-invalid-directories_{context}"]
-== Backup storage contains invalid directories
-
-The `Velero` pod log displays the following error message: `Backup storage contains invalid top-level directories`.
-
-.Cause
-
-The object storage contains top-level directories that are not Velero directories.
-
-.Solution
-
-If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the `spec.backupLocations.velero.objectStorage.prefix` parameter in the `DataProtectionApplication` manifest.
-
-[id="oadp-incorrect-aws-credentials_{context}"]
-== Incorrect AWS credentials
-
-The `oadp-aws-registry` pod log displays the following error message: `InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.`
-
-The `Velero` pod log displays the following error message: `NoCredentialProviders: no valid providers in chain`.
-
-.Cause
-
-The `credentials-velero` file used to create the `Secret` object is incorrectly formatted.
-
-.Solution
-
-Ensure that the `credentials-velero` file is correctly formatted, as in the following example:
-
-.Example `credentials-velero` file
-----
-[default] <1>
-aws_access_key_id=AKIAIOSFODNN7EXAMPLE <2>
-aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-----
-<1> AWS default profile.
-<2> Do not enclose the values with quotation marks (`"`, `'`).
+include::modules/resolving-backup-storage-contains-invalid-directories-issue.adoc[leveloffset=+1]
+include::modules/resolving-incorrect-aws-credentials-issue.adoc[leveloffset=+1]
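
For reference, the bucket prefix described in the removed section is set under `spec.backupLocations.velero.objectStorage.prefix` in the `DataProtectionApplication` manifest. A minimal sketch, assuming the AWS provider and placeholder names:

[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  backupLocations:
  - velero:
      provider: aws
      default: true
      objectStorage:
        bucket: <bucket_name>
        prefix: velero # keeps Velero data under its own top-level directory in a shared bucket
      config:
        region: us-east-1
        profile: default
# ...
----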

backup_and_restore/application_backup_and_restore/troubleshooting/oadp-operator-issues.adoc

Lines changed: 1 addition & 80 deletions
@@ -11,83 +11,4 @@ toc::[]
 
 The {oadp-first} Operator might encounter issues caused by problems it is not able to resolve.
 
-[id="oadp-operator-fails-silently_{context}"]
-== OADP Operator fails silently
-
-The S3 buckets of an OADP Operator might be empty, but when you run the command `oc get po -n <oadp_operator_namespace>`, you see that the Operator has a status of `Running`. In such a case, the Operator is said to have _failed silently_ because it incorrectly reports that it is running.
-
-.Cause
-
-The problem is caused when cloud credentials provide insufficient permissions.
-
-.Solution
-
-Retrieve a list of backup storage locations (BSLs) and check the manifest of each BSL for credential issues.
-
-.Procedure
-
-. Retrieve a list of BSLs by using either the OpenShift or Velero command-line interface (CLI):
-.. Retrieve a list of BSLs by using the OpenShift CLI (`oc`):
-+
-[source,terminal]
-----
-$ oc get backupstoragelocations.velero.io -A
-----
-.. Retrieve a list of BSLs by using the `velero` CLI:
-+
-[source,terminal]
-----
-$ velero backup-location get -n <oadp_operator_namespace>
-----
-
-. Use the list of BSLs from the previous step and run the following command to examine the manifest of each BSL for an error:
-+
-[source,terminal]
-----
-$ oc get backupstoragelocations.velero.io -n <namespace> -o yaml
-----
-+
-.Example result
-[source,yaml]
-----
-apiVersion: v1
-items:
-- apiVersion: velero.io/v1
-  kind: BackupStorageLocation
-  metadata:
-    creationTimestamp: "2023-11-03T19:49:04Z"
-    generation: 9703
-    name: example-dpa-1
-    namespace: openshift-adp-operator
-    ownerReferences:
-    - apiVersion: oadp.openshift.io/v1alpha1
-      blockOwnerDeletion: true
-      controller: true
-      kind: DataProtectionApplication
-      name: example-dpa
-      uid: 0beeeaff-0287-4f32-bcb1-2e3c921b6e82
-    resourceVersion: "24273698"
-    uid: ba37cd15-cf17-4f7d-bf03-8af8655cea83
-  spec:
-    config:
-      enableSharedConfig: "true"
-      region: us-west-2
-    credential:
-      key: credentials
-      name: cloud-credentials
-    default: true
-    objectStorage:
-      bucket: example-oadp-operator
-      prefix: example
-    provider: aws
-  status:
-    lastValidationTime: "2023-11-10T22:06:46Z"
-    message: "BackupStorageLocation \"example-dpa-1\" is unavailable: rpc
-      error: code = Unknown desc = WebIdentityErr: failed to retrieve credentials\ncaused
-      by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus
-      code: 403, request id: d3f2e099-70a0-467b-997e-ff62345e3b54"
-    phase: Unavailable
-kind: List
-metadata:
-  resourceVersion: ""
-----
+include::modules/resolving-oadp-operator-fails-silently-issue.adoc[leveloffset=+1]
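
For comparison with the `Unavailable` status shown in the removed example, a BSL whose credentials validate successfully reports the `Available` phase. The following is an assumed sketch of the relevant status fields, with an illustrative timestamp, not output from this commit:

[source,yaml]
----
status:
  lastValidationTime: "2023-11-10T22:06:46Z" # illustrative timestamp
  phase: Available # credentials validated; no error message is reported
----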

backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc

Lines changed: 1 addition & 2 deletions
@@ -11,9 +11,8 @@ include::_attributes/attributes-openshift-dedicated.adoc[]
 
 toc::[]
 
-If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources.
+If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources. The values for the resource request fields must follow the same format as Kubernetes resource requirements.
 
-The values for the resource request fields must follow the same format as Kubernetes resource requirements.
 If you do not specify `configuration.velero.podConfig.resourceAllocations` or `configuration.restic.podConfig.resourceAllocations`, see the following default `resources` specification configuration for a Velero or Restic pod:
 
 [source,yaml]
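
For reference, the `configuration.velero.podConfig.resourceAllocations` and `configuration.restic.podConfig.resourceAllocations` fields named in the context above take standard Kubernetes resource quantities. A minimal sketch with illustrative request values, not values from this commit:

[source,yaml]
----
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  configuration:
    velero:
      podConfig:
        resourceAllocations:
          requests:
            cpu: 500m # Kubernetes CPU quantity, illustrative
            memory: 256Mi # Kubernetes memory quantity, illustrative
    restic:
      podConfig:
        resourceAllocations:
          requests:
            cpu: 500m
            memory: 256Mi
# ...
----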

backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc

Lines changed: 6 additions & 74 deletions
@@ -9,82 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[]
 
 toc::[]
 
-You might encounter these issues when you back up applications with Restic.
+You might encounter the following issues when you back up applications with Restic:
 
-[id="restic-permission-error-nfs-root-squash-enabled_{context}"]
-== Restic permission error for NFS data volumes with root_squash enabled
+* Restic permission error for NFS data volumes with the `root_squash` parameter enabled
+* Restic `Backup` CR cannot be recreated after bucket is emptied
+* Restic restore partially failing on {product-title} 4.14 due to changed pod security admission (PSA) policy
 
-The `Restic` pod log displays the following error message: `controller=pod-volume-backup error="fork/exec/usr/bin/restic: permission denied"`.
+include::modules/restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled.adoc[leveloffset=+1]
 
-.Cause
-
-If your NFS data volumes have `root_squash` enabled, `Restic` maps to `nfsnobody` and does not have permission to create backups.
-
-.Solution
-
-You can resolve this issue by creating a supplemental group for `Restic` and adding the group ID to the `DataProtectionApplication` manifest:
-
-. Create a supplemental group for `Restic` on the NFS data volume.
-. Set the `setgid` bit on the NFS directories so that group ownership is inherited.
-. Add the `spec.configuration.nodeAgent.supplementalGroups` parameter and the group ID to the `DataProtectionApplication` manifest, as shown in the following example:
-+
-[source,yaml]
-----
-apiVersion: oadp.openshift.io/v1alpha1
-kind: DataProtectionApplication
-# ...
-spec:
-  configuration:
-    nodeAgent:
-      enable: true
-      uploaderType: restic
-      supplementalGroups:
-      - <group_id> <1>
-# ...
-----
-<1> Specify the supplemental group ID.
-
-. Wait for the `Restic` pods to restart so that the changes are applied.
-
-[id="restic-backup-cannot-be-recreated-after-s3-bucket-emptied_{context}"]
-== Restic Backup CR cannot be recreated after bucket is emptied
-
-If you create a Restic `Backup` CR for a namespace, empty the object storage bucket, and then recreate the `Backup` CR for the same namespace, the recreated `Backup` CR fails.
-
-The `velero` pod log displays the following error message: `stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?`.
-
-.Cause
-
-Velero does not recreate or update the Restic repository from the `ResticRepository` manifest if the Restic directories are deleted from object storage. See link:https://github.com/vmware-tanzu/velero/issues/4421[Velero issue 4421] for more information.
-
-.Solution
-
-* Remove the related Restic repository from the namespace by running the following command:
-+
-[source,terminal]
-----
-$ oc delete resticrepository -n openshift-adp <name_of_the_restic_repository>
-----
-+
-In the following error log, `mysql-persistent` is the problematic Restic repository. The name of the repository appears in italics for clarity.
-+
-[source,text,options="nowrap",subs="+quotes,verbatim"]
-----
-time="2021-12-29T18:29:14Z" level=info msg="1 errors
-encountered backup up item" backup=velero/backup65
-logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
-time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
-backup=velero/backup65 error="pod volume backup failed: error running
-restic backup, stderr=Fatal: unable to open config file: Stat: The
-specified key does not exist.\nIs there a repository at the following
-location?\ns3:http://minio-minio.apps.mayap-oadp-
-veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
-restic/_mysql-persistent_\n: exit status 1" error.file="/remote-source/
-src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
-error.function="github.com/vmware-tanzu/velero/
-pkg/restic.(*backupper).BackupPodVolumes"
-logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds
-----
+include::modules/restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied.adoc[leveloffset=+1]
 
 include::modules/oadp-restic-restore-failing-psa-policy.adoc[leveloffset=+1]
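
For context on the `oc delete resticrepository` step in the removed section, the following is a sketch of a `ResticRepository` CR assumed from Velero's v1 API; the field values are placeholders, not content from this commit:

[source,yaml]
----
apiVersion: velero.io/v1
kind: ResticRepository
metadata:
  name: <restic_repository_name>
  namespace: openshift-adp
spec:
  backupStorageLocation: <bsl_name> # BSL that holds the Restic repository
  volumeNamespace: <application_namespace> # namespace whose pod volumes were backed up
  resticIdentifier: <repository_url> # object storage URL of the Restic repository
  maintenanceFrequency: 168h
----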
