diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml index 4c1558016579..203a0da642a9 100644 --- a/_topic_maps/_topic_map.yml +++ b/_topic_maps/_topic_map.yml @@ -3751,8 +3751,8 @@ Topics: File: velero-cli-tool - Name: Pods crash or restart due to lack of memory or CPU File: pods-crash-or-restart-due-to-lack-of-memory-or-cpu - - Name: Issues with Velero and admission webhooks - File: issues-with-velero-and-admission-webhooks + - Name: Restoring workarounds for Velero backups that use admission webhooks + File: restoring-workarounds-for-velero-backups-that-use-admission-webhooks - Name: OADP installation issues File: oadp-installation-issues - Name: OADP Operator issues diff --git a/backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc b/backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc index 1f03261bbe84..8a18b300ec59 100644 --- a/backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc +++ b/backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc @@ -13,7 +13,6 @@ The default plugins enable Velero to integrate with certain cloud providers and include::modules/oadp-features.adoc[leveloffset=+1] include::modules/oadp-plugins.adoc[leveloffset=+1] include::modules/oadp-configuring-velero-plugins.adoc[leveloffset=+1] -include::modules/oadp-plugins-receiving-eof-message.adoc[leveloffset=+2] ifndef::openshift-rosa,openshift-rosa-hcp[] include::modules/oadp-supported-architecture.adoc[leveloffset=+1] endif::openshift-rosa,openshift-rosa-hcp[] @@ -34,8 +33,9 @@ include::modules/oadp-ibm-z-test-support.adoc[leveloffset=+2] include::modules/oadp-ibm-power-and-z-known-issues.adoc[leveloffset=+3] endif::openshift-rosa,openshift-rosa-hcp[] -include::modules/oadp-features-plugins-known-issues.adoc[leveloffset=+1] - include::modules/oadp-fips.adoc[leveloffset=+1] +include::modules/avoiding-the-velero-plugin-panic-error.adoc[leveloffset=+1] +include::modules/workaround-for-openshift-adp-controller-segmentation-fault.adoc[leveloffset=+1] + :!oadp-features-plugins: diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc index a146acb1d8b4..fa09dcdd5398 100644 --- a/backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc +++ b/backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc @@ -9,89 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[] toc::[] -You might encounter these common issues with `Backup` and `Restore` custom resources (CRs). +You might encounter the following common issues with `Backup` and `Restore` custom resources (CRs): -[id="backup-cannot-retrieve-volume_{context}"] -== Backup CR cannot retrieve volume +* Backup CR cannot retrieve volume +* Backup CR status remains in progress +* Backup CR status remains in the `PartiallyFailed` phase/state/etc -The `Backup` CR displays the following error message: `InvalidVolume.NotFound: The volume ‘vol-xxxx’ does not exist`. +include::modules/troubleshooting-backup-cr-cannot-retrieve-volume-issue.adoc[leveloffset=+1] -.Cause +include::modules/troubleshooting-backup-cr-status-remains-in-progress-issue.adoc[leveloffset=+1] -The persistent volume (PV) and the snapshot locations are in different regions. - -.Solution - -. 
Edit the value of the `spec.snapshotLocations.velero.config.region` key in the `DataProtectionApplication` manifest so that the snapshot location is in the same region as the PV. -. Create a new `Backup` CR. - -[id="backup-cr-remains-in-progress_{context}"] -== Backup CR status remains in progress - -The status of a `Backup` CR remains in the `InProgress` phase and does not complete. - -.Cause - -If a backup is interrupted, it cannot be resumed. - -.Solution - -. Retrieve the details of the `Backup` CR by running the following command: -+ -[source,terminal] ----- -$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ - backup describe ----- - -. Delete the `Backup` CR by running the following command: -+ -[source,terminal] ----- -$ oc delete backups.velero.io -n openshift-adp ----- -+ -You do not need to clean up the backup location because an in progress `Backup` CR has not uploaded files to object storage. - -. Create a new `Backup` CR. - -. View the Velero backup details by running the following command: -+ -[source,terminal, subs="+quotes"] ----- -$ velero backup describe __ --details ----- - -[id="backup-cr-remains-partiallyfailed_{context}"] -== Backup CR status remains in PartiallyFailed - -The status of a `Backup` CR without Restic in use remains in the `PartiallyFailed` phase and is not completed. A snapshot of the affiliated PVC is not created. - -.Cause - -If the backup created based on the CSI snapshot class is missing a label, the CSI snapshot plugin fails to create a snapshot. As a result, the `Velero` pod logs an error similar to the following message: - -[source,text] ----- -time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq ----- - -.Solution - -. Delete the `Backup` CR by running the following command:: -+ -[source,terminal] ----- -$ oc delete backups.velero.io -n openshift-adp ----- - -. If required, clean up the stored data on the `BackupStorageLocation` to free up space. - -. Apply the label `velero.io/csi-volumesnapshot-class=true` to the `VolumeSnapshotClass` object by running the following command: -+ -[source,terminal] ----- -$ oc label volumesnapshotclass/ velero.io/csi-volumesnapshot-class=true ----- - -. Create a new `Backup` CR. 
\ No newline at end of file +include::modules/troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue.adoc[leveloffset=+1] \ No newline at end of file diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc index 539f024c3453..60e49ab83256 100644 --- a/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc +++ b/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc @@ -9,41 +9,7 @@ include::_attributes/attributes-openshift-dedicated.adoc[] toc::[] -You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application. +You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application (DPA). -[id="oadp-backup-location-contains-invalid-directories_{context}"] -== Backup storage contains invalid directories - -The `Velero` pod log displays the following error message: `Backup storage contains invalid top-level directories`. - -.Cause - -The object storage contains top-level directories that are not Velero directories. - -.Solution - -If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the `spec.backupLocations.velero.objectStorage.prefix` parameter in the `DataProtectionApplication` manifest. - -[id="oadp-incorrect-aws-credentials_{context}"] -== Incorrect AWS credentials - -The `oadp-aws-registry` pod log displays the following error message: `InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.` - -The `Velero` pod log displays the following error message: `NoCredentialProviders: no valid providers in chain`. - -.Cause - -The `credentials-velero` file used to create the `Secret` object is incorrectly formatted. - -.Solution - -Ensure that the `credentials-velero` file is correctly formatted, as in the following example: - -.Example `credentials-velero` file ----- -[default] <1> -aws_access_key_id=AKIAIOSFODNN7EXAMPLE <2> -aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY ----- -<1> AWS default profile. -<2> Do not enclose the values with quotation marks (`"`, `'`). +include::modules/resolving-backup-storage-contains-invalid-directories-issue.adoc[leveloffset=+1] +include::modules/resolving-incorrect-aws-credentials-issue.adoc[leveloffset=+1] \ No newline at end of file diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-operator-issues.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-operator-issues.adoc index 4e9c377b59ef..73759d8731a3 100644 --- a/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-operator-issues.adoc +++ b/backup_and_restore/application_backup_and_restore/troubleshooting/oadp-operator-issues.adoc @@ -11,83 +11,4 @@ toc::[] The {oadp-first} Operator might encounter issues caused by problems it is not able to resolve. -[id="oadp-operator-fails-silently_{context}"] -== OADP Operator fails silently - -The S3 buckets of an OADP Operator might be empty, but when you run the command `oc get po -n `, you see that the Operator has a status of `Running`. In such a case, the Operator is said to have _failed silently_ because it incorrectly reports that it is running. 
- -.Cause - -The problem is caused when cloud credentials provide insufficient permissions. - -.Solution - -Retrieve a list of backup storage locations (BSLs) and check the manifest of each BSL for credential issues. - -.Procedure - -. Retrieve a list of BSLs by using either the OpenShift or Velero command-line interface (CLI): -.. Retrieve a list of BSLs by using the OpenShift CLI (`oc`): -+ -[source,terminal] ----- -$ oc get backupstoragelocations.velero.io -A ----- -.. Retrieve a list of BSLs by using the `velero` CLI: -+ -[source,terminal] ----- -$ velero backup-location get -n ----- - -. Use the list of BSLs from the previous step and run the following command to examine the manifest of each BSL for an error: -+ -[source,terminal] ----- -$ oc get backupstoragelocations.velero.io -n -o yaml ----- -+ -.Example result -[source, yaml] ----- -apiVersion: v1 -items: -- apiVersion: velero.io/v1 - kind: BackupStorageLocation - metadata: - creationTimestamp: "2023-11-03T19:49:04Z" - generation: 9703 - name: example-dpa-1 - namespace: openshift-adp-operator - ownerReferences: - - apiVersion: oadp.openshift.io/v1alpha1 - blockOwnerDeletion: true - controller: true - kind: DataProtectionApplication - name: example-dpa - uid: 0beeeaff-0287-4f32-bcb1-2e3c921b6e82 - resourceVersion: "24273698" - uid: ba37cd15-cf17-4f7d-bf03-8af8655cea83 - spec: - config: - enableSharedConfig: "true" - region: us-west-2 - credential: - key: credentials - name: cloud-credentials - default: true - objectStorage: - bucket: example-oadp-operator - prefix: example - provider: aws - status: - lastValidationTime: "2023-11-10T22:06:46Z" - message: "BackupStorageLocation \"example-dpa-1\" is unavailable: rpc - error: code = Unknown desc = WebIdentityErr: failed to retrieve credentials\ncaused - by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus - code: 403, request id: d3f2e099-70a0-467b-997e-ff62345e3b54" - phase: Unavailable -kind: List -metadata: - resourceVersion: "" ----- \ No newline at end of file +include::modules/resolving-oadp-operator-fails-silently-issue.adoc[leveloffset=+1] \ No newline at end of file diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc index a6b840eee9b4..6afdf329f145 100644 --- a/backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc +++ b/backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc @@ -11,9 +11,8 @@ include::_attributes/attributes-openshift-dedicated.adoc[] toc::[] -If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources. +If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources. The values for the resource request fields must follow the same format as Kubernetes resource requirements. -The values for the resource request fields must follow the same format as Kubernetes resource requirements. 
If you do not specify `configuration.velero.podConfig.resourceAllocations` or `configuration.restic.podConfig.resourceAllocations`, see the following default `resources` specification configuration for a Velero or Restic pod: [source,yaml] diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc index d0f17e9be6d3..e3f844a2d3e9 100644 --- a/backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc +++ b/backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc @@ -9,82 +9,14 @@ include::_attributes/attributes-openshift-dedicated.adoc[] toc::[] -You might encounter these issues when you back up applications with Restic. +You might encounter the following issues when you back up applications with Restic: -[id="restic-permission-error-nfs-root-squash-enabled_{context}"] -== Restic permission error for NFS data volumes with root_squash enabled +* Restic permission error for NFS data volumes with the `root_squash` resource/parameter enabled +* Restic `Backup` CR cannot be recreated after bucket is emptied +* Restic restore partially failing on {product-title} 4.14 due to changed pod security admission (PSA) policy -The `Restic` pod log displays the following error message: `controller=pod-volume-backup error="fork/exec/usr/bin/restic: permission denied"`. +include::modules/restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled.adoc[leveloffset=+1] -.Cause - -If your NFS data volumes have `root_squash` enabled, `Restic` maps to `nfsnobody` and does not have permission to create backups. - -.Solution - -You can resolve this issue by creating a supplemental group for `Restic` and adding the group ID to the `DataProtectionApplication` manifest: - -. Create a supplemental group for `Restic` on the NFS data volume. -. Set the `setgid` bit on the NFS directories so that group ownership is inherited. -. Add the `spec.configuration.nodeAgent.supplementalGroups` parameter and the group ID to the `DataProtectionApplication` manifest, as shown in the following example: -+ -[source,yaml] ----- -apiVersion: oadp.openshift.io/v1alpha1 -kind: DataProtectionApplication -# ... -spec: - configuration: - nodeAgent: - enable: true - uploaderType: restic - supplementalGroups: - - <1> -# ... ----- -<1> Specify the supplemental group ID. - -. Wait for the `Restic` pods to restart so that the changes are applied. - -[id="restic-backup-cannot-be-recreated-after-s3-bucket-emptied_{context}"] -== Restic Backup CR cannot be recreated after bucket is emptied - -If you create a Restic `Backup` CR for a namespace, empty the object storage bucket, and then recreate the `Backup` CR for the same namespace, the recreated `Backup` CR fails. - -The `velero` pod log displays the following error message: `stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?`. - -.Cause - -Velero does not recreate or update the Restic repository from the `ResticRepository` manifest if the Restic directories are deleted from object storage. See link:https://github.com/vmware-tanzu/velero/issues/4421[Velero issue 4421] for more information. 
- -.Solution - -* Remove the related Restic repository from the namespace by running the following command: -+ -[source,terminal] ----- -$ oc delete resticrepository openshift-adp ----- -+ - -In the following error log, `mysql-persistent` is the problematic Restic repository. The name of the repository appears in italics for clarity. -+ -[source,text,options="nowrap",subs="+quotes,verbatim"] ----- - time="2021-12-29T18:29:14Z" level=info msg="1 errors - encountered backup up item" backup=velero/backup65 - logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds - time="2021-12-29T18:29:14Z" level=error msg="Error backing up item" - backup=velero/backup65 error="pod volume backup failed: error running - restic backup, stderr=Fatal: unable to open config file: Stat: The - specified key does not exist.\nIs there a repository at the following - location?\ns3:http://minio-minio.apps.mayap-oadp- - veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/ - restic/_mysql-persistent_\n: exit status 1" error.file="/remote-source/ - src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184" - error.function="github.com/vmware-tanzu/velero/ - pkg/restic.(*backupper).BackupPodVolumes" - logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds ----- +include::modules/restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied.adoc[leveloffset=+1] include::modules/oadp-restic-restore-failing-psa-policy.adoc[leveloffset=+1] \ No newline at end of file diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting/issues-with-velero-and-admission-webhooks.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc similarity index 66% rename from backup_and_restore/application_backup_and_restore/troubleshooting/issues-with-velero-and-admission-webhooks.adoc rename to backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc index 59ab72e648d5..2259b489f7a2 100644 --- a/backup_and_restore/application_backup_and_restore/troubleshooting/issues-with-velero-and-admission-webhooks.adoc +++ b/backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc @@ -1,9 +1,9 @@ :_mod-docs-content-type: ASSEMBLY -[id="issues-with-velero-and-admission-webhooks"] -= Issues with Velero and admission webhooks +[id="restoring-workarounds-for-velero-backups-that-use-admission-webhooks"] += Restoring workarounds for Velero backups that use admission webhooks include::_attributes/common-attributes.adoc[] include::_attributes/attributes-openshift-dedicated.adoc[] -:context: issues-with-velero-and-admission-webhooks +:context: restoring-workarounds-for-velero-backups-that-use-admission-webhooks :namespace: openshift-adp :local-product: OADP @@ -15,19 +15,20 @@ Typically, workloads with admission webhooks require you to create a resource of For example, creating or restoring a top-level object such as `service.serving.knative.dev` typically creates child resources automatically. If you do this first, you will not need to use Velero to create and restore these resources. This avoids the problem of child resources being blocked by an admission webhook that Velero might use. 
-[id="velero-restore-workarounds-for-workloads-with-admission-webhooks_{context}"] -== Restoring workarounds for Velero backups that use admission webhooks +[NOTE] +==== +Velero plugins are started as separate processes. After a Velero operation has completed, either successfully or not, it exits. +Receiving a `received EOF, stopping recv loop` message in the debug logs indicates that a plugin operation has completed. It does not mean that an error has occurred. +==== -You need additional steps to restore resources for several types of Velero backups that use admission webhooks. +include::modules/migration-debugging-velero-admission-webhooks-knative.adoc[leveloffset=+1] +include::modules/migration-debugging-velero-admission-webhooks-ibm-appconnect.adoc[leveloffset=+1] +include::modules/avoiding-the-velero-plugin-panic-error.adoc[leveloffset=+1] +include::modules/workaround-for-openshift-adp-controller-segmentation-fault.adoc[leveloffset=+1] -include::modules/migration-debugging-velero-admission-webhooks-knative.adoc[leveloffset=+2] -include::modules/migration-debugging-velero-admission-webhooks-ibm-appconnect.adoc[leveloffset=+2] -include::modules/oadp-features-plugins-known-issues.adoc[leveloffset=+1] -include::modules/oadp-plugins-receiving-eof-message.adoc[leveloffset=+1] [role="_additional-resources"] .Additional resources - * xref:../../../architecture/admission-plug-ins.adoc#admission-plug-ins[Admission plugins] * xref:../../../architecture/admission-plug-ins.adoc#admission-webhooks-about_admission-plug-ins[Webhook admission plugins] * xref:../../../architecture/admission-plug-ins.adoc#admission-webhook-types_admission-plug-ins[Types of webhook admission plugins] diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting/troubleshooting.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting/troubleshooting.adoc index cec7d3222c29..216ac153bff6 100644 --- a/backup_and_restore/application_backup_and_restore/troubleshooting/troubleshooting.adoc +++ b/backup_and_restore/application_backup_and_restore/troubleshooting/troubleshooting.adoc @@ -18,7 +18,7 @@ You can troubleshoot OADP issues by using the following methods: * Debug Velero or Restic pod crashes, which are caused due to a lack of memory or CPU by using xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc#pods-crash-or-restart-due-to-lack-of-memory-or-cpu[Pods crash or restart due to lack of memory or CPU]. -* Debug issues with Velero and admission webhooks by using xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting/issues-with-velero-and-admission-webhooks.adoc#issues-with-velero-and-admission-webhooks[Issues with Velero and admission webhooks]. +* Debug issues with Velero and admission webhooks by using xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc#restoring-workarounds-for-velero-backups-that-use-admission-webhooks[Restoring workarounds for Velero backups that use admission webhooks]. 
* Check xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc#oadp-installation-issues[OADP installation issues], xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting/oadp-operator-issues.adoc#oadp-operator-issues[OADP Operator issues], xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc#backup-and-restore-cr-issues[backup and restore CR issues], and xref:../../../backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc#restic-issues[Restic issues]. diff --git a/modules/avoiding-the-velero-plugin-panic-error.adoc b/modules/avoiding-the-velero-plugin-panic-error.adoc new file mode 100644 index 000000000000..39e0a53ad9df --- /dev/null +++ b/modules/avoiding-the-velero-plugin-panic-error.adoc @@ -0,0 +1,51 @@ +// Module included in the following assemblies: +// oadp-features-plugins-known-issues +// * backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc +// +:_mod-docs-content-type: PROCEDURE + +[id="avoiding-the-velero-plugin-panic-error_{context}"] += Avoiding the Velero plugin panic error + +A missing secret can cause a panic error for the Velero plugin during image stream backups. + +When the backup and the Backup Storage Location (BSL) are managed outside the scope of the Data Protection Application (DPA), the OADP controller does not create the relevant `oadp---registry-secret` parameter. + +During the backup operation, the OpenShift Velero plugin panics on the imagestream backup, with the following panic error: + +[source,text] +---- +024-02-27T10:46:50.028951744Z time="2024-02-27T10:46:50Z" level=error msg="Error backing up item" +backup=openshift-adp/ error="error executing custom action (groupResource=imagestreams.image.openshift.io, +namespace=, name=postgres): rpc error: code = Aborted desc = plugin panicked: +runtime error: index out of range with length 1, stack trace: goroutine 94… +---- + +Use the following workaround to avoid the Velero plugin panic error. + +.Procedure + +. Label the custom BSL with the relevant label by using the following command: ++ +[source,terminal] +---- +$ oc label backupstoragelocations.velero.io app.kubernetes.io/component=bsl +---- + +. After the BSL is labeled, wait until the DPA reconciles. ++ +[NOTE] +==== +You can force the reconciliation by making any minor change to the DPA itself. 
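+
+After you make such a change, you can watch for the reconciliation to complete by checking the `Reconciled` condition on the DPA. The resource name in this example is a placeholder for your environment:
+
+[source,terminal]
+----
+$ oc -n openshift-adp get dpa <dpa_name> -o jsonpath='{.status.conditions}'
+----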
+==== + + +.Verification + +* After the DPA is reconciled, confirm that the parameter has been created and that the correct registry data has been populated into it by entering the following command: ++ +[source,terminal] +---- +$ oc -n openshift-adp get secret/oadp---registry-secret -o json | jq -r '.data' +---- \ No newline at end of file diff --git a/modules/migration-debugging-velero-admission-webhooks-ibm-appconnect.adoc b/modules/migration-debugging-velero-admission-webhooks-ibm-appconnect.adoc index 2b81d5ef8a1e..fe801e36b892 100644 --- a/modules/migration-debugging-velero-admission-webhooks-ibm-appconnect.adoc +++ b/modules/migration-debugging-velero-admission-webhooks-ibm-appconnect.adoc @@ -1,6 +1,6 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc :_mod-docs-content-type: PROCEDURE [id="migration-debugging-velero-admission-webhooks-ibm-appconnect_{context}"] = Restoring {ibm-title} AppConnect resources @@ -9,7 +9,7 @@ If you experience issues when you use Velero to a restore an {ibm-name} AppConne .Procedure -. Check if you have any mutating admission plugins of `kind: MutatingWebhookConfiguration` in the cluster: +. Check if you have any mutating admission plugins of `kind: MutatingWebhookConfiguration` in the cluster by entering/running the following command: + [source,terminal] ---- diff --git a/modules/migration-debugging-velero-admission-webhooks-knative.adoc b/modules/migration-debugging-velero-admission-webhooks-knative.adoc index 489975700820..8647f9c10e67 100644 --- a/modules/migration-debugging-velero-admission-webhooks-knative.adoc +++ b/modules/migration-debugging-velero-admission-webhooks-knative.adoc @@ -1,21 +1,22 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc +// :_mod-docs-content-type: PROCEDURE [id="migration-debugging-velero-admission-webhooks-knative_{context}"] = Restoring Knative resources You might encounter problems using Velero to back up Knative resources that use admission webhooks. -You can avoid such problems by restoring the top level `Service` resource first whenever you back up and restore Knative resources that use admission webhooks. +You can avoid such problems by restoring the top level `Service` resource whenever you back up and restore Knative resources that use admission webhooks. 
.Procedure -* Restore the top level `service.serving.knavtive.dev Service` resource: +* Restore the top level `service.serving.knavtive.dev Service` resource by using the following command: + [source,terminal] ---- $ velero restore \ --from-backup= --include-resources \ service.serving.knavtive.dev ----- +---- \ No newline at end of file diff --git a/modules/migration-debugging-velero-resources.adoc b/modules/migration-debugging-velero-resources.adoc index 03862d5020c7..363e7f74191c 100644 --- a/modules/migration-debugging-velero-resources.adoc +++ b/modules/migration-debugging-velero-resources.adoc @@ -1,99 +1,87 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/velero-cli-tool.adoc // * migrating_from_ocp_3_to_4/troubleshooting-3-4.adoc // * migration_toolkit_for_containers/troubleshooting-mtc - +:_mod-docs-content-type: PROCEDURE [id="migration-debugging-velero-resources_{context}"] = Debugging Velero resources with the Velero CLI tool -You can debug `Backup` and `Restore` custom resources (CRs) and retrieve logs with the Velero CLI tool. - -The Velero CLI tool provides more detailed information than the OpenShift CLI tool. - -[discrete] -[id="velero-command-syntax_{context}"] -== Syntax +You can debug `Backup` and `Restore` custom resources (CRs) and retrieve logs with the Velero CLI tool. The Velero CLI tool provides more detailed information than the OpenShift CLI tool. -Use the `oc exec` command to run a Velero CLI command: +.Procedure +* Use the `oc exec` command to run a Velero CLI command: ++ [source,terminal,subs="attributes+"] ---- $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ ---- - -.Example ++ +.Example `oc exec` command [source,terminal,subs="attributes+"] ---- $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql ---- -[discrete] -[id="velero-help-option_{context}"] -== Help option - -Use the `velero --help` option to list all Velero CLI commands: - +* List all Velero CLI commands by using the following `velero --help` option: ++ [source,terminal,subs="attributes+"] ---- $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ --help ---- - -[discrete] -[id="velero-describe-command_{context}"] -== Describe command - -Use the `velero describe` command to retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR: - +* Retrieve the logs of a `Backup` or `Restore` CR by using the following `velero logs` command: ++ [source,terminal,subs="attributes+"] ---- $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ - describe + logs ---- - -.Example ++ +.Example `velero logs` command [source,terminal,subs="attributes+"] ---- $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \ - backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql + restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf ---- -The following types of restore errors and warnings are shown in the output of a `velero describe` request: - -* `Velero`: A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on -* `Cluster`: A list of messages related to backing up or restoring cluster-scoped resources -* `Namespaces`: A list of list of messages related to backing up or restoring resources stored in namespaces - -One or 
more errors in one of these categories results in a `Restore` operation receiving the status of `PartiallyFailed` and not `Completed`. Warnings do not lead to a change in the completion status.
-
-[IMPORTANT]
-====
-* For resource-specific errors, that is, `Cluster` and `Namespaces` errors, the `restore describe --details` output includes a resource list that lists all resources that Velero succeeded in restoring. For any resource that has such an error, check to see if the resource is actually in the cluster.
-
-* If there are `Velero` errors, but no resource-specific errors, in the output of a `describe` command, it is possible that the restore completed without any actual problems in restoring workloads, but carefully validate post-restore applications.
+* Retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR by using the following `velero describe` command:
 +
-For example, if the output contains `PodVolumeRestore` or node agent-related errors, check the status of `PodVolumeRestores` and `DataDownloads`. If none of these are failed or still running, then volume data might have been fully restored.
-====
-
-[discrete]
-[id="velero-logs-command_{context}"]
-== Logs command
-
-Use the `velero logs` command to retrieve the logs of a `Backup` or `Restore` CR:
-
 [source,terminal,subs="attributes+"]
 ----
 $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
-  logs
+  describe
 ----
-
-.Example
++
+.Example `velero describe` command
 [source,terminal,subs="attributes+"]
 ----
 $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
-  restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf
+  backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
 ----
++
+The following types of restore errors and warnings are shown in the output of a `velero describe` request:
++
+.`Velero`
+A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on
++
+.`Cluster`
+A list of messages related to backing up or restoring cluster-scoped resources
++
+.`Namespaces`
+A list of messages related to backing up or restoring resources stored in namespaces
++
+One or more errors in one of these categories results in a `Restore` operation receiving the status of `PartiallyFailed` and not `Completed`. Warnings do not lead to a change in the completion status.
++
+Consider the following points for these restore errors:
+
+* For resource-specific errors, that is, `Cluster` and `Namespaces` errors, the `restore describe --details` output includes a resource list that includes all resources that Velero restored. For any resource that has such an error, check if the resource is actually in the cluster.
+
+* If there are `Velero` errors but no resource-specific errors in the output of a `describe` command, it is possible that the restore completed without any actual problems in restoring workloads. In this case, carefully validate post-restore applications.
++
+For example, if the output contains `PodVolumeRestore` or node agent-related errors, check the status of `PodVolumeRestores` and `DataDownloads`. If none of these are failed or still running, then volume data might have been fully restored.
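++
+A check similar to the following lists both resource types in the {oadp-short} namespace so that you can see whether any of them are failed or still running; the `datadownloads.velero.io` resource type is present only when the built-in Data Mover is in use:
++
+[source,terminal,subs="attributes+"]
+----
+$ oc -n {namespace} get podvolumerestores.velero.io,datadownloads.velero.io
+----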
diff --git a/modules/oadp-creating-alerting-rule.adoc b/modules/oadp-creating-alerting-rule.adoc index 6a3b31cf4407..1bf7fc58023d 100644 --- a/modules/oadp-creating-alerting-rule.adoc +++ b/modules/oadp-creating-alerting-rule.adoc @@ -1,16 +1,16 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-monitoring.adoc :_mod-docs-content-type: PROCEDURE [id="creating-alerting-rules_{context}"] = Creating an alerting rule -The {product-title} monitoring stack allows to receive Alerts configured using Alerting Rules. To create an Alerting rule for the OADP project, use one of the Metrics which are scraped with the user workload monitoring. +The {product-title} monitoring stack receives Alerts configured by using Alerting Rules. To create an Alerting rule for the {oadp-short} project, use one of the Metrics scraped with the user workload monitoring. .Procedure -. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`. +. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`: + .Sample `OADPBackupFailing` alert [source,yaml] @@ -38,9 +38,9 @@ spec: + In this sample, the Alert displays under the following conditions: + -* There is an increase of new failing backups during the 2 last hours that is greater than 0 and the state persists for at least 5 minutes. -* If the time of the first increase is less than 5 minutes, the Alert will be in a `Pending` state, after which it will turn into a `Firing` state. -+ +* During the last 2 hours, the number of new failing backups was greater than 0 and the state persisted for at least 5 minutes. +* If the time of the first increase is less than 5 minutes, the Alert is in a `Pending` state, after which it turns into a `Firing` state. + . Apply the `4_create_oadp_alert_rule.yaml` file, which creates the `PrometheusRule` object in the `openshift-adp` namespace: + [source,terminal] @@ -55,12 +55,11 @@ prometheusrule.monitoring.coreos.com/sample-oadp-alert created ---- .Verification + * After the Alert is triggered, you can view it in the following ways: ** In the *Developer* perspective, select the *Observe* menu. ** In the *Administrator* perspective under the *Observe* -> *Alerting* menu, select *User* in the *Filter* box. Otherwise, by default only the *Platform* Alerts are displayed. + .OADP backup failing alert -image::oadp-backup-failing-alert.png[OADP backup failing alert] - - +image::oadp-backup-failing-alert.png[OADP backup failing alert] \ No newline at end of file diff --git a/modules/oadp-creating-service-monitor.adoc b/modules/oadp-creating-service-monitor.adoc index cfe547bb8067..18c19374de4d 100644 --- a/modules/oadp-creating-service-monitor.adoc +++ b/modules/oadp-creating-service-monitor.adoc @@ -1,19 +1,17 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-monitoring.adoc :_mod-docs-content-type: PROCEDURE [id="oadp-creating-service-monitor_{context}"] = Creating OADP service monitor -OADP provides an `openshift-adp-velero-metrics-svc` service which is created when the DPA is configured. The service monitor used by the user workload monitoring must point to the defined service. 
- -Get details about the service by running the following commands: +{oadp-short} provides an `openshift-adp-velero-metrics-svc` service, which is created when the Data Protection Application (DPA) is configured. The user workload monitoring service monitor must point to the defined service. +To get details about the service, complete the following steps. .Procedure -. Ensure the `openshift-adp-velero-metrics-svc` service exists. It should contain `app.kubernetes.io/name=velero` label, which will be used as selector for the `ServiceMonitor` object. - +. Ensure that the `openshift-adp-velero-metrics-svc` service exists. It should contain `app.kubernetes.io/name=velero` label, which is used as selector for the `ServiceMonitor` object. + [source,terminal] ---- @@ -26,7 +24,7 @@ $ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE openshift-adp-velero-metrics-svc ClusterIP 172.30.38.244 8085/TCP 1h ---- -+ + . Create a `ServiceMonitor` YAML file that matches the existing service label, and save the file as `3_create_oadp_service_monitor.yaml`. The service monitor is created in the `openshift-adp` namespace where the `openshift-adp-velero-metrics-svc` service resides. + .Example `ServiceMonitor` object @@ -50,7 +48,7 @@ spec: matchLabels: app.kubernetes.io/name: "velero" ---- -+ + . Apply the `3_create_oadp_service_monitor.yaml` file: + [source,terminal] diff --git a/modules/oadp-csi-snapshot-timeouts.adoc b/modules/oadp-csi-snapshot-timeouts.adoc index 175a043f8eb6..b24e72da0523 100644 --- a/modules/oadp-csi-snapshot-timeouts.adoc +++ b/modules/oadp-csi-snapshot-timeouts.adoc @@ -1,10 +1,10 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-timeouts.adoc +// :_mod-docs-content-type: PROCEDURE [id="csisnapshot-timeout_{context}"] -= CSI snapshot timeout += Implementing CSI snapshot timeout `CSISnapshotTimeout` specifies the time during creation to wait until the `CSI VolumeSnapshot` status becomes `ReadyToUse`, before returning error as timeout. The default value is `10m`. @@ -19,7 +19,8 @@ Typically, the default value for `CSISnapshotTimeout` does not require adjustmen ==== .Procedure -* Edit the values in the `spec.csiSnapshotTimeout` block of the `Backup` CR manifest, as in the following example: + +* Edit the values in the `spec.csiSnapshotTimeout` block of the `Backup` CR manifest, as shown in the following example: + [source,yaml] ---- @@ -30,4 +31,4 @@ metadata: spec: csiSnapshotTimeout: 10m # ... ----- +---- \ No newline at end of file diff --git a/modules/oadp-datamover-timeouts.adoc b/modules/oadp-datamover-timeouts.adoc index f0d0d77823d2..0ca086e6128d 100644 --- a/modules/oadp-datamover-timeouts.adoc +++ b/modules/oadp-datamover-timeouts.adoc @@ -1,10 +1,10 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-timeouts.adoc +// :_mod-docs-content-type: PROCEDURE [id="datamover-timeout_{context}"] -= Data Mover timeout += Implementing Data Mover timeout `timeout` is a user-supplied timeout to complete `VolumeSnapshotBackup` and `VolumeSnapshotRestore`. The default value is `10m`. @@ -16,7 +16,8 @@ Use the Data Mover `timeout` for the following scenarios: * Only with OADP 1.1.x. 
.Procedure -* Edit the values in the `spec.features.dataMover.timeout` block of the `DataProtectionApplication` CR manifest, as in the following example: + +* Edit the values in the `spec.features.dataMover.timeout` block of the `DataProtectionApplication` CR manifest, as shown in the following example: + [source,yaml] ---- @@ -29,4 +30,4 @@ spec: dataMover: timeout: 10m # ... ----- +---- \ No newline at end of file diff --git a/modules/oadp-debugging-oc-cli.adoc b/modules/oadp-debugging-oc-cli.adoc index 3c470d2ad1d0..3718daf0b64f 100644 --- a/modules/oadp-debugging-oc-cli.adoc +++ b/modules/oadp-debugging-oc-cli.adoc @@ -1,6 +1,6 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +//backup_and_restore/application_backup_and_restore/troubleshooting/velero-cli-tool.adoc :_mod-docs-content-type: REFERENCE [id="oadp-debugging-oc-cli_{context}"] @@ -8,39 +8,30 @@ You can debug a failed backup or restore by checking Velero custom resources (CRs) and the `Velero` pod log with the OpenShift CLI tool. -[discrete] -[id="oc-velero-cr_{context}"] -== Velero CRs - -Use the `oc describe` command to retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR: +.Procedure +* Retrieve a summary of warnings and errors associated with a `Backup` or `Restore` CR by using the following `oc describe` command: ++ [source,terminal] ---- $ oc describe ---- -[discrete] -[id="oc-velero-pod-logs_{context}"] -== Velero pod logs - -Use the `oc logs` command to retrieve the `Velero` pod logs: - +* Retrieve the `Velero` pod logs by using the following `oc logs` command: ++ [source,terminal] ---- $ oc logs pod/ ---- -[discrete] -[id="oc-velero-debug-logs_{context}"] -== Velero pod debug logs - -You can specify the Velero log level in the `DataProtectionApplication` resource as shown in the following example. - +* Specify the Velero log level in the `DataProtectionApplication` resource as shown in the following example. ++ [NOTE] ==== -This option is available starting from OADP 1.0.3. +This option is available starting from {oadp-short} 1.0.3. ==== - ++ +.Example Velero log level file [source,yaml] ---- apiVersion: oadp.openshift.io/v1alpha1 @@ -52,9 +43,8 @@ spec: velero: logLevel: warning ---- - ++ The following `logLevel` values are available: - * `trace` * `debug` * `info` @@ -62,5 +52,5 @@ The following `logLevel` values are available: * `error` * `fatal` * `panic` - -It is recommended to use the `info` `logLevel` value for most logs. ++ +Use the `info` `logLevel` value for most logs. 
\ No newline at end of file diff --git a/modules/oadp-features-plugins-known-issues.adoc b/modules/oadp-features-plugins-known-issues.adoc deleted file mode 100644 index 09c0d19aefa5..000000000000 --- a/modules/oadp-features-plugins-known-issues.adoc +++ /dev/null @@ -1,70 +0,0 @@ -// Module included in the following assemblies: -// oadp-features-plugins-known-issues -// * backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - -:_mod-docs-content-type: CONCEPT -[id="oadp-features-plugins-known-issues_{context}"] -= OADP plugins known issues - -The following section describes known issues in {oadp-first} plugins: - -[id="velero-plugin-panic_{context}"] -== Velero plugin panics during imagestream backups due to a missing secret - -When the backup and the Backup Storage Location (BSL) are managed outside the scope of the Data Protection Application (DPA), the OADP controller, meaning the DPA reconciliation does not create the relevant `oadp---registry-secret`. - -When the backup is run, the OpenShift Velero plugin panics on the imagestream backup, with the following panic error: - -[source,terminal] ----- -024-02-27T10:46:50.028951744Z time="2024-02-27T10:46:50Z" level=error msg="Error backing up item" -backup=openshift-adp/ error="error executing custom action (groupResource=imagestreams.image.openshift.io, -namespace=, name=postgres): rpc error: code = Aborted desc = plugin panicked: -runtime error: index out of range with length 1, stack trace: goroutine 94… ----- - -[id="velero-plugin-panic-workaround_{context}"] -=== Workaround to avoid the panic error - -To avoid the Velero plugin panic error, perform the following steps: - -. Label the custom BSL with the relevant label: -+ -[source,terminal] ----- -$ oc label backupstoragelocations.velero.io app.kubernetes.io/component=bsl ----- - -. After the BSL is labeled, wait until the DPA reconciles. -+ -[NOTE] -==== -You can force the reconciliation by making any minor change to the DPA itself. -==== - -. When the DPA reconciles, confirm that the relevant `oadp---registry-secret` has been created and that the correct registry data has been populated into it: -+ -[source,terminal] ----- -$ oc -n openshift-adp get secret/oadp---registry-secret -o json | jq -r '.data' ----- - - -[id="openshift-adp-controller-manager-seg-fault_{context}"] -== OpenShift ADP Controller segmentation fault - -If you configure a DPA with both `cloudstorage` and `restic` enabled, the `openshift-adp-controller-manager` pod crashes and restarts indefinitely until the pod fails with a crash loop segmentation fault. - -You can have either `velero` or `cloudstorage` defined, because they are mutually exclusive fields. - -* If you have both `velero` and `cloudstorage` defined, the `openshift-adp-controller-manager` fails. -* If you have neither `velero` nor `cloudstorage` defined, the `openshift-adp-controller-manager` fails. - -For more information about this issue, see link:https://issues.redhat.com/browse/OADP-1054[OADP-1054]. - - -[id="openshift-adp-controller-manager-seg-fault-workaround_{context}"] -=== OpenShift ADP Controller segmentation fault workaround - -You must define either `velero` or `cloudstorage` when you configure a DPA. If you define both APIs in your DPA, the `openshift-adp-controller-manager` pod fails with a crash loop segmentation fault. 
diff --git a/modules/oadp-item-backup-timeouts.adoc b/modules/oadp-item-backup-timeouts.adoc index f7c698831131..34aa5fa38931 100644 --- a/modules/oadp-item-backup-timeouts.adoc +++ b/modules/oadp-item-backup-timeouts.adoc @@ -1,12 +1,12 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-timeouts.adoc +// :_mod-docs-content-type: PROCEDURE [id="item-operation-timeout-backup_{context}"] -= Item operation timeout - backup += Implementing item operation timeout - backup -`ItemOperationTimeout` specifies the time used to wait for asynchronous +The `ItemOperationTimeout` setting specifies the time used to wait for asynchronous `BackupItemAction` operations. The default value is `1h`. Use the backup `ItemOperationTimeout` for the following scenarios: @@ -15,7 +15,8 @@ Use the backup `ItemOperationTimeout` for the following scenarios: * For Data Mover uploads and downloads to or from the `BackupStorageLocation`. If the backup action is not completed when the timeout is reached, it will be marked as failed. If Data Mover operations are failing due to timeout issues, because of large storage volume sizes, then this timeout setting may need to be increased. .Procedure -* Edit the values in the `Backup.spec.itemOperationTimeout` block of the `Backup` CR manifest, as in the following example: + +* Edit the values in the `Backup.spec.itemOperationTimeout` block of the `Backup` CR manifest, as shown in the following example: + [source,yaml] ---- @@ -26,5 +27,4 @@ metadata: spec: itemOperationTimeout: 1h # ... ----- - +---- \ No newline at end of file diff --git a/modules/oadp-item-restore-timeouts.adoc b/modules/oadp-item-restore-timeouts.adoc index 504b60578c37..36ee99779d8d 100644 --- a/modules/oadp-item-restore-timeouts.adoc +++ b/modules/oadp-item-restore-timeouts.adoc @@ -1,12 +1,12 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-timeouts.adoc +// :_mod-docs-content-type: PROCEDURE [id="item-operation-timeout-restore_{context}"] -= Item operation timeout - restore += Implementing item operation timeout - restore -`ItemOperationTimeout` specifies the time that is used to wait for `RestoreItemAction` operations. The default value is `1h`. +The `ItemOperationTimeout` setting specifies the time that is used to wait for `RestoreItemAction` operations. The default value is `1h`. Use the restore `ItemOperationTimeout` for the following scenarios: @@ -14,7 +14,8 @@ Use the restore `ItemOperationTimeout` for the following scenarios: * For Data Mover uploads and downloads to or from the `BackupStorageLocation`. If the restore action is not completed when the timeout is reached, it will be marked as failed. If Data Mover operations are failing due to timeout issues, because of large storage volume sizes, then this timeout setting may need to be increased. 
.Procedure -* Edit the values in the `Restore.spec.itemOperationTimeout` block of the `Restore` CR manifest, as in the following example: + +* Edit the values in the `Restore.spec.itemOperationTimeout` block of the `Restore` CR manifest, as shown in the following example: + [source,yaml] ---- diff --git a/modules/oadp-list-of-metrics.adoc b/modules/oadp-list-of-metrics.adoc index aeb2b92c974b..2215af4ed6fb 100644 --- a/modules/oadp-list-of-metrics.adoc +++ b/modules/oadp-list-of-metrics.adoc @@ -1,6 +1,6 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-monitoring.adoc :_mod-docs-content-type: REFERENCE [id="list-of-metrics_{context}"] @@ -175,5 +175,4 @@ These are the list of metrics provided by the OADP together with their https://p |Total number of successful volume snapshots |Counter -|=== - +|=== \ No newline at end of file diff --git a/modules/oadp-monitoring-setup.adoc b/modules/oadp-monitoring-setup.adoc index 6234d9ff96a3..0eeb198312a1 100644 --- a/modules/oadp-monitoring-setup.adoc +++ b/modules/oadp-monitoring-setup.adoc @@ -1,6 +1,6 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-monitoring.adoc :_mod-docs-content-type: PROCEDURE [id="oadp-monitoring-setup-monitor_{context}"] @@ -13,19 +13,20 @@ With enabled User Workload Monitoring, it is possible to configure and use any P Monitoring metrics requires enabling monitoring for the user-defined projects and creating a `ServiceMonitor` resource to scrape those metrics from the already enabled OADP service endpoint that resides in the `openshift-adp` namespace. .Prerequisites + * You have access to an {product-title} cluster using an account with `cluster-admin` permissions. * You have created a cluster monitoring config map. .Procedure -. Edit the `cluster-monitoring-config` `ConfigMap` object in the `openshift-monitoring` namespace: +. Edit the `cluster-monitoring-config` `ConfigMap` object in the `openshift-monitoring` namespace by using the following command: + [source,terminal] ---- $ oc edit configmap cluster-monitoring-config -n openshift-monitoring ---- -. Add or enable the `enableUserWorkload` option in the `data` section's `config.yaml` field: +. Add or enable the `enableUserWorkload` option in the `data` section's `config.yaml` field by using the following command: + [source,yaml] ---- @@ -39,7 +40,7 @@ metadata: ---- <1> Add this option or set to `true` -. Wait a short period of time to verify the User Workload Monitoring Setup by checking if the following components are up and running in the `openshift-user-workload-monitoring` namespace: +. Wait a short period to verify the User Workload Monitoring Setup by checking that the following components are up and running in the `openshift-user-workload-monitoring` namespace: + [source,terminal] ---- @@ -85,10 +86,10 @@ data: config.yaml: | ---- + -. Apply the `2_configure_user_workload_monitoring.yaml` file: +. 
Apply the `2_configure_user_workload_monitoring.yaml` file by using the following command: + [source,terminal] ---- $ oc apply -f 2_configure_user_workload_monitoring.yaml configmap/user-workload-monitoring-config created ----- +---- \ No newline at end of file diff --git a/modules/oadp-plugins-receiving-eof-message.adoc b/modules/oadp-plugins-receiving-eof-message.adoc deleted file mode 100644 index a180969362c4..000000000000 --- a/modules/oadp-plugins-receiving-eof-message.adoc +++ /dev/null @@ -1,13 +0,0 @@ -// Module included in the following assemblies: -// -// * backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc - -:_mod-docs-content-type: CONCEPT -[id="oadp-plugins-receiving-eof-message_{context}"] - -= Velero plugins returning "received EOF, stopping recv loop" message - -[NOTE] -==== -Velero plugins are started as separate processes. After the Velero operation has completed, either successfully or not, they exit. Receiving a `received EOF, stopping recv loop` message in the debug logs indicates that a plugin operation has completed. It does not mean that an error has occurred. -==== diff --git a/modules/oadp-pod-crash-set-resource-request-restic.adoc b/modules/oadp-pod-crash-set-resource-request-restic.adoc index d0a053cdb93e..a5a65708bbe6 100644 --- a/modules/oadp-pod-crash-set-resource-request-restic.adoc +++ b/modules/oadp-pod-crash-set-resource-request-restic.adoc @@ -1,7 +1,7 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc +// :_mod-docs-content-type: PROCEDURE [id="oadp-pod-crash-resource-request-retics_{context}"] = Setting resource requests for a Restic pod diff --git a/modules/oadp-pod-crash-set-resource-request-velero.adoc b/modules/oadp-pod-crash-set-resource-request-velero.adoc index eb06b152eeaa..9399443ce193 100644 --- a/modules/oadp-pod-crash-set-resource-request-velero.adoc +++ b/modules/oadp-pod-crash-set-resource-request-velero.adoc @@ -1,7 +1,7 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc +// :_mod-docs-content-type: PROCEDURE [id="oadp-pod-crash-resource-request-velero_{context}"] = Setting resource requests for a Velero pod diff --git a/modules/oadp-restic-restore-failing-psa-policy.adoc b/modules/oadp-restic-restore-failing-psa-policy.adoc index b67f0fc98fe1..9f643369d2ff 100644 --- a/modules/oadp-restic-restore-failing-psa-policy.adoc +++ b/modules/oadp-restic-restore-failing-psa-policy.adoc @@ -1,10 +1,14 @@ +// Module included in the following assemblies: +// +// * backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc +// :_mod-docs-content-type: PROCEDURE [id="oadp-restic-restore-failing-psa-policy_{context}"] -= Restic restore partially failing on OCP 4.14 due to changed PSA policy += Troubleshooting restic restore partially failed issue on {ocp} 4.14 due to changed PSA policy -{ocp} 4.14 enforces a Pod Security Admission (PSA) policy that can hinder the readiness of pods during a Restic restore process.  +{ocp} 4.14 enforces a Pod Security Admission (PSA) policy that can hinder the readiness of pods during a Restic restore process. 
-If a `SecurityContextConstraints` (SCC) resource is not found when a pod is created, and the PSA policy on the pod is not set up to meet the required standards, pod admission is denied.  +If a `SecurityContextConstraints` (SCC) resource is not found when a pod is created, and the PSA policy on the pod is not set up to meet the required standards, pod admission is denied. This issue arises due to the resource restore order of Velero. @@ -35,7 +39,7 @@ logSource=\"/remote-source/velero/app/pkg/controller/restore_controller.go:510\" restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n]", ---- -.Solution +.Procedure . In your DPA custom resource (CR), check or set the `restore-resource-priorities` field on the Velero server to ensure that `securitycontextconstraints` is listed in order before `pods` in the list of resources: + diff --git a/modules/oadp-restic-timeouts.adoc b/modules/oadp-restic-timeouts.adoc index eeb2b4c13c00..ff68aeb8cea2 100644 --- a/modules/oadp-restic-timeouts.adoc +++ b/modules/oadp-restic-timeouts.adoc @@ -1,10 +1,11 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/oadp-timeouts.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-timeouts.adoc +// :_mod-docs-content-type: PROCEDURE [id="restic-timeout_{context}"] -= Restic timeout += Implementing restic timeout The `spec.configuration.nodeAgent.timeout` parameter defines the Restic timeout. The default value is `1h`. diff --git a/modules/oadp-velero-default-timeouts.adoc b/modules/oadp-velero-default-timeouts.adoc index 6362ea8e4670..19e1f38fcbc1 100644 --- a/modules/oadp-velero-default-timeouts.adoc +++ b/modules/oadp-velero-default-timeouts.adoc @@ -1,12 +1,12 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-timeouts.adoc +// :_mod-docs-content-type: PROCEDURE [id="velero-default-item-operation-timeout_{context}"] -= Velero default item operation timeout += Implementing velero default item operation timeout -`defaultItemOperationTimeout` defines how long to wait on asynchronous `BackupItemActions` and `RestoreItemActions` to complete before timing out. The default value is `1h`. +The `defaultItemOperationTimeout` setting defines how long to wait on asynchronous `BackupItemActions` and `RestoreItemActions` to complete before timing out. The default value is `1h`. Use the `defaultItemOperationTimeout` for the following scenarios: @@ -15,7 +15,8 @@ Use the `defaultItemOperationTimeout` for the following scenarios: * When `defaultItemOperationTimeout` is defined in the Data Protection Application (DPA) using the `defaultItemOperationTimeout`, it applies to both backup and restore operations. You can use `itemOperationTimeout` to define only the backup or only the restore of those CRs, as described in the following "Item operation timeout - restore", and "Item operation timeout - backup" sections. .Procedure -* Edit the values in the `spec.configuration.velero.defaultItemOperationTimeout` block of the `DataProtectionApplication` CR manifest, as in the following example: + +* Edit the values in the `spec.configuration.velero.defaultItemOperationTimeout` block of the `DataProtectionApplication` CR manifest, as shown in the following example: + [source,yaml] ---- @@ -28,5 +29,4 @@ spec: velero: defaultItemOperationTimeout: 1h # ... 
----- - +---- \ No newline at end of file diff --git a/modules/oadp-velero-timeouts.adoc b/modules/oadp-velero-timeouts.adoc index 7daa58918e6c..e2d1658c7667 100644 --- a/modules/oadp-velero-timeouts.adoc +++ b/modules/oadp-velero-timeouts.adoc @@ -1,23 +1,24 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc - +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-timeouts.adoc +// :_mod-docs-content-type: PROCEDURE [id="velero-timeout_{context}"] -= Velero resource timeout += Implementing velero resource timeout `resourceTimeout` defines how long to wait for several Velero resources before timeout occurs, such as Velero custom resource definition (CRD) availability, `volumeSnapshot` deletion, and repository availability. The default is `10m`. Use the `resourceTimeout` for the following scenarios: -* For backups with total PV data usage that is greater than 1TB. This parameter is used as a timeout value when Velero tries to clean up or delete the Container Storage Interface (CSI) snapshots, before marking the backup as complete. -** A sub-task of this cleanup tries to patch VSC and this timeout can be used for that task. -+ +* For backups with total PV data usage that is greater than 1 TB. This parameter is used as a timeout value when Velero tries to clean up or delete the Container Storage Interface (CSI) snapshots, before marking the backup as complete. +** A sub-task of this cleanup tries to patch VSC, and this timeout can be used for that task. + * To create or ensure a backup repository is ready for filesystem based backups for Restic or Kopia. * To check if the Velero CRD is available in the cluster before restoring the custom resource (CR) or resource from the backup. .Procedure -* Edit the values in the `spec.configuration.velero.resourceTimeout` block of the `DataProtectionApplication` CR manifest, as in the following example: + +* Edit the values in the `spec.configuration.velero.resourceTimeout` block of the `DataProtectionApplication` CR manifest, as shown in the following example: + [source,yaml] ---- @@ -30,4 +31,4 @@ spec: velero: resourceTimeout: 10m # ... ----- +---- \ No newline at end of file diff --git a/modules/resolving-backup-storage-contains-invalid-directories-issue.adoc b/modules/resolving-backup-storage-contains-invalid-directories-issue.adoc new file mode 100644 index 000000000000..1e3c2668ebf0 --- /dev/null +++ b/modules/resolving-backup-storage-contains-invalid-directories-issue.adoc @@ -0,0 +1,18 @@ +// Module included in the following assemblies: +// oadp-features-plugins-known-issues +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc +// +:_mod-docs-content-type: PROCEDURE + +[id="resolving-backup-storage-contains-invalid-directories-issue_{context}"] += Resolving invalid directories in backup storage + +The object storage contains top-level directories that are not Velero directories. The `Velero` pod log displays the following error message: +[source,text] +---- +Backup storage contains invalid top-level directories. +---- + +.Procedure + +* If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the `spec.backupLocations.velero.objectStorage.prefix` parameter in the `DataProtectionApplication` manifest. 
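+
+The following `DataProtectionApplication` excerpt is a minimal sketch of that prefix setting, not a definitive configuration; the bucket name, prefix, region, and credential values are placeholders that must match your environment:
+
+[source,yaml]
+----
+apiVersion: oadp.openshift.io/v1alpha1
+kind: DataProtectionApplication
+metadata:
+  name: <dpa_sample>
+spec:
+  backupLocations:
+  - velero:
+      provider: aws
+      default: true
+      config:
+        region: us-east-1
+      credential:
+        key: cloud
+        name: cloud-credentials
+      objectStorage:
+        bucket: <shared_bucket_name>
+        prefix: velero <1>
+----
+<1> Velero reads and writes only under this top-level directory, so other directories in the shared bucket no longer trigger the error.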
\ No newline at end of file diff --git a/modules/resolving-incorrect-aws-credentials-issue.adoc b/modules/resolving-incorrect-aws-credentials-issue.adoc new file mode 100644 index 000000000000..029b3d42978e --- /dev/null +++ b/modules/resolving-incorrect-aws-credentials-issue.adoc @@ -0,0 +1,38 @@ +// Module included in the following assemblies: +// oadp-features-plugins-known-issues +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-installation-issues.adoc +// +:_mod-docs-content-type: PROCEDURE + +[id="resolving-incorrect-aws-credentials-issue_{context}"] += Resolving incorrect {aws-short} credentials + +If the `credentials-velero` file that is used to create the `Secret` object is incorrectly formatted, multiple errors might occur, including the following examples: + +* The `oadp-aws-registry` pod log displays the following error message: ++ +[source,text] +---- +`InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.` +---- + +* The `Velero` pod log displays the following error message: ++ +[source,text] +---- +NoCredentialProviders: no valid providers in chain. +---- + + +.Procedure + +* Ensure that the `credentials-velero` file is correctly formatted, as shown in the following example: ++ +.Example `credentials-velero` file +---- +[default] <1> +aws_access_key_id=AKIAIOSFODNN7EXAMPLE <2> +aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY +---- +<1> {aws-short} default profile. +<2> Do not enclose the values with quotation marks (`"`, `'`). \ No newline at end of file diff --git a/modules/resolving-oadp-operator-fails-silently-issue.adoc b/modules/resolving-oadp-operator-fails-silently-issue.adoc new file mode 100644 index 000000000000..5f6a256c91dc --- /dev/null +++ b/modules/resolving-oadp-operator-fails-silently-issue.adoc @@ -0,0 +1,84 @@ +// Module included in the following assemblies: +// oadp-features-plugins-known-issues +// * backup_and_restore/application_backup_and_restore/troubleshooting/oadp-operator-issues.adoc +// +:_mod-docs-content-type: PROCEDURE + +[id="resolving-oadp-operator-fails-silently-issue_{context}"] += Resolving silent failure of the OADP Operator + +The S3 buckets of an OADP Operator might be empty, but when you run the command `oc get po -n `, you see that the Operator has a status of `Running`. + +In such a case, the Operator is said to have _failed silently_ because it incorrectly reports that it is running. The problem is caused when cloud credentials provide insufficient permissions. + +To fix this issue, retrieve a list of backup storage locations (BSLs) and check the manifest of each BSL for credential issues. + +.Procedure + +. Retrieve a list of BSLs by using either the OpenShift or Velero command-line interface (CLI): + +.. Retrieve a list of BSLs by using the OpenShift CLI (`oc`): ++ +[source,terminal] +---- +$ oc get backupstoragelocations.velero.io -A +---- + +.. Retrieve a list of BSLs by using the `velero` CLI: ++ +[source,terminal] +---- +$ velero backup-location get -n +---- + +. 
Use the list of BSLs from the previous step and run the following command to examine the manifest of each BSL for an error: ++ +[source,terminal] +---- +$ oc get backupstoragelocations.velero.io -n -o yaml +---- ++ +.Example result +[source, yaml] +---- +apiVersion: v1 +items: +- apiVersion: velero.io/v1 + kind: BackupStorageLocation + metadata: + creationTimestamp: "2023-11-03T19:49:04Z" + generation: 9703 + name: example-dpa-1 + namespace: openshift-adp-operator + ownerReferences: + - apiVersion: oadp.openshift.io/v1alpha1 + blockOwnerDeletion: true + controller: true + kind: DataProtectionApplication + name: example-dpa + uid: 0beeeaff-0287-4f32-bcb1-2e3c921b6e82 + resourceVersion: "24273698" + uid: ba37cd15-cf17-4f7d-bf03-8af8655cea83 + spec: + config: + enableSharedConfig: "true" + region: us-west-2 + credential: + key: credentials + name: cloud-credentials + default: true + objectStorage: + bucket: example-oadp-operator + prefix: example + provider: aws + status: + lastValidationTime: "2023-11-10T22:06:46Z" + message: "BackupStorageLocation \"example-dpa-1\" is unavailable: rpc + error: code = Unknown desc = WebIdentityErr: failed to retrieve credentials\ncaused + by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus + code: 403, request id: d3f2e099-70a0-467b-997e-ff62345e3b54" + phase: Unavailable +kind: List +metadata: + resourceVersion: "" +---- \ No newline at end of file diff --git a/modules/restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied.adoc b/modules/restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied.adoc new file mode 100644 index 000000000000..5f11a72beb96 --- /dev/null +++ b/modules/restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied.adoc @@ -0,0 +1,47 @@ +// Module included in the following assemblies: +// +// * backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc +// +:_mod-docs-content-type: PROCEDURE + +[id="restic-backup-cr-cannot-be-recreated-after-bucket-is-emptied_{context}"] += Troubleshooting Restic Backup CR issue that cannot be re-created after bucket is emptied + +Velero does not re-create or update the Restic repository from the `ResticRepository` manifest if the Restic directories are deleted from object storage. For more information, see link:https://github.com/vmware-tanzu/velero/issues/4421[Velero issue 4421]. + +If you create a Restic `Backup` CR for a namespace, empty the object storage bucket, and then re-create the `Backup` CR for the same namespace, the re-created `Backup` CR fails. In this case, the `velero` pod log displays the following error message: ++ +.Sample error +[source,text] +---- +stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location? +---- + +.Procedure + +* Remove the related Restic repository from the namespace by running the following command: ++ +[source,terminal] +---- +$ oc delete resticrepository openshift-adp +---- ++ +In the following error log, `mysql-persistent` is the problematic Restic repository. The name of the repository appears in italics for clarity. 
++
+[source,text,options="nowrap",subs="+quotes,verbatim"]
+----
+ time="2021-12-29T18:29:14Z" level=info msg="1 errors
+ encountered backup up item" backup=velero/backup65
+ logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
+ time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
+ backup=velero/backup65 error="pod volume backup failed: error running
+ restic backup, stderr=Fatal: unable to open config file: Stat: The
+ specified key does not exist.\nIs there a repository at the following
+ location?\ns3:http://minio-minio.apps.mayap-oadp-
+ veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
+ restic/_mysql-persistent_\n: exit status 1" error.file="/remote-source/
+ src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
+ error.function="github.com/vmware-tanzu/velero/
+ pkg/restic.(*backupper).BackupPodVolumes"
+ logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds
+----
\ No newline at end of file
diff --git a/modules/restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled.adoc b/modules/restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled.adoc
new file mode 100644
index 000000000000..87516586f47f
--- /dev/null
+++ b/modules/restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled.adoc
@@ -0,0 +1,43 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/application_backup_and_restore/troubleshooting/restic-issues.adoc
+//
+:_mod-docs-content-type: PROCEDURE
+
+[id="restic-permission-error-for-nfs-data-volumes-with-root-squash-enabled_{context}"]
+= Troubleshooting Restic permission errors for NFS data volumes
+
+If your NFS data volumes have the `root_squash` parameter enabled, `Restic` maps to the `nfsnobody` user and does not have permission to create backups. In this case, the `Restic` pod log displays the following error message:
+
+.Sample error
+[source,text]
+----
+controller=pod-volume-backup error="fork/exec/usr/bin/restic: permission denied".
+----
+You can resolve this issue by creating a supplemental group for `Restic` and adding the group ID to the `DataProtectionApplication` manifest.
+
+.Procedure
+
+. Create a supplemental group for `Restic` on the NFS data volume.
+
+. Set the `setgid` bit on the NFS directories so that group ownership is inherited.
+
+. Add the `spec.configuration.nodeAgent.supplementalGroups` parameter and the group ID to the `DataProtectionApplication` manifest, as shown in the following example:
++
+[source,yaml]
+----
+apiVersion: oadp.openshift.io/v1alpha1
+kind: DataProtectionApplication
+# ...
+spec:
+  configuration:
+    nodeAgent:
+      enable: true
+      uploaderType: restic
+      supplementalGroups:
+      - <1>
+# ...
+----
+<1> Specify the supplemental group ID.
+
+. Wait for the `Restic` pods to restart so that the changes are applied.
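+
+As a rough verification sketch, and assuming that the Operator propagates this setting to the node agent daemon set (named `node-agent` or `restic`, depending on the OADP version), the group ID is expected to appear in the pod security context of the restarted pods, similar to the following excerpt:
+
+[source,yaml]
+----
+# Hypothetical excerpt of the node agent daemon set after the DPA change
+spec:
+  template:
+    spec:
+      securityContext:
+        supplementalGroups:
+        - <supplemental_group_ID>
+----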
\ No newline at end of file diff --git a/modules/setting-resource-requests-for-a-nodeagent-pod.adoc b/modules/setting-resource-requests-for-a-nodeagent-pod.adoc index b077cf12714a..e3d1d82b93e1 100644 --- a/modules/setting-resource-requests-for-a-nodeagent-pod.adoc +++ b/modules/setting-resource-requests-for-a-nodeagent-pod.adoc @@ -1,6 +1,7 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/pods-crash-or-restart-due-to-lack-of-memory-or-cpu +// * backup_and_restore/application_backup_and_restore/troubleshooting/pods-crash-or-restart-due-to-lack-of-memory-or-cpu.adoc +// :_mod-docs-content-type: PROCEDURE [id="setting-resource-requests-for-a-nodeagent-pod_{context}"] diff --git a/modules/troubleshooting-backup-cr-cannot-retrieve-volume-issue.adoc b/modules/troubleshooting-backup-cr-cannot-retrieve-volume-issue.adoc new file mode 100644 index 000000000000..25d5804acc4a --- /dev/null +++ b/modules/troubleshooting-backup-cr-cannot-retrieve-volume-issue.adoc @@ -0,0 +1,22 @@ +// Module included in the following assemblies: +// +// * backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc +// +:_mod-docs-content-type: PROCEDURE + +[id="troubleshooting-backup-cr-cannot-retrieve-volume-issue_{context}"] += Troubleshooting issue where backup CR cannot retrieve volume + +If the persistent volume (PV) and the snapshot locations are in different regions, the `Backup` custom resource (CR) displays the following error message: + +.Sample error +[source,text] +---- +InvalidVolume.NotFound: The volume ‘vol-xxxx’ does not exist. +---- + +.Procedure + +. Edit the value of the `spec.snapshotLocations.velero.config.region` key in the `DataProtectionApplication` manifest so that the snapshot location is in the same region as the PV. + +. Create a new `Backup` CR. \ No newline at end of file diff --git a/modules/troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue.adoc b/modules/troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue.adoc new file mode 100644 index 000000000000..8fe9134854cd --- /dev/null +++ b/modules/troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue.adoc @@ -0,0 +1,37 @@ +// Module included in the following assemblies: +// +// * backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc +// +:_mod-docs-content-type: PROCEDURE + +[id="troubleshooting-backup-cr-status-remains-in-partiallyfailed-issue_{context}"] += Troubleshooting issue where backup CR status remains partially failed + +The status of a `Backup` CR without Restic in use remains in the `PartiallyFailed` phase and is not completed. A snapshot of the affiliated PVC is not created. + +If the backup created based on the CSI snapshot class is missing a label, the CSI snapshot plugin fails to create a snapshot. 
As a result, the `Velero` pod logs an error similar to the following message:
+
+[source,text]
+----
+time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
+----
+
+.Procedure
+
+. Delete the `Backup` CR by running the following command:
++
+[source,terminal]
+----
+$ oc delete backups.velero.io -n openshift-adp
+----
+
+. If required, clean up the stored data on the `BackupStorageLocation` resource to free up space.
+
+. Apply the `velero.io/csi-volumesnapshot-class=true` label to the `VolumeSnapshotClass` object by running the following command:
++
+[source,terminal]
+----
+$ oc label volumesnapshotclass/ velero.io/csi-volumesnapshot-class=true
+----
+
+. Create a new `Backup` CR.
\ No newline at end of file
diff --git a/modules/troubleshooting-backup-cr-status-remains-in-progress-issue.adoc b/modules/troubleshooting-backup-cr-status-remains-in-progress-issue.adoc
new file mode 100644
index 000000000000..36a192585f54
--- /dev/null
+++ b/modules/troubleshooting-backup-cr-status-remains-in-progress-issue.adoc
@@ -0,0 +1,38 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/application_backup_and_restore/troubleshooting/backup-and-restore-cr-issues.adoc
+//
+:_mod-docs-content-type: PROCEDURE
+
+[id="troubleshooting-backup-cr-status-remains-in-progress-issue_{context}"]
+= Troubleshooting issue where backup CR status remains in progress
+
+If a backup is interrupted, it cannot be resumed, and the status of a `Backup` custom resource (CR) remains in the `InProgress` phase and does not complete.
+
+.Procedure
+
+. Retrieve the details of the `Backup` CR by running the following command:
++
+[source,terminal]
+----
+$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
+  backup describe
+----
+
+. Delete the `Backup` CR by running the following command:
++
+[source,terminal]
+----
+$ oc delete backups.velero.io -n openshift-adp
+----
++
+You do not need to clean up the backup location because an in-progress `Backup` CR has not uploaded files to object storage.
+
+. Create a new `Backup` CR.
+
+. 
View the Velero backup details by running the following command: ++ +[source,terminal, subs="+quotes"] +---- +$ velero backup describe --details +---- \ No newline at end of file diff --git a/modules/velero-oadp-version-relationship.adoc b/modules/velero-oadp-version-relationship.adoc index e2bb2efc5ad9..dbb82eb50cc6 100644 --- a/modules/velero-oadp-version-relationship.adoc +++ b/modules/velero-oadp-version-relationship.adoc @@ -1,8 +1,7 @@ // Module included in the following assemblies: // // backup_and_restore/application_backup_and_restore/installing/oadp-installing-operator.adoc -// backup_and_restore/application_backup_and_restore/troubleshooting.adoc -// +// backup_and_restore/application_backup_and_restore/troubleshooting/velero-cli-tool.adoc :_mod-docs-content-type: CONCEPT [id="velero-oadp-version-relationship_{context}"] diff --git a/modules/velero-obtaining-by-accessing-binary.adoc b/modules/velero-obtaining-by-accessing-binary.adoc index fc5bd4468e92..87895b7f7dc1 100644 --- a/modules/velero-obtaining-by-accessing-binary.adoc +++ b/modules/velero-obtaining-by-accessing-binary.adoc @@ -1,6 +1,6 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +//backup_and_restore/application_backup_and_restore/troubleshooting/velero-cli-tool.adoc :_mod-docs-content-type: PROCEDURE [id="velero-obtaining-by-accessing-binary_{context}"] @@ -14,9 +14,9 @@ You can use a shell command to access the Velero binary in the Velero deployment .Procedure -* Enter the following command to set the needed alias: +* Set the needed alias by using the following command: + [source,terminal] ---- $ alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero' ----- +---- \ No newline at end of file diff --git a/modules/velero-obtaining-by-downloading.adoc b/modules/velero-obtaining-by-downloading.adoc index afa45a2f32d6..89a11ca584c1 100644 --- a/modules/velero-obtaining-by-downloading.adoc +++ b/modules/velero-obtaining-by-downloading.adoc @@ -1,14 +1,12 @@ // Module included in the following assemblies: // -// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc +// * backup_and_restore/application_backup_and_restore/troubleshooting/velero-cli-tool.adoc :_mod-docs-content-type: PROCEDURE [id="velero-obtaining-by-downloading_{context}"] = Downloading the Velero CLI tool -You can download and install the Velero CLI tool by following the instructions on the link:https://{velero-domain}/docs/v{velero-version}/basic-install/#install-the-cli[Velero documentation page]. - -The page includes instructions for: +You can download and install the Velero CLI tool by following the instructions on the link:https://{velero-domain}/docs/v{velero-version}/basic-install/#install-the-cli[Velero documentation page]. 
The page includes instructions for the following options:

 * macOS by using Homebrew
 * GitHub
diff --git a/modules/workaround-for-openshift-adp-controller-segmentation-fault.adoc b/modules/workaround-for-openshift-adp-controller-segmentation-fault.adoc
new file mode 100644
index 000000000000..fe2694728413
--- /dev/null
+++ b/modules/workaround-for-openshift-adp-controller-segmentation-fault.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+// oadp-features-plugins-known-issues
+// * backup_and_restore/application_backup_and_restore/oadp-features-plugins.adoc
+// * backup_and_restore/application_backup_and_restore/troubleshooting/restoring-workarounds-for-velero-backups-that-use-admission-webhooks.adoc
+//
+:_mod-docs-content-type: CONCEPT
+
+[id="workaround-for-openshift-adp-controller-segmentation-fault_{context}"]
+= Workaround for OpenShift ADP Controller segmentation fault
+
+If you configure a Data Protection Application (DPA) with both `cloudstorage` and `restic` enabled, the `openshift-adp-controller-manager` pod crashes and restarts indefinitely until the pod fails with a crash loop segmentation fault.
+
+Define either `velero` or `cloudstorage` when you configure a DPA. Otherwise, the `openshift-adp-controller-manager` pod fails with a crash loop segmentation fault due to the following settings:
+
+* If you define both `velero` and `cloudstorage`, the `openshift-adp-controller-manager` pod fails.
+* If you define neither `velero` nor `cloudstorage`, the `openshift-adp-controller-manager` pod fails.
+
+For more information about this issue, see link:https://issues.redhat.com/browse/OADP-1054[OADP-1054].
\ No newline at end of file
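+
+The following `DataProtectionApplication` excerpt is a minimal sketch of a configuration that avoids the fault by defining only the `velero` block for the backup location while keeping Restic enabled through the node agent; the bucket, prefix, and credential names are placeholders:
+
+[source,yaml]
+----
+apiVersion: oadp.openshift.io/v1alpha1
+kind: DataProtectionApplication
+metadata:
+  name: <dpa_sample>
+spec:
+  backupLocations:
+  - velero: <1>
+      provider: aws
+      default: true
+      credential:
+        key: cloud
+        name: cloud-credentials
+      objectStorage:
+        bucket: <bucket_name>
+        prefix: velero
+  configuration:
+    nodeAgent:
+      enable: true
+      uploaderType: restic
+    velero:
+      defaultPlugins:
+      - openshift
+      - aws
+----
+<1> Define the backup location with either a `velero` block or a `cloudstorage` block, not both.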