oci-bv - Timed out waiting for backup to become available #491

Open
martysweet opened this issue Feb 20, 2025 · 2 comments

Comments

@martysweet

Hi,

We are using a VolumeSnapshotClass as below for Block volume snapshotting:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: oci-bv-snapshot-incremental
driver: blockvolume.csi.oraclecloud.com
parameters:
  backupType: incremental # No functional restore difference between full and incremental
deletionPolicy: Delete

This is integrated with CNPG for lower environment database volume snapshots.
Occasionally (every few weeks), we find these backups failing with the error:

DeadlineExceeded desc = Timed out waiting for backup to become available

It looks like this is being thrown by the oci-bv CSI driver here:

return nil, status.Errorf(codes.DeadlineExceeded, "Timed out waiting for backup to become available")

Which uses a timeout of 45 seconds as defined here:

volumeAvailableTimeoutCtx, cancel := context.WithTimeout(ctx, 45 * time.Second)

However, in practice a 45-second timeout is too tight. Looking in the logs, we see the following durations for snapshot creation in uk-london-1, measured from the com.oraclecloud.BlockVolumes.CreateVolumeBackup.begin event to the com.oraclecloud.BlockVolumes.CreateVolumeBackup.end event:

Over 9 samples: average: 37.4 seconds | min: 34 seconds | max: 41 seconds

With a backupPollInterval of 5 seconds, a backup that completes after about 41 seconds may not be observed until the next poll, so the CSI driver steps just outside of the permitted 45-second timeout.
https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L150C36-L150C60

https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L42

I believe the solution would be to increase this timeout to 60 seconds, to align better with the response times we actually see from the API, by changing the following line:

volumeAvailableTimeoutCtx, cancel := context.WithTimeout(ctx, 45 * time.Second)

Thanks!

@ms3rgio

ms3rgio commented Feb 24, 2025

We are facing the same problem in the same situation.

@silvio89

Same problem here, also using CNPG. However, I believe that 60 seconds won't be enough for us. We have an 8 TB disk and a 20 TB disk that take longer than that to become available. None of our attempts at snapshots on this database cluster have been successful. The smaller ones are working fine.
