
K8SPSMDB-1211: handle FULL CLUSTER CRASH error during the restore #1926


Draft
pooknull wants to merge 12 commits into main
Conversation

@pooknull (Contributor) commented May 16, 2025

K8SPSMDB-1211

https://perconadev.atlassian.net/browse/K8SPSMDB-1211

DESCRIPTION

Problem:
During a physical restore, the operator detects a FULL CLUSTER CRASH and attempts to resolve it. The operator log then contains the FULL CLUSTER CRASH message, which should not be logged in this case because the condition occurs every time during a physical restore and is expected.

Solution:
After the physical restore, perform the same action the (*ReconcilePerconaServerMongoDB) handleReplicaSetNoPrimary method does. Once PBM has finished the restore, the operator should recreate the statefulsets, add the percona.com/restore-in-progress annotation to them, and handle the FULL CLUSTER CRASH state. Afterwards, the percona.com/restore-in-progress annotation should be removed from the statefulsets.
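
For illustration, a minimal sketch of the annotation handling with controller-runtime; the helper name setRestoreInProgress and its signature are hypothetical, not the actual implementation:

```go
package restore

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Annotation name taken from the PR description.
const restoreInProgressAnnotation = "percona.com/restore-in-progress"

// setRestoreInProgress is a hypothetical helper (not the operator's actual
// code): it adds or removes the restore-in-progress annotation on a
// statefulset so the reconciler can tell that a physical restore is still
// being finalized.
func setRestoreInProgress(ctx context.Context, cl client.Client, nn types.NamespacedName, inProgress bool) error {
	sts := &appsv1.StatefulSet{}
	if err := cl.Get(ctx, nn, sts); err != nil {
		return err
	}

	orig := sts.DeepCopy()
	if sts.Annotations == nil {
		sts.Annotations = map[string]string{}
	}
	if inProgress {
		sts.Annotations[restoreInProgressAnnotation] = "true"
	} else {
		delete(sts.Annotations, restoreInProgressAnnotation)
	}

	// Patch rather than Update to avoid clobbering concurrent changes to the statefulset.
	return cl.Patch(ctx, sts, client.MergeFrom(orig))
}
```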

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label May 16, 2025
@pull-request-size pull-request-size bot added size/XL 500-999 lines and removed size/XXL 1000+ lines labels May 19, 2025
@@ -0,0 +1,70 @@
package common
Contributor

I think packages named common, utils, etc., tend to be vague, as they imply shared logic without a clearly defined domain or separation of concerns.

In this file, the main struct is CommonReconciler, but it's not clear what exactly is being reconciled. The struct also mixes responsibilities, constructing and returning heterogeneous components such as backup.PBM, mongo.Client, a scheme, and a k8s client.

To improve clarity and maintainability, I'd suggest the following (a rough sketch follows this list):

  • Keeping the scheme and the Kubernetes client in ReconcilePerconaServerMongoDB, and having the related functions use receivers of type ReconcilePerconaServerMongoDB.

  • Splitting out PBM-related logic into a dedicated PBM factory/service.

  • Doing the same for the MongoClientProvider.
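
Roughly what I have in mind; the interface names (PBMFactory, MongoClientProvider), method sets, and the placeholder Cluster type are illustrative only, not the operator's real signatures:

```go
package reconciler

import (
	"context"

	"go.mongodb.org/mongo-driver/mongo"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Cluster stands in for the PerconaServerMongoDB custom resource; the
// operator's concrete type would be used instead.
type Cluster interface{}

// PBM is a narrow, illustrative view of what the reconciler needs from
// backup.PBM; the real method set is larger.
type PBM interface {
	Close(ctx context.Context) error
}

// PBMFactory is the dedicated PBM factory/service suggested above.
type PBMFactory interface {
	NewPBM(ctx context.Context, cluster Cluster) (PBM, error)
}

// MongoClientProvider hands out MongoDB connections for a cluster instead of
// having a "common" struct construct them.
type MongoClientProvider interface {
	Mongo(ctx context.Context, cluster Cluster, replset string) (*mongo.Client, error)
}

// ReconcilePerconaServerMongoDB keeps the scheme and the Kubernetes client,
// and depends only on the small interfaces above.
type ReconcilePerconaServerMongoDB struct {
	cl     client.Client
	scheme *runtime.Scheme

	pbmFactory    PBMFactory
	mongoProvider MongoClientProvider
}
```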

@JNKPercona (Collaborator)

Test name Status
arbiter passed
balancer passed
cross-site-sharded passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
data-at-rest-encryption failure
data-sharded passed
demand-backup passed
demand-backup-eks-credentials-irsa passed
demand-backup-fs passed
demand-backup-incremental failure
demand-backup-incremental-sharded failure
demand-backup-physical failure
demand-backup-physical-sharded failure
demand-backup-sharded passed
expose-sharded failure
finalizer passed
ignore-labels-annotations passed
init-deploy passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
monitoring-pmm3 passed
multi-cluster-service passed
multi-storage passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-physical failure
pitr-sharded passed
pitr-physical-backup-source passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We ran 60 out of 60 tests.

commit: 2614d85
image: perconalab/percona-server-mongodb-operator:PR-1926-2614d852
