Skip to content

Conversation

@takoverflow
Copy link
Member

@takoverflow takoverflow commented Oct 10, 2025

What this PR does / why we need it:
This PR adds a sync.Map that consists of list of all machines that are
being processed by the creation flow. This is used to decide whether a
machine that's scheduled for deletion (has its DeletionTimestamp set)
should be processed by the deletion flow. This helps avoid scenarios where
the machine object is removed while a CreateMachine() call is still being
processed by the Cloud provider, leading to orphan (leaked) VMs.

There's still an edge case where a machine is marked for deletion and is
processed by the creation flow but MCM crashes. Upon restart, since the map
would be empty, the machine would go to the termination queue and get removed
all the while the CSP creates a VM for the same removed machine leading to an orphan VM.

Which issue(s) this PR fixes:
Fixes #1035

Special notes for your reviewer:
Gingko tests hamper readability quite a lot, I'd request focusing on
action.testFunc which contains the crux of the functionality being tested.
Open to suggestions for alternative names.

Release note:

Added a safeguard to delay deletion of machines that are undergoing a `Create` Request to prevent orphaning of VMs.

@gardener-robot gardener-robot added needs/review Needs review size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Oct 10, 2025
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 10, 2025
@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else and removed size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Oct 13, 2025
@takoverflow takoverflow marked this pull request as ready for review October 13, 2025 06:57
@takoverflow takoverflow requested a review from a team as a code owner October 13, 2025 06:57
Copy link
Member

@aaronfern aaronfern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks fine. No concerns from me other than some small nits
Please add logs for IT

@takoverflow
Copy link
Member Author

Integration Test logs:

Running Suite: Controller Suite - /Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/test/integration/controller
===============================================================================================================================================
Random Seed: 1760347685

Will run 10 of 10 specs
------------------------------
[BeforeSuite]
/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/test/integration/controller/controller_test.go:47
  > Enter [BeforeSuite] TOP-LEVEL @ 10/13/25 14:58:12.782
  STEP: Checking for the clusters if provided are available @ 10/13/25 14:58:12.783
  2025/10/13 14:58:12 Control cluster kube-config - /Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml
  2025/10/13 14:58:12 Target cluster kube-config  - /Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml
  STEP: Killing any existing processes @ 10/13/25 14:58:15.175
  STEP: Checking Machine-Controller-Manager repo is available at: ../../../dev/mcm @ 10/13/25 14:58:15.565
  STEP: Scaledown existing machine controllers @ 10/13/25 14:58:15.566
  STEP: Starting Machine Controller  @ 10/13/25 14:58:15.762
  STEP: Starting Machine Controller Manager @ 10/13/25 14:58:15.782
  STEP: Cleaning any old resources @ 10/13/25 14:58:15.793
  2025/10/13 14:58:15 machinedeployments.machine.sapcloud.io "test-machine-deployment" not found
  2025/10/13 14:58:16 machines.machine.sapcloud.io "test-machine" not found
  2025/10/13 14:58:16 machineclasses.machine.sapcloud.io "test-mc-v1" not found
  2025/10/13 14:58:16 machineclasses.machine.sapcloud.io "test-mc-v2" not found
  STEP: Setup MachineClass @ 10/13/25 14:58:16.528
  STEP: Looking for machineclass resource in the control cluster @ 10/13/25 14:58:17.859
  STEP: Looking for secrets refered in machineclass in the control cluster @ 10/13/25 14:58:18.05
  STEP: Initializing orphan resource tracker @ 10/13/25 14:58:18.421
  2025/10/13 14:58:22 orphan resource tracker initialized
  < Exit [BeforeSuite] TOP-LEVEL @ 10/13/25 14:58:22.35 (9.569s)
[BeforeSuite] PASSED [9.569 seconds]
------------------------------
Machine controllers test machine resource creation should not lead to any errors and add 1 more node in target cluster
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:649
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 14:58:22.351
  STEP: Checking machineController process is running @ 10/13/25 14:58:22.351
  STEP: Checking machineControllerManager process is running @ 10/13/25 14:58:22.351
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 14:58:22.351
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 14:58:22.978 (627ms)
  > Enter [It] should not lead to any errors and add 1 more node in target cluster @ 10/13/25 14:58:22.978
  STEP: Checking for errors @ 10/13/25 14:58:23.182
  STEP: Waiting until number of ready nodes is 1 more than initial nodes @ 10/13/25 14:58:23.379
  < Exit [It] should not lead to any errors and add 1 more node in target cluster @ 10/13/25 15:00:47.094 (2m24.118s)
• [144.745 seconds]
------------------------------
Machine controllers test machine resource deletion when machines available should not lead to errors and remove 1 node in target cluster
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:678
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:00:47.094
  STEP: Checking machineController process is running @ 10/13/25 15:00:47.094
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:00:47.094
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:00:47.094
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:00:47.508 (414ms)
  > Enter [It] should not lead to errors and remove 1 node in target cluster @ 10/13/25 15:00:47.508
  STEP: Checking for errors @ 10/13/25 15:00:48.372
  STEP: Waiting until test-machine machine object is deleted @ 10/13/25 15:00:48.576
  STEP: Waiting until number of ready nodes is equal to number of initial nodes @ 10/13/25 15:01:01.898
  < Exit [It] should not lead to errors and remove 1 node in target cluster @ 10/13/25 15:01:02.509 (15.002s)
• [15.416 seconds]
------------------------------
Machine controllers test machine resource deletion when machines are not available should keep nodes intact
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:717
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:01:02.51
  STEP: Checking machineController process is running @ 10/13/25 15:01:02.51
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:01:02.51
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:01:02.51
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:01:02.921 (411ms)
  > Enter [It] should keep nodes intact @ 10/13/25 15:01:02.921
  STEP: Skipping as there are machines available and this check can't be performed @ 10/13/25 15:01:03.115
  < Exit [It] should keep nodes intact @ 10/13/25 15:01:03.115 (194ms)
• [0.606 seconds]
------------------------------
Machine controllers test machine deployment resource creation with replicas=0, scale up with replicas=1 should not lead to errors and add 1 more node to target cluster
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:745
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:01:03.115
  STEP: Checking machineController process is running @ 10/13/25 15:01:03.116
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:01:03.116
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:01:03.116
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:01:03.526 (411ms)
  > Enter [It] should not lead to errors and add 1 more node to target cluster @ 10/13/25 15:01:03.526
  STEP: Checking for errors @ 10/13/25 15:01:03.73
  STEP: Waiting for Machine Set to be created @ 10/13/25 15:01:03.92
  STEP: Updating machineDeployment replicas to 1 @ 10/13/25 15:01:06.3
  STEP: Checking if machineDeployment's status has been updated with correct conditions @ 10/13/25 15:01:06.667
  STEP: Checking number of ready nodes==1 @ 10/13/25 15:03:48.744
  STEP: Fetching initial number of machineset freeze events @ 10/13/25 15:03:50.157
  < Exit [It] should not lead to errors and add 1 more node to target cluster @ 10/13/25 15:03:50.744 (2m47.22s)
• [167.631 seconds]
------------------------------
Machine controllers test machine deployment resource scale-up with replicas=6 should not lead to errors and add further 5 nodes to target cluster
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:813
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:03:50.744
  STEP: Checking machineController process is running @ 10/13/25 15:03:50.744
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:03:50.744
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:03:50.744
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:03:51.156 (412ms)
  > Enter [It] should not lead to errors and add further 5 nodes to target cluster @ 10/13/25 15:03:51.156
  STEP: Checking for errors @ 10/13/25 15:03:51.542
  STEP: Checking number of ready nodes are 6 more than initial @ 10/13/25 15:03:51.542
  < Exit [It] should not lead to errors and add further 5 nodes to target cluster @ 10/13/25 15:06:20.99 (2m29.836s)
• [150.248 seconds]
------------------------------
Machine controllers test machine deployment resource scale-down with replicas=2 should not lead to errors and remove 4 nodes from target cluster
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:843
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:06:20.99
  STEP: Checking machineController process is running @ 10/13/25 15:06:20.99
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:06:20.991
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:06:20.991
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:06:21.589 (598ms)
  > Enter [It] should not lead to errors and remove 4 nodes from target cluster @ 10/13/25 15:06:21.589
  STEP: Checking for errors @ 10/13/25 15:06:22.656
  STEP: Checking number of ready nodes are 2 more than initial @ 10/13/25 15:06:22.656
  < Exit [It] should not lead to errors and remove 4 nodes from target cluster @ 10/13/25 15:06:46.963 (25.375s)
• [25.973 seconds]
------------------------------
Machine controllers test machine deployment resource scale-down with replicas=2 should freeze and unfreeze machineset temporarily
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:872
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:06:46.964
  STEP: Checking machineController process is running @ 10/13/25 15:06:46.964
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:06:46.964
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:06:46.964
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:06:47.382 (419ms)
  > Enter [It] should freeze and unfreeze machineset temporarily @ 10/13/25 15:06:47.382
  < Exit [It] should freeze and unfreeze machineset temporarily @ 10/13/25 15:06:47.933 (550ms)
• [0.969 seconds]
------------------------------
Machine controllers test machine deployment resource updation to v2 machine-class and replicas=4 should upgrade machines and add more nodes to target
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:881
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:06:47.933
  STEP: Checking machineController process is running @ 10/13/25 15:06:47.933
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:06:47.933
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:06:47.933
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:06:48.536 (603ms)
  > Enter [It] should upgrade machines and add more nodes to target @ 10/13/25 15:06:48.536
  STEP: Checking for errors @ 10/13/25 15:06:48.905
  STEP: UpdatedReplicas to be 4 @ 10/13/25 15:06:48.905
  STEP: AvailableReplicas to be 4 @ 10/13/25 15:06:55.655
  STEP: Number of ready nodes be 4 more @ 10/13/25 15:09:39.546
  < Exit [It] should upgrade machines and add more nodes to target @ 10/13/25 15:10:21.489 (3m32.956s)
• [213.560 seconds]
------------------------------
Machine controllers test machine deployment resource deletion When there are machine deployment(s) available in control cluster should not lead to errors and list only initial nodes
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:935
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:10:21.49
  STEP: Checking machineController process is running @ 10/13/25 15:10:21.49
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:10:21.49
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:10:21.49
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:10:21.919 (429ms)
  > Enter [It] should not lead to errors and list only initial nodes @ 10/13/25 15:10:21.919
  STEP: Checking for errors @ 10/13/25 15:10:22.102
  STEP: Waiting until number of ready nodes is equal to number of initial  nodes @ 10/13/25 15:10:22.289
  < Exit [It] should not lead to errors and list only initial nodes @ 10/13/25 15:10:56.31 (34.391s)
• [34.821 seconds]
------------------------------
Machine controllers test orphaned resources when the hyperscaler resources are queried should have been deleted
/Users/I752152/go/src/github.com/gardener/machine-controller-manager/pkg/test/integration/common/framework.go:972
  > Enter [BeforeEach] Machine controllers test @ 10/13/25 15:10:56.31
  STEP: Checking machineController process is running @ 10/13/25 15:10:56.31
  STEP: Checking machineControllerManager process is running @ 10/13/25 15:10:56.31
  STEP: Checking nodes in target cluster are healthy @ 10/13/25 15:10:56.31
  < Exit [BeforeEach] Machine controllers test @ 10/13/25 15:10:56.85 (540ms)
  > Enter [It] should have been deleted @ 10/13/25 15:10:56.85
  STEP: Querying and comparing @ 10/13/25 15:10:56.85
  < Exit [It] should have been deleted @ 10/13/25 15:11:00.683 (3.832s)
• [4.373 seconds]
------------------------------
[AfterSuite]
/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/test/integration/controller/controller_test.go:49
  > Enter [AfterSuite] TOP-LEVEL @ 10/13/25 15:11:00.683
  STEP: Running Cleanup @ 10/13/25 15:11:00.683
  2025/10/13 15:11:20 machinedeployments.machine.sapcloud.io "test-machine-deployment" not found
  2025/10/13 15:11:21 machines.machine.sapcloud.io "test-machine" not found
  2025/10/13 15:11:21 deleting test-mc-v1 machineclass
  2025/10/13 15:11:21 machineclass deleted
  2025/10/13 15:11:21 deleting test-mc-v2 machineclass
  2025/10/13 15:11:22 machineclass deleted
  STEP: Killing any existing processes @ 10/13/25 15:11:22.512
  2025/10/13 15:11:22 controller_manager --control-kubeconfig=/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml --target-kubeconfig=/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml --namespace=shoot--i752152--scale-test --safety-up=2 --safety-down=1 --machine-safety-overshooting-period=300ms --leader-elect=false --v=3
  2025/10/13 15:11:22 stopMCM killed MCM process(es) with pid(s): [42398]
  2025/10/13 15:11:22 main --control-kubeconfig=/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml --target-kubeconfig=/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml --namespace=shoot--i752152--scale-test --machine-creation-timeout=20m --machine-drain-timeout=5m --machine-health-timeout=10m --machine-pv-detach-timeout=2m --machine-safety-apiserver-statuscheck-timeout=30s --machine-safety-apiserver-statuscheck-period=1m --machine-safety-orphan-vms-period=30m --leader-elect=false --v=3
  2025/10/13 15:11:22 stopMCM killed MCM process(es) with pid(s): [42705]
  2025/10/13 15:11:22 go run cmd/machine-controller/main.go --control-kubeconfig=/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_control.yaml --target-kubeconfig=/Users/I752152/go/src/github.com/gardener/machine-controller-manager-provider-aws/dev/kube-configs/kubeconfig_target.yaml --namespace=shoot--i752152--scale-test --machine-creation-timeout=20m --machine-drain-timeout=5m --machine-health-timeout=10m --machine-pv-detach-timeout=2m --machine-safety-apiserver-statuscheck-timeout=30s --machine-safety-apiserver-statuscheck-period=1m --machine-safety-orphan-vms-period=30m --leader-elect=false --v=3
  2025/10/13 15:11:22 stopMCM killed MCM process(es) with pid(s): [42320]
  STEP: Scale back the existing machine controllers @ 10/13/25 15:11:23.042
  < Exit [AfterSuite] TOP-LEVEL @ 10/13/25 15:11:23.592 (22.909s)
[AfterSuite] PASSED [22.909 seconds]
------------------------------

Ran 10 of 10 Specs in 790.820 seconds
SUCCESS! -- 10 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 13m18.410612625s
Test Suite Passed
Integration tests completed successfully

Copy link
Member

@aaronfern aaronfern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review needs/second-opinion Needs second review by someone else labels Oct 13, 2025
@aaronfern aaronfern merged commit deae6e5 into gardener:master Oct 15, 2025
12 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 15, 2025
aaronfern pushed a commit to aaronfern/machine-controller-manager that referenced this pull request Oct 15, 2025
aaronfern added a commit that referenced this pull request Oct 15, 2025
* Safeguard to prevent termination of machines part of creation flow (#1036)

* Use fully qualified machine name for pending create map (#1043)

* Use fully qualified machine name for pending create map

* Extract pendingCreationMap operations into helper functions

---------

Co-authored-by: Prashant Tak <prashant.tak@sap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/l Size of pull request is large (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Machine object allowed to be added to the deletion queue when it is being processed by the creation flow

4 participants