diff --git a/docs/proposals/20250425-aro-hcp.md b/docs/proposals/20250425-aro-hcp.md
new file mode 100644
index 00000000000..5585eb2976a
--- /dev/null
+++ b/docs/proposals/20250425-aro-hcp.md
@@ -0,0 +1,664 @@
+---
+title: ARO HCP Clusters
+authors:
+  - "@serngawy"
+reviewers:
+  - ""
+creation-date: 2025-04-23
+last-updated: 2025-04-25
+status: implementable
+---
+
+# Create ARO HCP Clusters
+
+## Table of Contents
+
+- [Introduction](#introduction)
+- [Goals](#goals)
+- [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+- [User Stories](#user-stories)
+- [Implementation Details](#implementation-details)
+- [Risks and Mitigations](#risks-and-mitigations)
+- [Graduation Criteria](#graduation-criteria)
+- [Implementation History](#implementation-history)
+
+## Introduction
+
+The Cluster API Provider for Azure (CAPZ) extends the Kubernetes Cluster API to Microsoft Azure and manages the lifecycle of Kubernetes clusters on Azure. This proposal describes how to integrate Azure Red Hat OpenShift (ARO) Hosted Control Planes (HCP) clusters into CAPZ.
+
+## Goals
+
+- **Integration**: Enable provisioning and management of ARO HCP clusters using CAPZ.
+- **API Conformance**: Align CAPZ operations with ARO HCP API specifications.
+- **Scalability**: Support scaling of ARO HCP clusters through CAPZ controllers.
+- **Operational Excellence**: Provide seamless cluster lifecycle management, including creation, scaling, and deletion.
+
+## Non-Goals
+
+- Support for Azure services that are not part of the ARO HCP API.
+
+## Proposal
+
+CAPZ will introduce controllers and reconcilers to interact with the ARO HCP API endpoints defined in the [ARO HCP OpenAPI Specification](https://github.com/Azure/ARO-HCP/blob/main/api/redhatopenshift/resource-manager/Microsoft.RedHatOpenShift/hcpclusters/preview/2024-06-10-preview/openapi.json).
+This includes:
+
+- **Cluster Creation**: Implementing CAPZ controllers to provision ARO HCP clusters based on user-defined specifications.
+- **Lifecycle Management**: Supporting cluster scaling, updates, and deletion via CAPZ workflows.
+- **Monitoring and Metrics**: Integrating with Azure monitoring services for CAPZ-managed ARO HCP clusters.
+
+Concretely, the new controllers and reconcilers will manage dedicated Custom Resource Definitions (CRDs) that represent ARO HCP clusters. They will reconcile each CRD's desired state by calling the ARO HCP API endpoints defined in the [ARO HCP OpenAPI Specification](https://github.com/Azure/ARO-HCP/blob/main/api/redhatopenshift/resource-manager/Microsoft.RedHatOpenShift/hcpclusters/preview/2024-06-10-preview/openapi.json). This approach enables declarative management of ARO HCP clusters through Kubernetes-native workflows.
+
+Key capabilities include:
+
+- **Cluster Provisioning**: Users define ARO HCP clusters declaratively via the CRDs. CAPZ controllers interpret these definitions and provision clusters accordingly through the ARO HCP API.
+
+- **Lifecycle Management**: The controllers handle scaling, upgrades, and deletion of ARO HCP clusters by reconciling changes in the CRDs with the ARO HCP API.
+
+- **Monitoring and Metrics**: Integration with Azure monitoring services provides observability for CAPZ-managed ARO HCP clusters.
+
+To control the rollout of this functionality, a new feature gate named `ARO-HCP` will be introduced. This feature gate lets users enable or disable the ARO HCP integration within CAPZ, which supports gradual adoption and testing.
+
+## User Stories
+
+- As a Kubernetes operator, I want to deploy an ARO HCP cluster using familiar Kubernetes APIs via CAPZ.
+- As a platform engineer, I need automated scaling and management capabilities for ARO HCP clusters using CAPZ controllers.
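The `ARO-HCP` feature gate described in the Proposal section could gate controller registration roughly as follows. This is a minimal, standard-library-only sketch: the `Gates` type, `DefaultGates`, and the `Name=bool` flag syntax are hypothetical stand-ins for CAPZ's real feature package, and the disabled-by-default value is an assumption rather than something the proposal states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// AROHCP is the gate name used in this proposal. Everything else in this
// file is a hypothetical illustration, not CAPZ's actual feature package.
const AROHCP = "ARO-HCP"

// Gates maps a feature name to its enabled state.
type Gates map[string]bool

// DefaultGates returns the known gates with their defaults
// (assumption: the new gate starts disabled).
func DefaultGates() Gates {
	return Gates{AROHCP: false}
}

// Set parses a comma-separated "Name=bool" list, in the style of a
// --feature-gates flag, and rejects unknown gate names.
func (g Gates) Set(value string) error {
	for _, pair := range strings.Split(value, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		name, raw, ok := strings.Cut(pair, "=")
		if !ok {
			return fmt.Errorf("missing bool value for %q", pair)
		}
		name = strings.TrimSpace(name)
		if _, known := g[name]; !known {
			return fmt.Errorf("unknown feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(raw))
		if err != nil {
			return fmt.Errorf("invalid value for %q: %w", name, err)
		}
		g[name] = enabled
	}
	return nil
}

// Enabled reports whether the named gate is on.
func (g Gates) Enabled(name string) bool { return g[name] }

func main() {
	gates := DefaultGates()
	if err := gates.Set("ARO-HCP=true"); err != nil {
		fmt.Println("error:", err)
		return
	}
	if gates.Enabled(AROHCP) {
		fmt.Println("registering ARO HCP controllers")
	} else {
		fmt.Println("ARO HCP controllers disabled")
	}
}
```

Checking the gate once at manager start-up, before any ARO HCP controller is registered, keeps existing CAPZ installations unaffected until a user opts in.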
+
+## Implementation Details
+
+### Controller Architecture:
+
+Develop CAPZ controllers for the ARO HCP resource types specified in the API. To support ARO HCP clusters within CAPZ, three new Custom Resource Definitions (CRDs) and their corresponding controllers will be introduced:
+
+- **AROControlPlane CRD (HcpOpenShiftCluster)**: Represents the desired state of an ARO HCP cluster's control plane, based on the ARO HCP OpenAPI Specification type [HcpOpenShiftCluster](https://github.com/Azure/ARO-HCP/blob/main/internal/api/hcpopenshiftcluster.go#L22).
+
+```go
+// AROControlPlane is the Schema for the AROControlPlane API.
+type AROControlPlane struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	Spec   AROControlPlaneSpec   `json:"spec,omitempty"`
+	Status AROControlPlaneStatus `json:"status,omitempty"`
+}
+
+// AROControlPlaneSpec defines the desired state of AROControlPlane.
+type AROControlPlaneSpec struct {
+	// AroClusterName is the ARO cluster name. It must be a valid DNS-1035 label.
+	AroClusterName string `json:"aroClusterName"`
+
+	// Platform represents the Azure platform configuration.
+	Platform PlatformProfile `json:"platform,omitempty"`
+
+	// Visibility represents the visibility of an API endpoint. Allowed values are public and private; the default is public.
+	Visibility string `json:"visibility,omitempty"`
+
+	// Network config for the ARO HCP cluster.
+	Network *NetworkSpec `json:"network,omitempty"`
+
+	// DomainPrefix is an optional prefix added to the cluster's domain name.
+	DomainPrefix string `json:"domainPrefix,omitempty"`
+
+	// Version is the OpenShift semantic version, for example "4.20.0".
+	Version string `json:"version"`
+
+	// ChannelGroup is the OpenShift version channel group. Allowed values are stable, candidate and nightly; the default is stable.
+	ChannelGroup string `json:"channelGroup"`
+
+	// VersionGate requires acknowledgment when upgrading ARO-HCP y-stream versions (e.g., from 4.20 to 4.21).
+	// Allowed values are Acknowledge, WaitForAcknowledge and AlwaysAcknowledge; the default is WaitForAcknowledge.
+	VersionGate string `json:"versionGate"`
+
+	// IdentityRef is a reference to an identity to be used when reconciling the ARO control plane.
+	// If no identity is specified, the default identity for this controller will be used.
+	IdentityRef *corev1.ObjectReference `json:"identityRef,omitempty"`
+
+	// AdditionalTags are user-defined tags to be added to the Azure resources associated with the control plane.
+	AdditionalTags infrav1.Tags `json:"additionalTags,omitempty"`
+}
+
+// PlatformProfile represents the Azure platform configuration.
+type PlatformProfile struct {
+	// Location must be a valid Azure location, for example centralus.
+	Location string `json:"location,omitempty"`
+
+	// ResourceGroup is the name of the resource group that the ARO HCP cluster is attached to.
+	ResourceGroup string `json:"resourceGroup,omitempty"`
+
+	// ResourceGroupRef is the name that is used to create the ResourceGroup CR. The ResourceGroupRef must be in the same namespace as the AROControlPlane.
+	ResourceGroupRef string `json:"resourceGroupRef,omitempty"`
+
+	// Subnet is the Azure subnet ID.
+	Subnet string `json:"subnet,omitempty"`
+
+	// SubnetRef is the name that is used to create the VirtualNetworksSubnet CR. The SubnetRef must be in the same namespace as the AROControlPlane and cannot be set together with Subnet.
+	SubnetRef string `json:"subnetRef,omitempty"`
+
+	// OutboundType represents a routing strategy to provide egress to the Internet. The only allowed value is loadBalancer.
+	OutboundType string `json:"outboundType,omitempty"`
+
+	// NetworkSecurityGroupID is the Azure Network Security Group ID.
+	NetworkSecurityGroupID string `json:"networkSecurityGroupId,omitempty"`
+
+	// ManagedIdentities holds the Azure managed identities for ARO HCP.
+	ManagedIdentities ManagedIdentities `json:"managedIdentities,omitempty"`
+}
+
+type ManagedIdentities struct {
+	// CreateAROHCPManagedIdentities is used to create the required ARO-HCP managed identities if not provided.
+	// It will create UserAssignedIdentity CR for each required managed identity. Default is false.
+	CreateAROHCPManagedIdentities bool `json:"createAROHCPManagedIdentities,omitempty"`
+
+	// ControlPlaneOperators Ref to Microsoft.ManagedIdentity/userAssignedIdentities
+	ControlPlaneOperators *ControlPlaneOperators `json:"controlPlaneOperators,omitempty"`
+
+	// DataPlaneOperators ref to Microsoft.ManagedIdentity/userAssignedIdentities
+	DataPlaneOperators *DataPlaneOperators `json:"dataPlaneOperators,omitempty"`
+
+	// ServiceManagedIdentity ref to Microsoft.ManagedIdentity/userAssignedIdentities
+	ServiceManagedIdentity string `json:"serviceManagedIdentity,omitempty"`
+}
+
+type ControlPlaneOperators struct {
+	// ControlPlaneManagedIdentities "control-plane" Microsoft.ManagedIdentity/userAssignedIdentities
+	ControlPlaneManagedIdentities string `json:"controlPlaneOperatorsManagedIdentities,omitempty"`
+
+	// ClusterAPIAzureManagedIdentities "cluster-api-azure" Microsoft.ManagedIdentity/userAssignedIdentities
+	ClusterAPIAzureManagedIdentities string `json:"clusterAPIAzureManagedIdentities,omitempty"`
+
+	// CloudControllerManagerManagedIdentities "cloud-controller-manager" Microsoft.ManagedIdentity/userAssignedIdentities
+	CloudControllerManagerManagedIdentities string `json:"cloudControllerManager,omitempty"`
+
+	// IngressManagedIdentities "ingress" Microsoft.ManagedIdentity/userAssignedIdentities
+	IngressManagedIdentities string `json:"ingressManagedIdentities,omitempty"`
+
+	// DiskCsiDriverManagedIdentities "disk-csi-driver" Microsoft.ManagedIdentity/userAssignedIdentities
+	DiskCsiDriverManagedIdentities string `json:"diskCsiDriverManagedIdentities,omitempty"`
+
+	// FileCsiDriverManagedIdentities "file-csi-driver"
+	// Microsoft.ManagedIdentity/userAssignedIdentities
+	FileCsiDriverManagedIdentities string `json:"fileCsiDriverManagedIdentities,omitempty"`
+
+	// ImageRegistryManagedIdentities "image-registry" Microsoft.ManagedIdentity/userAssignedIdentities
+	ImageRegistryManagedIdentities string `json:"imageRegistryManagedIdentities,omitempty"`
+
+	// CloudNetworkConfigManagedIdentities "cloud-network-config" Microsoft.ManagedIdentity/userAssignedIdentities
+	CloudNetworkConfigManagedIdentities string `json:"cloudNetworkConfigManagedIdentities,omitempty"`
+
+	// KmsManagedIdentities "kms" Microsoft.ManagedIdentity/userAssignedIdentities
+	KmsManagedIdentities string `json:"kmsManagedIdentities,omitempty"`
+}
+
+type DataPlaneOperators struct {
+	// DiskCsiDriverManagedIdentities "disk-csi-driver" Microsoft.ManagedIdentity/userAssignedIdentities
+	DiskCsiDriverManagedIdentities string `json:"diskCsiDriverManagedIdentities,omitempty"`
+
+	// FileCsiDriverManagedIdentities "file-csi-driver" Microsoft.ManagedIdentity/userAssignedIdentities
+	FileCsiDriverManagedIdentities string `json:"fileCsiDriverManagedIdentities,omitempty"`
+
+	// ImageRegistryManagedIdentities "image-registry" Microsoft.ManagedIdentity/userAssignedIdentities
+	ImageRegistryManagedIdentities string `json:"imageRegistryManagedIdentities,omitempty"`
+}
+
+type NetworkSpec struct {
+	// IP addresses block used by OpenShift while installing the cluster, for example "10.0.0.0/16".
+	MachineCIDR string `json:"machineCIDR,omitempty"`
+
+	// IP address block from which to assign pod IP addresses, for example `10.128.0.0/14`.
+	PodCIDR string `json:"podCIDR,omitempty"`
+
+	// IP address block from which to assign service IP addresses, for example `172.30.0.0/16`.
+	ServiceCIDR string `json:"serviceCIDR,omitempty"`
+
+	// Network host prefix which is defaulted to `23` if not specified.
+	HostPrefix int `json:"hostPrefix,omitempty"`
+
+	// The CNI network type default is OVNKubernetes.
+	// Allowed values are OVNKubernetes and Other.
+	NetworkType string `json:"networkType,omitempty"`
+}
+
+// AROControlPlaneStatus defines the observed state of AROControlPlane.
+type AROControlPlaneStatus struct {
+	// Initialized indicates whether or not the control plane has initialized.
+	Initialized bool `json:"initialized"`
+
+	// Ready indicates that the AROControlPlane API server is ready to receive requests.
+	Ready bool `json:"ready"`
+
+	// FailureMessage will be set in the event that there is a terminal problem.
+	FailureMessage *string `json:"failureMessage,omitempty"`
+
+	// Conditions specifies the conditions for the managed control plane.
+	Conditions clusterv1.Conditions `json:"conditions,omitempty"`
+
+	// ID is the cluster ID given by ARO-HCP.
+	ID string `json:"id,omitempty"`
+
+	// ConsoleURL is the URL of the ARO-HCP OpenShift console.
+	ConsoleURL string `json:"consoleURL,omitempty"`
+
+	// APIURL is the URL of the ARO-HCP OpenShift cluster API endpoint.
+	APIURL string `json:"apiURL,omitempty"`
+
+	// Version is the ARO-HCP OpenShift semantic version, for example "4.20.0".
+	Version string `json:"version"`
+
+	// Available upgrades for the ARO hosted control plane.
+	AvailableUpgrades []string `json:"availableUpgrades,omitempty"`
+}
+```
+
+An example `AROControlPlane` CR:
+
+```yaml
+apiVersion: controlplane.cluster.x-k8s.io/v1beta2
+kind: AROControlPlane
+metadata:
+  name: aro-control-plane
+  namespace: default
+spec:
+  aroClusterName: aro-cluster-01
+  platform:
+    location: eastus
+    resourceGroup: "dev-group"
+    subnet: "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aro-vnet-rg/providers/Microsoft.Network/virtualNetworks/aro-vnet/subnets/aro-subnet"
+    outboundType: loadBalancer
+    networkSecurityGroupId: "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aro-nsg-rg/providers/Microsoft.Network/networkSecurityGroups/aro-nsg"
+    managedIdentities:
+      controlPlaneOperators: {}
+      dataPlaneOperators: {}
+      serviceManagedIdentity: "/subscriptions/.../userAssignedIdentities/service-msi"
+  visibility: public
+  network:
+    machineCIDR: "10.0.0.0/16"
+    podCIDR: "10.128.0.0/14"
+    serviceCIDR: "172.30.0.0/16"
+    hostPrefix: 23
+    networkType: OVNKubernetes
+  domainPrefix: aro-cluster
+  version: "4.20.0"
+  channelGroup: stable
+  versionGate: WaitForAcknowledge
+  identityRef:
+    kind: AzureClusterIdentity
+    name: aro-identity
+    namespace: default
+  additionalTags:
+    environment: production
+    owner: sre-team
+status:
+  initialized: true
+  ready: true
+  failureMessage: ""
+  conditions:
+    - type: Ready
+      status: "True"
+      reason: ClusterReady
+      message: ARO Control Plane is ready
+      lastTransitionTime: "2025-04-28T10:00:00Z"
+  id: aro-hcp-123456
+  consoleURL: "https://console-openshift-console.apps.aro-cluster.example.com"
+  apiURL: "https://api.aro-cluster.example.com:6443"
+  version: "4.20.0"
+  availableUpgrades:
+    - "4.20.1"
+    - "4.21.0"
+    - "4.22.0"
+```
+
+- **AROMachinePool CRD (NodePool)**: Represents the desired state of the worker nodes (compute node pools) associated with a given `AROControlPlane`.
+  The CRD spec is based on the ARO HCP OpenAPI Specification type [HCPOpenShiftClusterNodePool](https://github.com/Azure/ARO-HCP/blob/main/internal/api/hcpopenshiftclusternodepool.go#L23).
+
+```go
+// AROMachinePool is the Schema for the AROMachinePool API.
+type AROMachinePool struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	Spec   AROMachinePoolSpec   `json:"spec,omitempty"`
+	Status AROMachinePoolStatus `json:"status,omitempty"`
+}
+
+// AROMachinePoolSpec defines the desired spec of AROMachinePool.
+type AROMachinePoolSpec struct {
+	// NodePoolName is the node pool name. It must be a valid DNS-1035 label.
+	NodePoolName string `json:"nodePoolName,omitempty"`
+
+	// Version specifies the OpenShift semantic version of the nodes associated with this machine pool, for example "4.20.0".
+	Version string `json:"version,omitempty"`
+
+	// Platform represents the NodePool Azure platform configuration.
+	Platform PlatformProfile `json:"platform,omitempty"`
+
+	// Labels specifies labels for the Kubernetes node objects.
+	Labels map[string]string `json:"labels,omitempty"`
+
+	// Taints specifies the taints to apply to the nodes of the machine pool.
+	Taints []AROTaint `json:"taints,omitempty"`
+
+	// AdditionalTags are user-defined tags to be added to the nodePool Azure resources.
+	AdditionalTags infrav1.Tags `json:"additionalTags,omitempty"`
+
+	// AutoRepair specifies whether health checks should be enabled for machines.
+	AutoRepair bool `json:"autoRepair,omitempty"`
+
+	// Autoscaling specifies auto scaling behaviour for this MachinePool.
+	Autoscaling *AROMachinePoolAutoScaling `json:"autoscaling,omitempty"`
+}
+
+// PlatformProfile represents the Azure platform configuration.
+type PlatformProfile struct {
+	// Subnet is the Azure subnet ID.
+	Subnet string `json:"subnet,omitempty"`
+
+	// SubnetRef is the name that is used to create the VirtualNetworksSubnet CR.
+	// The SubnetRef must be in the same namespace as the AROMachinePool and cannot be set together with Subnet.
+	SubnetRef string `json:"subnetRef,omitempty"`
+
+	// VMSize sets the Azure VM size for the nodes, for example Standard_D4s_v3.
+	VMSize string `json:"vmSize,omitempty"`
+
+	// DiskSizeGiB sets the disk volume size for the machine pool, in GiB.
+	DiskSizeGiB int32 `json:"diskSizeGiB,omitempty"`
+
+	// DiskStorageAccountType represents supported Azure storage account types.
+	// Available values are Premium_LRS, StandardSSD_LRS and Standard_LRS,
+	// enforced with kubebuilder:validation:Enum=Premium_LRS;StandardSSD_LRS;Standard_LRS.
+	DiskStorageAccountType string `json:"diskStorageAccountType,omitempty"`
+
+	// AvailabilityZone specifies the availability zone where instances of this machine pool should run.
+	AvailabilityZone string `json:"availabilityZone,omitempty"`
+}
+
+// AROTaint represents a taint to be applied to a node.
+type AROTaint struct {
+	// The taint key to be applied to a node.
+	Key string `json:"key"`
+
+	// The taint value corresponding to the taint key.
+	Value string `json:"value,omitempty"`
+
+	// The effect of the taint on pods that do not tolerate the taint.
+	// Valid effect values are NoSchedule, PreferNoSchedule and NoExecute.
+	Effect corev1.TaintEffect `json:"effect"`
+}
+
+// AROMachinePoolAutoScaling specifies scaling options.
+type AROMachinePoolAutoScaling struct {
+	// MinReplicas for the nodePool nodes. Must not be greater than MaxReplicas.
+	MinReplicas int `json:"minReplicas,omitempty"`
+
+	// MaxReplicas for the nodePool nodes. Must not be less than MinReplicas.
+	MaxReplicas int `json:"maxReplicas,omitempty"`
+}
+
+// AROMachinePoolStatus defines the observed state of AROMachinePool.
+type AROMachinePoolStatus struct {
+	// Conditions specifies the conditions for the machine pool.
+	Conditions clusterv1.Conditions `json:"conditions,omitempty"`
+
+	// ID is the NodePool ID given by ARO-HCP.
+	ID string `json:"id,omitempty"`
+
+	// ProvisioningState represents the asynchronous provisioning state of an ARM resource.
+	// Allowed values are Succeeded, Failed, Canceled, Accepted, Deleting, Provisioning and Updating.
+	ProvisioningState string `json:"provisioningState,omitempty"`
+
+	// Ready indicates that the AROMachinePool (nodePool) has joined the ARO-HCP cluster and is ready to run workloads.
+	Ready bool `json:"ready"`
+
+	// FailureMessage will be set in the event that there is a terminal problem.
+	FailureMessage *string `json:"failureMessage,omitempty"`
+
+	// Replicas is the most recently observed number of replicas.
+	Replicas int32 `json:"replicas"`
+
+	// Version is the ARO-HCP OpenShift semantic version, for example "4.20.0".
+	Version string `json:"version"`
+
+	// AvailableUpgrades lists the available upgrades for the ARO MachinePool.
+	AvailableUpgrades []string `json:"availableUpgrades,omitempty"`
+}
+```
+
+An example `AROMachinePool` CR:
+
+```yaml
+apiVersion: infrastructure.openshift.io/v1beta1
+kind: AROMachinePool
+metadata:
+  name: example-aromachinepool
+  namespace: default
+spec:
+  nodePoolName: worker-eastus
+  version: "4.20.0"
+  platform:
+    subnet: "subnets/worker-subnet"
+    vmSize: "Standard_D4s_v3"
+    diskSizeGiB: 128
+    diskStorageAccountType: "Premium_LRS"
+    availabilityZone: "zone-1"
+  labels:
+    node-role.kubernetes.io/worker: ""
+    region: eastus
+  taints:
+    - key: "example.com/special"
+      value: "true"
+      effect: "NoSchedule"
+  additionalTags:
+    environment: production
+    cost-center: engineering
+  autoRepair: true
+  autoscaling:
+    minReplicas: 3
+    maxReplicas: 6
+status:
+  conditions:
+    - type: Ready
+      status: "True"
+      reason: MachinesReady
+      message: All machines in the pool are available
+      lastTransitionTime: "2025-04-28T12:00:00Z"
+  id: "nodePools/worker-eastus"
+  provisioningState: "Succeeded"
+  ready: true
+  replicas: 5
+  version: "4.20.0"
+  availableUpgrades:
+    - "4.20.1"
+    - "4.21.0"
+    - "4.22.0"
+```
+
+- **AROCluster CRD**:
+  Represents the desired state of the ARO managed cluster associated with the given `AROControlPlane` and `AROMachinePool`.
+
+```go
+// AROCluster is the Schema for the AROClusters API.
+type AROCluster struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	Spec   AROClusterSpec   `json:"spec,omitempty"`
+	Status AROClusterStatus `json:"status,omitempty"`
+}
+
+// AROClusterSpec defines the desired spec of AROCluster.
+type AROClusterSpec struct {
+	// ControlPlaneEndpoint represents the endpoint used to communicate with the control plane.
+	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint"`
+}
+
+// AROClusterStatus defines the observed state of AROCluster.
+type AROClusterStatus struct {
+	// Ready is true when the AROControlPlane has an API server URL.
+	Ready bool `json:"ready,omitempty"`
+
+	// FailureDomains specifies a list of failure domains associated with the ARO cluster.
+	FailureDomains clusterv1.FailureDomains `json:"failureDomains,omitempty"`
+
+	// Conditions define the current service state of the AROCluster.
+	Conditions clusterv1.Conditions `json:"conditions,omitempty"`
+}
+```
+
+An example `AROCluster` CR:
+
+```yaml
+apiVersion: infrastructure.openshift.io/v1beta1
+kind: AROCluster
+metadata:
+  name: example-arocluster
+  namespace: default
+  labels:
+    cluster.x-k8s.io/cluster-name: aro-cluster-01
+spec:
+  controlPlaneEndpoint:
+    host: api.example-arocluster.example.com
+    port: 6443
+status:
+  ready: true
+  conditions:
+    - type: Ready
+      status: "True"
+      reason: ControlPlaneReady
+      message: ARO Cluster control plane is up and running
+      lastTransitionTime: "2025-04-28T12:00:00Z"
+```
+
+Each controller has the following responsibilities:
+
+- **AROControlPlane Controller**:
+  - Watches `AROControlPlane` resources.
+  - Reconciles the desired state by calling the Azure ARO HCP API to:
+    - Create/update/delete the corresponding HCP cluster on Azure.
+    - Manage cluster-wide settings like networking, identities, ingress, and API server visibility.
+  - Updates the `status` field of the `AROControlPlane` resource with current cluster health and endpoint information.
+
+- **AROMachinePool Controller**:
+  - Watches `AROMachinePool` resources.
+  - Reconciles the desired state by calling the Azure ARO HCP API to:
+    - Create/update/delete node pools (compute plane) associated with a specific `AROControlPlane` cluster.
+    - Manage node counts, VM sizes, disk settings, scaling configurations, and availability zones.
+  - Updates the `status` field of the `AROMachinePool` resource with node pool health and scaling status.
+
+- **AROCluster Controller**:
+  - Watches `AROCluster` resources.
+  - Reconciles the desired state of the ARO cluster so that it reflects the status of the associated `AROControlPlane` and `AROMachinePool` resources:
+    - Create/update/delete the ARO cluster associated with its `AROControlPlane` and `AROMachinePools`.
+  - Updates the `status` field of the `AROCluster` resource to reflect the health and scaling status of its `AROControlPlane`.
+
+#### Syncing Between Controllers
+
+- **Ownership and References**:
+  - Each `AROMachinePool` must reference an existing `AROControlPlane` via `spec.clusterName`.
+  - The `AROMachinePool` controller will ensure that the corresponding `AROControlPlane` exists and is in a "Ready" state before proceeding to manage the NodePool resource.
+  - Ownership will be expressed through Kubernetes `ownerReferences` metadata to get automatic cascading deletes (i.e., when an `AROControlPlane` is deleted, its associated `AROMachinePools` are automatically cleaned up).
+
+- **Reconciliation Ordering**:
+  - The `AROControlPlane` controller must reconcile and reach a "Ready" state before `AROMachinePool` resources are reconciled.
+  - The `AROMachinePool` controller will watch for events or status changes in the referenced `AROControlPlane` and only proceed when appropriate.
+
+- **Status Propagation**:
+  - The `AROMachinePool` controller will propagate partial status information (such as NodePool health) back to the parent `AROControlPlane` if needed, enabling centralized cluster health reporting.
+
+- **Error Handling and Retries**:
+  - All controllers will implement retry mechanisms and handle transient Azure API errors (e.g., rate limiting, network issues) gracefully.
+
+This architecture ensures a clean separation of concerns between managing the control plane and compute resources while maintaining a strong, validated linkage between the two layers.
+
+#### ARO-HCP Feature Gate
+
+A new [feature gate](https://github.com/serngawy/cluster-api-provider-azure/blob/aro-hcp-Proposal/feature/feature.go) `ARO-HCP` will be introduced to enable or disable the ARO-HCP controllers.
+
+### Azure Resource Management:
+
+Leverage the [ARO-HCP Azure SDKs](https://github.com/Azure/ARO-HCP/tree/main/internal/api) to securely communicate with the ARO HCP APIs, enabling the creation, updating, and deletion of ARO-HCP resources within the Azure infrastructure.
+
+### Validation and Testing:
+
+Validate CAPZ operations against ARO HCP API compliance and Kubernetes integration tests.
+
+## Risks and Mitigations
+
+- **API Changes**: Monitor ARO HCP API updates and adapt CAPZ controllers accordingly.
+- **Security**: Implement secure communication protocols between CAPZ and ARO HCP endpoints.
+
+## Graduation Criteria
+
+- Successful provisioning and scaling of ARO HCP clusters using CAPZ in a test environment.
+- Validation of CAPZ-managed ARO HCP clusters against Azure and Kubernetes integration benchmarks.
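As one concrete target for the validation work mentioned above, the field constraints stated in the `AROMachinePool` comments (a DNS-1035 node pool name, `Subnet` and `SubnetRef` being mutually exclusive, and `MinReplicas` not exceeding `MaxReplicas`) could be checked before any ARO HCP API call is made. The sketch below is illustrative only: `poolInput` and `validatePool` are hypothetical names, and the proposal does not say whether such checks would live in an admission webhook or in the controller.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1035Label matches a DNS-1035 label: lowercase alphanumerics or '-',
// starting with a letter and ending with an alphanumeric character.
var dns1035Label = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)

// poolInput is a hypothetical, trimmed-down view of AROMachinePoolSpec
// that carries only the fields checked here.
type poolInput struct {
	NodePoolName string
	Subnet       string
	SubnetRef    string
	MinReplicas  int
	MaxReplicas  int
}

// validatePool returns one message per violated constraint. An empty
// result means the input passed every check.
func validatePool(in poolInput) []string {
	var problems []string
	// DNS-1035 labels are limited to 63 characters.
	if len(in.NodePoolName) > 63 || !dns1035Label.MatchString(in.NodePoolName) {
		problems = append(problems, fmt.Sprintf("nodePoolName %q is not a valid DNS-1035 label", in.NodePoolName))
	}
	if in.Subnet != "" && in.SubnetRef != "" {
		problems = append(problems, "subnet and subnetRef cannot both be set")
	}
	if in.MinReplicas > in.MaxReplicas {
		problems = append(problems, fmt.Sprintf("minReplicas (%d) is greater than maxReplicas (%d)", in.MinReplicas, in.MaxReplicas))
	}
	return problems
}

func main() {
	bad := poolInput{NodePoolName: "Worker_1", Subnet: "a", SubnetRef: "b", MinReplicas: 6, MaxReplicas: 3}
	for _, p := range validatePool(bad) {
		fmt.Println(p)
	}
}
```

Returning every violation at once, rather than stopping at the first, gives users a complete list of fixes in a single round trip.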