Skip to content

Azure Stack Support #5532

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions api/v1beta1/types_class.go
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ type AzureClusterClassSpec struct {
// - GermanCloud: "AzureGermanCloud"
// - PublicCloud: "AzurePublicCloud"
// - USGovernmentCloud: "AzureUSGovernmentCloud"
// - StackCloud: "AzureStackCloud"
//
// Note that values other than the default must also be accompanied by corresponding changes to the
// aso-controller-settings Secret to configure ASO to refer to the non-Public cloud. ASO currently does
Expand Down Expand Up @@ -186,6 +187,7 @@ type AzureManagedControlPlaneClassSpec struct {
// - PublicCloud: "AzurePublicCloud"
// - USGovernmentCloud: "AzureUSGovernmentCloud"
//
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intended to add AzureStack to the comment here as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally did not add it here, because I did not think there would be support for Azure Stack with managed control planes. Happy to discuss more, but I don't have a lot of background knowledge on this.

// Note that values other than the default must also be accompanied by corresponding changes to the
// aso-controller-settings Secret to configure ASO to refer to the non-Public cloud. ASO currently does
// not support referring to multiple different clouds in a single installation. The following fields must
Expand Down
28 changes: 28 additions & 0 deletions azure/defaults.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ import (
"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
"github.com/Azure/azure-sdk-for-go/sdk/tracing/azotel"
azureautorest "github.com/Azure/go-autorest/autorest/azure"
"go.opentelemetry.io/otel"

"sigs.k8s.io/cluster-api-provider-azure/util/tele"
Expand All @@ -44,6 +45,8 @@ const (
ChinaCloudName = "AzureChinaCloud"
// USGovernmentCloudName is the name of the Azure US Government cloud.
USGovernmentCloudName = "AzureUSGovernmentCloud"
// StackCloudName is the name for Azure Stack hybrid cloud environments.
StackCloudName = "AzureStackCloud"
)

const (
Expand Down Expand Up @@ -109,6 +112,16 @@ const (
CustomHeaderPrefix = "infrastructure.cluster.x-k8s.io/custom-header-"
)

const (
// StackAPIVersionProfile is the API version profile to set for ARM clients. See:
// https://learn.microsoft.com/en-us/azure-stack/user/azure-stack-profiles-azure-resource-manager-versions?view=azs-2408#overview-of-the-2020-09-01-hybrid-profile
StackAPIVersionProfile = "2020-06-01"

// StackDiskAPIVersionProfile is the API Version to set for the disk client.
// API Version Profile "2020-06-01" is not supported for disks.
StackDiskAPIVersionProfile = "2018-06-01"
)

var (
// LinuxBootstrapExtensionCommand is the command the VM bootstrap extension will execute to verify Linux nodes bootstrap completes successfully.
LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then echo 'Error joining node to cluster: kubeadm init or join failed. To debug, check the cloud-init, kubelet, or other bootstrap logs: https://capz.sigs.k8s.io/self-managed/troubleshooting.html#checking-cloud-init-logs-ubuntu'; exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
Expand Down Expand Up @@ -367,6 +380,21 @@ func ARMClientOptions(azureEnvironment string, extraPolicies ...policy.Policy) (
opts.Cloud = cloud.AzureChina
case USGovernmentCloudName:
opts.Cloud = cloud.AzureGovernment
case StackCloudName:
cloudEnv, err := azureautorest.EnvironmentFromName(azureEnvironment)
if err != nil {
return nil, fmt.Errorf("unable to get Azure Stack cloud environment: %w", err)
}
opts.APIVersion = StackAPIVersionProfile
opts.Cloud = cloud.Configuration{
ActiveDirectoryAuthorityHost: cloudEnv.ActiveDirectoryEndpoint,
Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
cloud.ResourceManager: {
Audience: cloudEnv.TokenAudience,
Endpoint: cloudEnv.ResourceManagerEndpoint,
},
},
}
case "":
// No cloud name provided, so leave at defaults.
default:
Expand Down
6 changes: 6 additions & 0 deletions azure/errors.go
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,12 @@ func ResourceNotFound(err error) bool {
return errors.As(err, &rerr) && rerr.StatusCode == http.StatusNotFound
}

// BadRequest parses an error to check if it its status code is Bad Request (400).
func BadRequest(err error) bool {
var rerr *azcore.ResponseError
return errors.As(err, &rerr) && rerr.StatusCode == http.StatusBadRequest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a simple unit test for this function in errors_test.go?

}

// VMDeletedError is returned when a virtual machine is deleted outside of capz.
type VMDeletedError struct {
ProviderID string
Expand Down
7 changes: 6 additions & 1 deletion azure/scope/cluster.go
Original file line number Diff line number Diff line change
Expand Up @@ -559,7 +559,7 @@ func (s *ClusterScope) VNetSpec() azure.ASOResourceSpecGetter[*asonetworkv1api20

// PrivateDNSSpec returns the private dns zone spec.
func (s *ClusterScope) PrivateDNSSpec() (zoneSpec azure.ResourceSpecGetter, linkSpec, recordSpec []azure.ResourceSpecGetter) {
if s.IsAPIServerPrivate() {
if s.IsAPIServerPrivate() && !s.IsAzureStack() {
resourceGroup := s.ResourceGroup()
if s.AzureCluster.Spec.NetworkSpec.PrivateDNSZoneResourceGroup != "" {
resourceGroup = s.AzureCluster.Spec.NetworkSpec.PrivateDNSZoneResourceGroup
Expand Down Expand Up @@ -1251,3 +1251,8 @@ func (s *ClusterScope) getLastAppliedSecurityRules(nsgName string) map[string]in
}
return lastAppliedSecurityRules
}

// IsAzureStack returns true if the cluster is running on Azure Stack.
func (s *ClusterScope) IsAzureStack() bool {
return strings.EqualFold(s.Environment.Name, azure.StackCloudName)
}
16 changes: 9 additions & 7 deletions azure/scope/machine.go
Original file line number Diff line number Diff line change
Expand Up @@ -150,7 +150,8 @@ func (m *MachineScope) InitMachineCache(ctx context.Context) error {
}

m.cache.availabilitySetSKU, err = skuCache.Get(ctx, string(armcompute.AvailabilitySetSKUTypesAligned), resourceskus.AvailabilitySets)
if err != nil {
// Resource SKU API for availability sets may not be available in Azure Stack environments.
if err != nil && !strings.EqualFold(m.CloudEnvironment(), azure.StackCloudName) {
return errors.Wrapf(err, "failed to get availability set SKU %s in compute api", string(armcompute.AvailabilitySetSKUTypesAligned))
}
}
Expand Down Expand Up @@ -497,12 +498,13 @@ func (m *MachineScope) AvailabilitySetSpec() azure.ResourceSpecGetter {
}

spec := &availabilitysets.AvailabilitySetSpec{
Name: availabilitySetName,
ResourceGroup: m.NodeResourceGroup(),
ClusterName: m.ClusterName(),
Location: m.Location(),
SKU: nil,
AdditionalTags: m.AdditionalTags(),
Name: availabilitySetName,
ResourceGroup: m.NodeResourceGroup(),
ClusterName: m.ClusterName(),
Location: m.Location(),
CloudEnvironment: m.CloudEnvironment(),
SKU: nil,
AdditionalTags: m.AdditionalTags(),
}

if m.cache != nil {
Expand Down
53 changes: 35 additions & 18 deletions azure/services/availabilitysets/spec.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,24 +19,27 @@ package availabilitysets
import (
"context"
"strconv"
"strings"

"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
"github.com/pkg/errors"
"k8s.io/utils/ptr"

infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1beta1"
"sigs.k8s.io/cluster-api-provider-azure/azure"
"sigs.k8s.io/cluster-api-provider-azure/azure/converters"
"sigs.k8s.io/cluster-api-provider-azure/azure/services/resourceskus"
)

// AvailabilitySetSpec defines the specification for an availability set.
type AvailabilitySetSpec struct {
Name string
ResourceGroup string
ClusterName string
Location string
SKU *resourceskus.SKU
AdditionalTags infrav1.Tags
Name string
ResourceGroup string
ClusterName string
Location string
CloudEnvironment string
SKU *resourceskus.SKU
AdditionalTags infrav1.Tags
}

// ResourceName returns the name of the availability set.
Expand Down Expand Up @@ -64,20 +67,10 @@ func (s *AvailabilitySetSpec) Parameters(_ context.Context, existing interface{}
return nil, nil
}

if s.SKU == nil {
return nil, errors.New("unable to get required availability set SKU from machine cache")
}

var faultDomainCount *int32
faultDomainCountStr, ok := s.SKU.GetCapability(resourceskus.MaximumPlatformFaultDomainCount)
if !ok {
return nil, errors.Errorf("unable to get required availability set SKU capability %s", resourceskus.MaximumPlatformFaultDomainCount)
}
count, err := strconv.ParseInt(faultDomainCountStr, 10, 32)
faultDomainCount, err := getFaultDomainCount(s.SKU, s.CloudEnvironment)
if err != nil {
return nil, errors.Wrapf(err, "unable to parse availability set fault domain count")
return nil, err
}
faultDomainCount = ptr.To[int32](int32(count))

asParams := armcompute.AvailabilitySet{
SKU: &armcompute.SKU{
Expand All @@ -98,3 +91,27 @@ func (s *AvailabilitySetSpec) Parameters(_ context.Context, existing interface{}

return asParams, nil
}

func getFaultDomainCount(sku *resourceskus.SKU, cloudEnvironment string) (*int32, error) {
// Azure Stack environments may not implement the resource SKU API
// for availability sets. Use a default value instead.
if strings.EqualFold(cloudEnvironment, azure.StackCloudName) {
return ptr.To(int32(2)), nil
}
Comment on lines +98 to +100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a unit test case for when cloud environment is Azure Stack to TestParameters in spec_test.go?


if sku == nil {
return nil, errors.New("unable to get required availability set SKU from machine cache")
}

var faultDomainCount *int32
faultDomainCountStr, ok := sku.GetCapability(resourceskus.MaximumPlatformFaultDomainCount)
if !ok {
return nil, errors.Errorf("unable to get required availability set SKU capability %s", resourceskus.MaximumPlatformFaultDomainCount)
}
count, err := strconv.ParseInt(faultDomainCountStr, 10, 32)
if err != nil {
return nil, errors.Wrapf(err, "unable to parse availability set fault domain count")
}
faultDomainCount = ptr.To(int32(count))
return faultDomainCount, nil
}
4 changes: 4 additions & 0 deletions azure/services/disks/client.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ package disks

import (
"context"
"strings"
"time"

"github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime"
Expand All @@ -38,6 +39,9 @@ type azureClient struct {
// newClient creates a new disks client from an authorizer.
func newClient(auth azure.Authorizer, apiCallTimeout time.Duration) (*azureClient, error) {
opts, err := azure.ARMClientOptions(auth.CloudEnvironment())
if strings.EqualFold(auth.CloudEnvironment(), azure.StackCloudName) {
opts.APIVersion = azure.StackDiskAPIVersionProfile
}
if err != nil {
return nil, errors.Wrap(err, "failed to create disks client options")
}
Expand Down
7 changes: 7 additions & 0 deletions azure/services/publicips/publicips.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ package publicips

import (
"context"
"strings"

"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v4"
"github.com/pkg/errors"
Expand Down Expand Up @@ -151,6 +152,12 @@ func (s *Service) Delete(ctx context.Context) error {
// isIPManaged returns true if the IP has an owned tag with the cluster name as value,
// meaning that the IP's lifecycle is managed.
func (s *Service) isIPManaged(ctx context.Context, spec azure.ResourceSpecGetter) (bool, error) {
if strings.EqualFold(s.Scope.CloudEnvironment(), azure.StackCloudName) {
// Azure Stack does not yet support getting tags with scope,
// so assume IPs are managed.
return true, nil
}

scope := azure.PublicIPID(s.Scope.SubscriptionID(), spec.ResourceGroupName(), spec.ResourceName())
result, err := s.TagsGetter.GetAtScope(ctx, scope)
if err != nil {
Expand Down
11 changes: 9 additions & 2 deletions azure/services/virtualmachines/client.go
Original file line number Diff line number Diff line change
Expand Up @@ -109,14 +109,21 @@ func (ac *AzureClient) CreateOrUpdateAsync(ctx context.Context, spec azure.Resou
// request to Azure and if accepted without error, the func will return a Poller which can be used to track the ongoing
// progress of the operation.
func (ac *AzureClient) DeleteAsync(ctx context.Context, spec azure.ResourceSpecGetter, resumeToken string) (poller *runtime.Poller[armcompute.VirtualMachinesClientDeleteResponse], err error) {
ctx, _, done := tele.StartSpanWithLogger(ctx, "virtualmachines.AzureClient.Delete")
ctx, log, done := tele.StartSpanWithLogger(ctx, "virtualmachines.AzureClient.Delete")
defer done()

forceDelete := ptr.To(true)
opts := &armcompute.VirtualMachinesClientBeginDeleteOptions{ResumeToken: resumeToken, ForceDeletion: forceDelete}
poller, err = ac.virtualmachines.BeginDelete(ctx, spec.ResourceGroupName(), spec.ResourceName(), opts)
if err != nil {
return nil, err
if azure.BadRequest(err) {
log.Info("Failed to Begin VM Delete with Force Deletion, retrying without the force flag")
opts.ForceDeletion = ptr.To(false)
poller, err = ac.virtualmachines.BeginDelete(ctx, spec.ResourceGroupName(), spec.ResourceName(), opts)
}
if err != nil {
return nil, err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good improvement, but is it related to Azure Stack support? What problem is it trying to solve?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, azure stack throws a 400 error that says the force flag is not supported... Should I add a comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops I think I forgot to add the comment. This is specified in the commit message, but I will add a code comment in my next pass.

}

ctx, cancel := context.WithTimeout(ctx, ac.apiCallTimeout)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,7 @@ spec:
- GermanCloud: "AzureGermanCloud"
- PublicCloud: "AzurePublicCloud"
- USGovernmentCloud: "AzureUSGovernmentCloud"
- StackCloud: "AzureStackCloud"

Note that values other than the default must also be accompanied by corresponding changes to the
aso-controller-settings Secret to configure ASO to refer to the non-Public cloud. ASO currently does
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ spec:
- GermanCloud: "AzureGermanCloud"
- PublicCloud: "AzurePublicCloud"
- USGovernmentCloud: "AzureUSGovernmentCloud"
- StackCloud: "AzureStackCloud"

Note that values other than the default must also be accompanied by corresponding changes to the
aso-controller-settings Secret to configure ASO to refer to the non-Public cloud. ASO currently does
Expand Down
12 changes: 11 additions & 1 deletion controllers/azuremachine_reconciler.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ package controllers

import (
"context"
"strings"

"github.com/pkg/errors"

Expand Down Expand Up @@ -101,10 +102,19 @@ func newAzureMachineService(machineScope *scope.MachineScope) (*azureMachineServ
virtualmachinesSvc,
roleAssignmentsSvc,
vmextensionsSvc,
tagsSvc,
},
skuCache: cache,
}

// The tags service fails in Azure Stack because the current SDK implementation
// will throw an error when trying to get tags at scope on Azure Stack environments.
// This means tags can only be provided on Azure Stack machines at creation time
// and will not be reconciled day-2. Once the get-tags-at-scope SDK issue is
// addressed, this change can be reverted to add tagsSvc in all environments.
if !strings.EqualFold(machineScope.CloudEnvironment(), azure.StackCloudName) {
ams.services = append(ams.services, tagsSvc)
}

ams.Reconcile = ams.reconcile
ams.Pause = ams.pause
ams.Delete = ams.delete
Expand Down
Loading