Description
Is your feature request related to a problem? Please describe.
Since Rancher v2.6.11, some RKE clusters in AWS must set a newly introduced option named useInstanceMetadataHostname under rancherKubernetesEngineConfig.cloudProvider.awsCloudProvider in the Rancher cluster config.
Upgrading from Rancher v2.6.10 to v2.6.11 without setting it will break those clusters, as we learned the hard way: all nodes in the cluster lost most TCP connectivity to each other and from the AWS Load Balancer. For now, we've set useInstanceMetadataHostname to true manually in our cluster config via the Rancher UI.
I can't seem to find anything related to this in the rancher2_cluster resource in Terraform, making it impossible to create new clusters or update existing clusters via Terraform in our case.
Some related reading and issues:
https://github.com/rancher/rancher/releases/tag/v2.6.11 (see last item under Rancher Behavioural Changes)
#22416
#37634
Note that in the issues and the Rancher v2.6.11 release notes mentioned above, the new option is called useInstanceHostnameMetadata, while in our Rancher cluster config it appeared as useInstanceMetadataHostname with the default value false after upgrading from Rancher v2.6.10 to v2.6.11. So even if we had followed the release notes and changed useInstanceHostnameMetadata from false to true, it wouldn't have helped, because the documented name is wrong.
I also can't find anything in the Rancher documentation about useInstanceHostnameMetadata or useInstanceMetadataHostname, which makes it hard to find out which one to actually use without first breaking your clusters and then manually setting the newly added boolean to true. I checked https://github.com/rancher/rke1-docs/blob/release/v2.7.2/docs/config-options/cloud-providers/aws/aws.md but couldn't find anything related.
Describe the solution you'd like
- Add useInstanceMetadataHostname to rancher2_cluster cloud_provider resource in Terraform: https://registry.terraform.io/providers/rancher/rancher2/latest/docs/resources/cluster#cloud_provider.
- Add information about this option to the Rancher docs: https://rke.docs.rancher.com/config-options/cloud-providers/aws.
- Make it much clearer that Rancher RKE1 clusters running on EC2 with the AWS Cloud Provider enabled need extra care when upgrading to Rancher >= v2.6.11.
Describe alternatives you've considered
Manually adding --hostname-override to each node, but that doesn't seem to work for the AWS cloud provider either. From https://rke.docs.rancher.com/config-options/nodes#overriding-the-hostname:
"There is an exception for the AWS cloud provider, where the hostname_override field will be explicitly ignored."
Additional context
We should probably have read and tested more carefully, but it's a nasty change, and pushing such a config by default without documenting it properly really hurts. I can imagine we're not the only ones affected by this change, so maybe it should be made clearer, or a warning should even be shown to customers before they touch their cluster config in the Rancher UI/Terraform in Rancher >= v2.6.11?
I'm not even sure what the actual upgrade path would be in this case. I assume (not tested yet) that you can't add the config to a cluster before Rancher v2.6.11 because it didn't exist yet. So should you first update Rancher to v2.6.11, then quickly add useInstanceMetadataHostname: true to all your cluster configs before Rancher starts pushing the default value (false) to them?
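For illustration, here is a minimal Python sketch of that last step. It assumes the cluster config is available as a nested dict that follows the path quoted above (rancherKubernetesEngineConfig.cloudProvider.awsCloudProvider). The function name is hypothetical, and fetching the config from Rancher and writing it back are deliberately left out; the sketch only shows which key has to end up as true and that the name from the release notes should not be relied on.

```python
from copy import deepcopy

# Key as it appeared in our cluster config after the upgrade.
CONFIG_KEY = "useInstanceMetadataHostname"
# Key as written in the release notes; it had no effect for us.
RELEASE_NOTES_KEY = "useInstanceHostnameMetadata"

# Path taken from the description above (an assumption about the dict layout).
AWS_PATH = ("rancherKubernetesEngineConfig", "cloudProvider", "awsCloudProvider")


def enable_metadata_hostname(cluster_config: dict) -> dict:
    """Return a copy of cluster_config with the AWS hostname flag set to True."""
    patched = deepcopy(cluster_config)
    node = patched
    for key in AWS_PATH:
        # Create missing (or null) intermediate sections on the way down.
        if not isinstance(node.get(key), dict):
            node[key] = {}
        node = node[key]
    # Drop the misnamed key in case someone followed the release notes literally.
    node.pop(RELEASE_NOTES_KEY, None)
    node[CONFIG_KEY] = True
    return patched


# Hypothetical config in which the release-notes name was used by mistake.
example = {
    "rancherKubernetesEngineConfig": {
        "cloudProvider": {
            "name": "aws",
            "awsCloudProvider": {RELEASE_NOTES_KEY: True},
        }
    }
}
patched = enable_metadata_hostname(example)
```

The input dict is left untouched, so the patched result can be diffed against the original before anything is pushed back to Rancher.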