Skip to content

Added tolerations support for dcgmexporter and neuronmonitor. #174

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 21, 2024

Conversation

musa-asad
Copy link
Contributor

@musa-asad musa-asad commented May 20, 2024

Description of changes:
In order to support tolerations for aws-observability/helm-charts#41, we need to update these files in the operator to add tolerations for dcgmexporter and neuronmonitor.

How I tested:

  1. I placed these operator changes onto its own ECR repo.
  • I already tested both the new ECR repo operator and helm chart changes for the above pull request.
  • For this, I tested the both the (1) old CRDs + ECR repo operator changes and (2) new CRDs + old operator.

I observed that for (1), tainted nodes never ran any services as expected, but there were no errors. Before setting up (2), I ran into the following error:

W0521 01:12:26.400410    1435 warnings.go:70] unknown field "spec.tolerations"
W0521 01:12:26.444189    1435 warnings.go:70] unknown field "spec.tolerations"

However, it was resolved after updating CRDs with kubectl apply -f helm-charts/charts/amazon-cloudwatch-observability/crds in the updated helm-charts branch, which means there are no issues for both (1) and (2). Also, (2) didn't run changes for the dcgmexporter, which was expected since the change this PR intends to make isn't being applied:

NAME                                                              READY   STATUS    RESTARTS   AGE   IP               NODE                             NOMINATED NODE   READINESS GATES
amazon-cloudwatch-observability-controller-manager-677b54bszxsf   1/1     Running   0          57s   192.168.53.113   ip-192-168-33-152.ec2.internal   <none>           <none>
cloudwatch-agent-g7h96                                            1/1     Running   0          56s   192.168.54.143   ip-192-168-33-152.ec2.internal   <none>           <none>
fluent-bit-g4znc                                                  1/1     Running   0          57s   192.168.33.152   ip-192-168-33-152.ec2.internal   <none>           <none>

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@musa-asad musa-asad requested review from movence and sky333999 May 20, 2024 16:13
@musa-asad musa-asad self-assigned this May 20, 2024
@sky333999
Copy link
Contributor

sky333999 commented May 20, 2024

If someone is still using the old CRDs but run with the new operator after these changes, would it start to fail?
Perhaps run some tests to confirm the behavior in both cases.. old CRD with new operator and new CRD with old operator.

@musa-asad musa-asad merged commit 53a2afa into main May 21, 2024
3 checks passed
@musa-asad musa-asad deleted the default-tolerations branch May 21, 2024 01:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants