add controllerrevisions collection in ksm check for agents > 7.72.x #2249
base: main
Conversation
Codecov Report

❌ Your patch status has failed because the patch coverage (34.14%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2249      +/-   ##
==========================================
- Coverage   39.08%   38.17%   -0.91%
==========================================
  Files         254      256       +2
  Lines       26307    21453    -4854
==========================================
- Hits        10282     8190    -2092
+ Misses      15412    12635    -2777
- Partials      613      628      +15
```

Flags with carried forward coverage won't be shown. 234 files have indirect coverage changes; see the full report in Codecov.
Haven't reviewed it (code wise), but tests should be fixed first: https://gitlab.ddbuild.io/DataDog/datadog-operator/-/jobs/1185963278#L641
```
Error: Should be empty, but was resource controllerrevisions does not exist. Available resources: endpoints,persistentvolumeclaims,apiextensions.k8s.io/v1, Resource=customresourcedefinitions,core/v1, Resource=nodes_extended,configmaps,namespaces,poddisruptionbudgets,rolebindings,secrets,volumeattachments,datadoghq.com/v2alpha1, Resource=datadogagents,clusterrolebindings,mutatingwebhookconfigurations,replicasets,roles,storageclasses,certificatesigningrequests,endpointslices,leases,nodes,services,apiregistration.k8s.io/v1, Resource=apiservices,apps/v1, Resource=deployments_extended,horizontalpodautoscalers,ingresses,pods,replicationcontrollers,statefulsets,batch/v1, Resource=jobs_extended,resourcequotas,validatingwebhookconfigurations,core/v1, Resource=pods_extended,clusterroles,cronjobs,daemonsets,jobs,limitranges,ingressclasses,networkpolicies,persistentvolumes,serviceaccounts,autoscaling.k8s.io/v1, Resource=verticalpodautoscalers,apps/v1, Resource=replicasets_extended,deployments
```

Moreover, QA instructions should be added on top of the unit/e2e tests if those cover the addition, thanks!
```go
// This requires agent version 7.72.0 or later.
// Default: false
// +optional
CollectControllerRevisions *bool `json:"collectControllerRevisions,omitempty"`
```
Open question: Do we want to expose a configuration option that enables the collection of ControllerRevision resources — or should we instead use these resources internally to derive higher-level insights?
Based on our previous discussion, this seems related to exposing rollout duration for DaemonSets and StatefulSets.
If that’s the case, I wonder whether the new setting should be framed around the benefit or outcome (e.g., enabling rollout duration metrics) rather than the low-level implementation detail (collecting ControllerRevisions).
What do you think? Open to discussion.
@clamoriniere So the collection of DaemonSets and StatefulSets will still occur by default; it's just that the "start time" of the deployment would be less accurate without controllerrevisions access. However, this flag was meant as more of a killswitch than anything else: enabling this on agents < 7.72.0 breaks the KSM check, and if there were a custom version that was effectively 7.72 but tagged as something non-parsable, we wanted to be able to turn it on manually. I'm open to removing this option and strictly version-gating it in the feature.go code.
Also - and this is something I'd probably rather not do (hence being extra careful with the operator and helm-chart deployments) - there's the option of modifying the agent code to NOT fail the KSM check if it's configured to collect controllerrevisions but doesn't have access; we'd just remove it from the list and continue normal function.
Happy to hear your thoughts
Thanks for the explanation, that makes sense.
I think a good approach would be to always add the RBAC for controllerrevisions when the KSM check is enabled. Then, starting from the next Agent/Cluster-Agent version, we could detect at runtime whether the permission is available. If it isn’t, we simply skip generating the metrics that rely on controllerrevisions.
This way, the check can still run successfully without introducing a new configuration parameter, which should provide a smoother experience for users and reduce long-term config maintenance.
Additionally, we could emit a check configuration warning when the RBAC is missing — the check would continue running, but the user would see a warning in the Agent status suggesting they add the missing permission or upgrade their Operator/Helm chart version.
I realize this involves a bit more work, but it would likely result in a better UX and a cleaner setup overall.
Let me know what you think.
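The degrade-gracefully behavior suggested above could be sketched roughly like this (all names here are hypothetical, not the Agent's actual code; the real implementation would discover the granted permissions via the API server, e.g. with a SelfSubjectAccessReview, at check startup):

```go
package main

import "fmt"

// ksmCheck is an illustrative stand-in for the KSM check's state.
// allowedResources would be populated from the API server's answer about
// which permissions the service account actually holds.
type ksmCheck struct {
	allowedResources map[string]bool
	warnings         []string
}

// collectableResources filters the configured resource list down to what the
// RBAC actually permits, recording a warning instead of failing the check.
func (c *ksmCheck) collectableResources(configured []string) []string {
	out := make([]string, 0, len(configured))
	for _, r := range configured {
		if c.allowedResources[r] {
			out = append(out, r)
			continue
		}
		c.warnings = append(c.warnings,
			fmt.Sprintf("skipping %q: missing RBAC; add the permission or upgrade your Operator/Helm chart", r))
	}
	return out
}

func main() {
	check := &ksmCheck{allowedResources: map[string]bool{"daemonsets": true, "statefulsets": true}}
	// controllerrevisions is configured but not permitted: it is skipped
	// with a warning, and the check keeps running on the remaining resources.
	fmt.Println(check.collectableResources([]string{"daemonsets", "statefulsets", "controllerrevisions"}))
	fmt.Println(check.warnings)
}
```

The warning would then surface in the Agent status, as proposed, without a new configuration parameter.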
So based on this PR, the RBAC will always be enabled, regardless of agent version. As for the agent code changes, I may have misspoken: newer agents that don't have controllerrevisions enabled will still function; it's just that the start times for statefulsets will be less accurate. The case where the KSM check would break is older agents (< 7.72.0) if we enable collection of controllerrevisions (adding the RBAC is fine). Unfortunately we can't fix that in the code, so we just have to make sure it's not enabled for those agents.
I'm not against the runtime check for resources in the agent, but it won't help us with this release, so I think it can be something we implement in the future?
Is there a particular reason for having a config option in the CRD as opposed to automatically enabling controller revisions metric collection based on the agent version? I'd prefer to keep the feature config options simple and limited since a user can always provide a custom config if they want more fine grained control over what's collected
tests were updated/are now passing. I never took the time to review the code itself
```go
const crdAPIServiceCollectionMinVersion = "7.46.0-0"

// Minimum agent version that supports collection of controllerrevisions
const controllerRevisionsCollectionMinVersion = "7.72.0-0"
```
Since this and crdAPIServiceCollectionMinVersion are both about minimum agent versions, I think it makes sense to group them together as constants
```go
const (
	first  = "aaa"
	second = "bbb"
)
```
will group
```go
var output feature.RequiredComponents

if ddaSpec.Features != nil && ddaSpec.Features.KubeStateMetricsCore != nil && apiutils.BoolValue(ddaSpec.Features.KubeStateMetricsCore.Enabled) {
	f.logger.Info("KubeStateMetricsCore feature enabled")
```
We already log out enabled features elsewhere. Could you also remove the other log lines? We don't generally log whether a config is set or not
removing
```go
f.serviceAccountName = constants.GetClusterAgentServiceAccount(dda.GetName(), ddaSpec)

// Determine CollectControllerRevisions setting
// Priority: 1) Explicit spec setting, 2) Image override version check, 3) Default image version check
```
I don't think we need to have complicated logic for this and the behavior is not consistent with other config fields in the CRD. I think we should auto enable based on agent version
The reason for this logic, and for having the variable in the config rather than just having it enabled by version is that having the wrong value breaks the ksm check in older agents, and reduces the accuracy in new versions. So we added it as an override if there was a custom image tag or some situation where the version wasn't parsing properly. @sblumenthal Given that we simplified the helm chart to just use the agent version, should we remove this logic and just do a simple version check here as well?
I am personally okay with removing it... it would remove our ability to explicitly have this part of the check be enabled when deploying an image built off of a branch from the datadog-agent repo, but in reality I don't think that is a scenario we currently find ourselves using
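For reference, the priority order described in the diff comment boils down to something like the following (function names are illustrative and not the PR's actual code; `versionAtLeast` stands in for the operator's real version-comparison helper):

```go
package main

import "fmt"

const controllerRevisionsCollectionMinVersion = "7.72.0-0"

// shouldCollect resolves the CollectControllerRevisions setting with the
// priority described above: 1) explicit spec value, then 2)/3) the resolved
// image version (override or default). Illustrative sketch only.
func shouldCollect(explicit *bool, agentVersion string, versionAtLeast func(v, min string) bool) bool {
	if explicit != nil {
		return *explicit // explicit spec setting always wins
	}
	return versionAtLeast(agentVersion, controllerRevisionsCollectionMinVersion)
}

func main() {
	disabled := false
	// A toy lexicographic comparison stands in for real semver handling here.
	atLeast := func(v, min string) bool { return v >= min }
	fmt.Println(shouldCollect(&disabled, "7.73.0", atLeast)) // false: explicit override wins
	fmt.Println(shouldCollect(nil, "7.73.0", atLeast))       // true: falls through to the version check
}
```

Dropping the explicit override, as discussed above, would collapse this to just the version check.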
```go
// IsAboveMinVersionWithFallback uses semver to check if `version` is >= minVersion.
// If semver parsing fails, it falls back to parsing major.minor versions only.
// For versions that can't be parsed at all, it returns false (conservative approach).
func IsAboveMinVersionWithFallback(version, minVersion string) bool {
```
Do we create X.Y formatted agent image tags or are there other formats that you're looking to cover with this function that we don't with IsAboveMinVersion? We should keep version checking behavior consistent and use a single function since we use version checking multiple times in the operator code. It isn't intuitive if we have two similar functions where one defaults to true and the other false for the same image tag. If there are standard agent image tag formats we're not parsing correctly I think it's better to change IsAboveMinVersion directly
This was added for extra sanity, but can be removed if we decide to enable based on version only.
Actually, on second thought, I'd like to keep this function. IsAboveMinVersion(..) returns true if the version is non-semver. In our case, we want to default to false if we can't parse the version, as enabling controllerrevisions in agents < 7.72.0 will break the check, whereas the opposite is not as damaging.
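A conservative fallback along those lines could look like this (a stdlib-only sketch, assuming the major.minor-with-false-default behavior described above; this is not the operator's actual `IsAboveMinVersionWithFallback` implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// digitsPrefix returns the leading run of ASCII digits in s.
func digitsPrefix(s string) string {
	i := 0
	for i < len(s) && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	return s[:i]
}

// parseMajorMinor extracts "major.minor" from a version string, tolerating
// suffixes like "7.72.0-rc.3" or "7.72-custom". ok is false when even that
// much can't be recovered.
func parseMajorMinor(v string) (major, minor int, ok bool) {
	v = strings.TrimPrefix(v, "v")
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, false
	}
	major, err := strconv.Atoi(digitsPrefix(parts[0]))
	if err != nil {
		return 0, 0, false
	}
	minor, err = strconv.Atoi(digitsPrefix(parts[1]))
	if err != nil {
		return 0, 0, false
	}
	return major, minor, true
}

// isAboveMinVersionConservative reports whether version >= minVersion on
// major.minor alone, returning false (not true) for unparsable versions:
// the safe default, since enabling controllerrevisions collection on an
// agent < 7.72.0 breaks the KSM check.
func isAboveMinVersionConservative(version, minVersion string) bool {
	vMaj, vMin, ok := parseMajorMinor(version)
	if !ok {
		return false
	}
	mMaj, mMin, ok := parseMajorMinor(minVersion)
	if !ok {
		return false
	}
	return vMaj > mMaj || (vMaj == mMaj && vMin >= mMin)
}

func main() {
	fmt.Println(isAboveMinVersionConservative("7.72.0-rc.3", "7.72.0-0"))  // true
	fmt.Println(isAboveMinVersionConservative("7.71.0", "7.72.0-0"))       // false
	fmt.Println(isAboveMinVersionConservative("custom-build", "7.72.0-0")) // false: unparsable defaults off
}
```

The key design choice is the last branch: an unparsable tag disables the feature rather than enabling it, inverting the usual IsAboveMinVersion default.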
@khewonc pushed some changes if you wouldn't mind having a look
What does this PR do?
Add RBAC for controllerrevisions, and enable controllerrevisions collection in the KSM check for agents > 7.72.x.
Motivation
DataDog/datadog-agent#40864
Additional Notes
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required? Yes: Agent 7.72.0 or later for controllerrevisions collection; older agents keep the existing behavior.
Describe your test plan
Deployed operator to minikube, tested with agents 7.71.0, and a few rc's 7.72.0-x.
For QA, the branch should be checked out and the operator deployed. Then deploy the agent with a version > 7.72.0 and with a version < 7.72.0. In both cases, the KSM check should run with no errors. If you run `agent config check` and look in the resources section, you should see controllerrevisions present for agents > 7.72.0.

Checklist
- PR has the appropriate labels: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
- PR has the qa/skip-qa label if applicable