Skip to content

Conversation

@frank-spano
Copy link

@frank-spano frank-spano commented Oct 17, 2025

What does this PR do?

Add rbac for controllerrevisions, and enable controllerrevisions collections in ksm check for agents > 7.72.x

Motivation

DataDog/datadog-agent#40864

Additional Notes

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Deployed operator to minikube, tested with agents 7.71.0, and a few rc's 7.72.0-x.

For QA, the branch should be checked out and operator deployed. Then, you'll want to test deploying the agent with a version > 7.72.0, and < 7.72.0. In both cases, the ksm check should run with no errors. If you run agent config check, and look in the resources section, you should see controllerrevisions present for agents > 7.72.0

kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <apikey>
      appKey: <appkey>
  features:
    kubeStateMetricsCore:
      enabled: true
  override:
    nodeAgent:
      image:
       #name: gcr.io/datadoghq/agent:7.71.0
        name: gcr.io/datadoghq/agent:7.72.0-rc.8
      tolerations:
        - operator: Exists
    clusterAgent:
      image:
        #name: gcr.io/datadoghq/agent:7.71.0
        name: gcr.io/datadoghq/agent:7.72.0-rc.8
      tolerations:
        - operator: Exists

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label

@frank-spano frank-spano requested review from a team as code owners October 17, 2025 16:56
@frank-spano frank-spano added qa/skip-qa enhancement New feature or request labels Oct 17, 2025
@codecov-commenter
Copy link

codecov-commenter commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 34.14634% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.17%. Comparing base (e64a212) to head (4834283).
⚠️ Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
pkg/utils/version.go 0.00% 23 Missing ⚠️
pkg/testutils/builder.go 0.00% 21 Missing ⚠️
...atadogagent/feature/kubernetesstatecore/feature.go 71.42% 6 Missing and 4 partials ⚠️

❌ Your patch status has failed because the patch coverage (34.14%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2249      +/-   ##
==========================================
- Coverage   39.08%   38.17%   -0.91%     
==========================================
  Files         254      256       +2     
  Lines       26307    21453    -4854     
==========================================
- Hits        10282     8190    -2092     
+ Misses      15412    12635    -2777     
- Partials      613      628      +15     
Flag Coverage Δ
unittests 38.17% <34.14%> (-0.91%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...adogagent/feature/kubernetesstatecore/configmap.go 97.22% <100.00%> (+1.01%) ⬆️
...r/datadogagent/feature/kubernetesstatecore/rbac.go 100.00% <100.00%> (ø)
internal/controller/datadogagent_controller.go 66.66% <ø> (+0.59%) ⬆️
...atadogagent/feature/kubernetesstatecore/feature.go 83.62% <71.42%> (+12.09%) ⬆️
pkg/testutils/builder.go 0.00% <0.00%> (ø)
pkg/utils/version.go 30.55% <0.00%> (-52.78%) ⬇️

... and 234 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e64a212...4834283. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tbavelier tbavelier modified the milestones: v1.20.0, v1.21.0 Oct 20, 2025
Copy link
Member

@tbavelier tbavelier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't reviewed it (code wise), but tests should be fixed first: https://gitlab.ddbuild.io/DataDog/datadog-operator/-/jobs/1185963278#L641

        	Error:      	Should be empty, but was resource controllerrevisions does not exist. Available resources: endpoints,persistentvolumeclaims,apiextensions.k8s.io/v1, Resource=customresourcedefinitions,core/v1, Resource=nodes_extended,configmaps,namespaces,poddisruptionbudgets,rolebindings,secrets,volumeattachments,datadoghq.com/v2alpha1, Resource=datadogagents,clusterrolebindings,mutatingwebhookconfigurations,replicasets,roles,storageclasses,certificatesigningrequests,endpointslices,leases,nodes,services,apiregistration.k8s.io/v1, Resource=apiservices,apps/v1, Resource=deployments_extended,horizontalpodautoscalers,ingresses,pods,replicationcontrollers,statefulsets,batch/v1, Resource=jobs_extended,resourcequotas,validatingwebhookconfigurations,core/v1, Resource=pods_extended,clusterroles,cronjobs,daemonsets,jobs,limitranges,ingressclasses,networkpolicies,persistentvolumes,serviceaccounts,autoscaling.k8s.io/v1, Resource=verticalpodautoscalers,apps/v1, Resource=replicasets_extended,deployments

Moreover, QA instructions should be added on top of the unit/e2e tests if those cover the addition, thanks!

// This requires agent version 7.72.0 or later.
// Default: false
// +optional
CollectControllerRevisions *bool `json:"collectControllerRevisions,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question: Do we want to expose a configuration option that enables the collection of ControllerRevision resources — or should we instead use these resources internally to derive higher-level insights?

Based on our previous discussion, this seems related to exposing rollout duration for DaemonSets and StatefulSets.

If that’s the case, I wonder whether the new setting should be framed around the benefit or outcome (e.g., enabling rollout duration metrics) rather than the low-level implementation detail (collecting ControllerRevisions).

What do you think? Open to discussion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clamoriniere So the collection of DaemonSets and StatefulSets will still occur by default, it's just that the "start time" of the deployment would be less accurate without controllerrevisions access. However, this flag was meant as more of a killswitch than anything else, because enabling this on agent < 7.72.0 breaks the KSM check, or if there was a custom version that was 7.72 but tagged as something non-parsable, we wanted to be able to turn it on manually. I'm open to removing this option and just strictly having it version gated in the feature.go code.

Also - and this is something I'd probably rather not do (hence being extra careful with the operator and helm-chart deployments) - there's the option of modifying the agent code to NOT fail the ksm check if it's configured to collect controllerrevisions but doesn't have access. We just remove it from the list and continue normal function.

Happy to hear your thoughts

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, that makes sense.

I think a good approach would be to always add the RBAC for controllerrevisions when the KSM check is enabled. Then, starting from the next Agent/Cluster-Agent version, we could detect at runtime whether the permission is available. If it isn’t, we simply skip generating the metrics that rely on controllerrevisions.

This way, the check can still run successfully without introducing a new configuration parameter, which should provide a smoother experience for users and reduce long-term config maintenance.

Additionally, we could emit a check configuration warning when the RBAC is missing — the check would continue running, but the user would see a warning in the Agent status suggesting they add the missing permission or upgrade their Operator/Helm chart version.

I realize this involves a bit more work, but it would likely result in a better UX and a cleaner setup overall.
let me know what is your opinion?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So based on this PR the rbac will always be enabled, regardless of agent version. As far as the agent code changes, I may have misspoke - newer agents that don't have controllerrevisions enabled will still function, it's just that the start times for statefulsets will be less accurate. The case where the ksm check would break is in older agents < 7.72.0, if we enable collection of controllerrevisions (adding the rbac is fine). Unfortunately we can't fix that in the code, so we just have to make sure it's not enabled for those agents.

I'm not against the runtime check for resources in the agent, but it won't help us with this release, so I think it can be something we implement in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason for having a config option in the CRD as opposed to automatically enabling controller revisions metric collection based on the agent version? I'd prefer to keep the feature config options simple and limited since a user can always provide a custom config if they want more fine grained control over what's collected

@tbavelier tbavelier dismissed their stale review October 27, 2025 16:36

tests were updated/are now passing. I never took the time to review the code itself

// This requires agent version 7.72.0 or later.
// Default: false
// +optional
CollectControllerRevisions *bool `json:"collectControllerRevisions,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason for having a config option in the CRD as opposed to automatically enabling controller revisions metric collection based on the agent version? I'd prefer to keep the feature config options simple and limited since a user can always provide a custom config if they want more fine grained control over what's collected

const crdAPIServiceCollectionMinVersion = "7.46.0-0"

// Minimum agent version that supports collection of controllerrevisions
const controllerRevisionsCollectionMinVersion = "7.72.0-0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this and crdAPIServiceCollectionMinVersion are both about minimum agent versions, I think it makes sense to group them together as constants

const (
  first = "aaa"
  second = "bbb"
)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will group

var output feature.RequiredComponents

if ddaSpec.Features != nil && ddaSpec.Features.KubeStateMetricsCore != nil && apiutils.BoolValue(ddaSpec.Features.KubeStateMetricsCore.Enabled) {
f.logger.Info("KubeStateMetricsCore feature enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already log out enabled features elsewhere. Could you also remove the other log lines? We don't generally log whether a config is set or not

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing

f.serviceAccountName = constants.GetClusterAgentServiceAccount(dda.GetName(), ddaSpec)

// Determine CollectControllerRevisions setting
// Priority: 1) Explicit spec setting, 2) Image override version check, 3) Default image version check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to have complicated logic for this and the behavior is not consistent with other config fields in the CRD. I think we should auto enable based on agent version

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this logic, and for having the variable in the config rather than just having it enabled by version is that having the wrong value breaks the ksm check in older agents, and reduces the accuracy in new versions. So we added it as an override if there was a custom image tag or some situation where the version wasn't parsing properly. @sblumenthal Given that we simplified the helm chart to just use the agent version, should we remove this logic and just do a simple version check here as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally okay with removing it... it would remove our ability to explicitly have this part of the check be enabled when deploying an image built off of a branch from the datadog-agent repo, but in reality I don't think that is a scenario we currently find ourselves using

// IsAboveMinVersionWithFallback uses semver to check if `version` is >= minVersion.
// If semver parsing fails, it falls back to parsing major.minor versions only.
// For versions that can't be parsed at all, it returns false (conservative approach).
func IsAboveMinVersionWithFallback(version, minVersion string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we create X.Y formatted agent image tags or are there other formats that you're looking to cover with this function that we don't with IsAboveMinVersion? We should keep version checking behavior consistent and use a single function since we use version checking multiple times in the operator code. It isn't intuitive if we have two similar functions where one defaults to true and the other false for the same image tag. If there are standard agent image tag formats we're not parsing correctly I think it's better to change IsAboveMinVersion directly

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added for extra sanity, but can be removed if we decide to enable based on version only.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second thought I'd like to leave this function. IsAboveMinVersion(..) returns true if the version is non-semver. In our case, we want to default to false if we can't parse the version, as enabling controllerrevisions in agents < 7.72.0 will break the check, where as the opposite is not as damaging.

@frank-spano
Copy link
Author

@khewonc pushed some changes if you wouldn't mind having a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request qa/skip-qa

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants