Skip to content

Disable CC resource goals when resource capacities are not set. #11465

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kyguy
Copy link
Member

@kyguy kyguy commented May 22, 2025

Type of change

  • Enhancement / new feature

Description

This PR adds logic to the Strimzi Cluster Operator to remove resource related Cruise Control goals from the default.goals and hard.goals lists by default if the capacity configurations for those resources are not explicitly set in the .spec.kafka.resources or .spec.cruiseControl.brokerCapacity sections of the Kafka custom resource. The following Cruise Control goals are affected:

com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal
com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal

This will prevent users from balancing partitions based on compute resources without defining capacities for those compute resources in their Kafka custom resource.

Addresses the issue raised here: #11409

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@kyguy kyguy added this to the 0.47.0 milestone May 22, 2025
@kyguy kyguy force-pushed the disable-goals branch 4 times, most recently from 68ccb06 to 00f270d Compare June 2, 2025 22:16
@kyguy kyguy force-pushed the disable-goals branch 6 times, most recently from 13781bf to 967c662 Compare June 10, 2025 20:05
Signed-off-by: Kyle Liberti <kliberti.us@gmail.com>
@kyguy kyguy marked this pull request as ready for review June 11, 2025 13:17
@kyguy kyguy requested review from a team, tomncooper, fvaleri and ShubhamRwt June 11, 2025 14:09
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I suggested in one of the comments. Resources are not configured in the Kafka CR anymore. It would be also good to understand what is the plan when some nodes have resources defined and some not. What will you do then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these classes needed (maybe they are, but I'm not really sure)? And why are they needed here in this module (I'm actually pretty sure they are not)?

Copy link
Member Author

@kyguy kyguy Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I have refactored the enums into the ResourceRequirementsUtils class and moved it into the cluster-operator module where it is used in the most recent commit. Please have a look when you get a chance and let me know if it needs more work.

Comment on lines 7 to 9
/**
* Enum representing Kubernetes resource types used in container resource specifications.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not use it in the places where we work with Kubernetes resources. So this Javadoc is wrong. If it is needed, You should properly explain why and where because this is completely insufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is not accurate. I have refactored this class a bit in the latest commit in attempt to make it's purpose more clear. Please have a look and let me know if the changes make sense or if it needs more work.

package io.strimzi.operator.common.model.resourcerequirements;

/**
* Enum representing the types of resource requirements used in Kubernetes container specifications.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not use it in the places where we work with Kubernetes resources. So this Javadoc is wrong. If it is needed, You should properly explain why and where because this is completely insufficient.

Copy link
Member Author

@kyguy kyguy Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is not accurate. I have refactored this class a bit in the latest commit in attempt to make it's purpose more clear. Please have a look and let me know if the changes make sense or if it needs more work.

Comment on lines 147 to 149
if (this == CPU && CpuCapacity.cpuRequestsMatchLimits(kafkaBrokerResources)) {
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this important? The CPiu capacity is configured even if request and limit are not the same.

Copy link
Member Author

@kyguy kyguy Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that each broker's CPU capacity has been explicitly defined as a guaranteed resource - neither more or less CPU resources - enabling more accurate rebalancing decisions based on CPU availability.

The CPU capacity is configured even if request and limit are not the same.

Yes, we still set the CPU capacity to the configured requests or limits even if they are not the same in case the user sets the CPU-dependent Cruise Control goals themselves. Although the rebalances won't be as accurate as if the requests == limits, it'll for sure be better than the default CPU capacity value.

Added a comment above the conditional, let me know if it is suitable!

Signed-off-by: Kyle Liberti <kliberti.us@gmail.com>
@kyguy
Copy link
Member Author

kyguy commented Jun 12, 2025

It would be also good to understand what is the plan when some nodes have resources defined and some not. What will you do then?

Thanks for the catch (and review), we only want to consider the CPU capacity for a cluster to be considered to be properly configured for CPU-dependent Cruise Control goals if there is a capacity value explicitly configured for every broker whether through the general capacity or broker specific overrides, or the resource requirements. So if none of these are defined for a broker, including the resource requirement, we want to disable the CPU-dependent Cruise Control goals.

I had wrongly assumed that all the node pools had their resources defined in the kafkaBrokerResources map, I have updated the logic to disable the CPU-dependent Cruise Control goals if there is a broker in the kafkaBrokerNodes set that doesn't have it's node pool (and resource requirements) defined in the kafkaBrokerResources map.

Signed-off-by: Kyle Liberti <kliberti.us@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants