Skip to content

[dvc][server] Add exponential backoff to other HelixUtils methods with retries #1784

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

gabrieldrouin
Copy link
Contributor

@gabrieldrouin gabrieldrouin commented May 11, 2025

Problem Statement

The HelixUtils class has been updated in PR 1734 in order to add exponential backoff to improve resiliency against temporary ZK connection issues through a new handleFailedHelixOperation method, compared with the previous implementation which would make immediate retries instead of waiting exponentially between retries.

This PR aims to integrate this new exponential backoff implementation to other methods in HelixUtils with retry logic, that is:

  • getChildren
  • connectHelixManager
  • checkClusterSetup

Additionally, these 3 methods had inconsistent implementation patterns for retry logic, which have been refactored to more closely match the implementation in PR 1734.

Solution

The 3 methods now use exponential backoff retry logic through the handleFailedHelixOperation method.

Code changes

  • handleFailedHelixOperation now handles the case where the caller doesn't need to specify a path to be added to the logger (for connectHelixManger and checkClusterSetup).
  • retryInterval params are removed, as the interval is handled by handleFailedHelixOperation.
  • Refactored the implementation of loops and conditionals for retry logic to more closely match the new implementation in PR 1734.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@gabrieldrouin
Copy link
Contributor Author

gabrieldrouin commented May 11, 2025

The initial commit f780377 adds exponential backoff to 3 methods using handleFailedHelixOperation. I've identified three key issues to address before merging:

  1. Inconsistent exception handling:
  • handleFailedHelixOperation throws ZkDataAccessException when max retries are reached
  • The 3 modified methods handle max retries themselves by throwing a VeniceException

For example:

public static void checkClusterSetup(HelixAdmin admin, String cluster, int retryCount) {
    int attempt = 0;
    while (attempt < retryCount) {
      if (admin.getClusters().contains(cluster)) {
        break;
      } else {
        attempt++;
        // This condition insures ZkDataAccessException is never thrown in handleFailedHelixOperation
        if (attempt < retryCount) {
          handleFailedHelixOperation("", "checkClusterSetup", attempt, retryCount);
        } else {
          throw new VeniceException("Cluster has not been initialized by controller after attempted: " + attempt);
        }
      }
    }
  }

This could be improved, for example, by:

  • Modifying handleFailedHelixOperation to support different exception types depending on the caller
  • Moving all exception throwing outside handleFailedHelixOperation such that any possible case can be handled (ZK-related or not)
  1. Context-specific logging:
  • getChildren logs specific information about expected vs. actual elements
  • This context isn't captured in the current handleFailedHelixOperation function signature

The current suggested implementation logs this info before calling handleFailedHelixOperation, like so:

      if (children.size() != expectedCount) {
        // Data is inconsistent
        attempt++;
        LOGGER.info("Expected number of elements: {}, but got {}.", expectedCount, children.size());
        handleFailedHelixOperation(path, "getChildren", attempt, retryCount);
      } else {
        return children;
      } 

Which seems cleaner than overloading the method with additional context parameters that would complicate the logging logic (most notably, by not being able to directly use the logger's formatting). For example:

String logMessage = String.format(
        "%s failed with path %s on attempt %d/%d.", 
        helixOperation, 
        path, 
        attempt, 
        retryCount);
        
    if (!additionalContext.isEmpty()) {
      logMessage += " " + additionalContext;
    }
    
    logMessage += String.format(" Will retry in %d seconds.", retryIntervalSec);
    
    LOGGER.error(logMessage);
  1. Parameter removal and API changes:
  • The modified methods (connectHelixManager, checkClusterSetup) no longer accept retryInterval parameters
  • This is a breaking change that might create backward compatibility issues
  • There may be call sites passing this param that now contain dead code
  • As such, all call sites should be audited in a future commit

@gabrieldrouin
Copy link
Contributor Author

gabrieldrouin commented May 21, 2025

In 79b224b, as discussed in the May 19th contributor sync, I've replaced uses of VeniceProperties.empty() with veniceConfigLoader.getCombinedProperties() in HelixParticipationService to ensure configs are properly propagated.

I've pushed a commit with this change only, because I encountered an architecture constraint that I would like to discuss to further my understanding before continuing to implement my solution:

  1. I considered passing veniceServiceConfig directly as a param since it has a getRefreshAttemptsForZkReconnect() method.

  2. However, veniceServiceConfig is not available as a dependency in the da-vinci-client.main module (as shown in the screenshot), which HelixParticipationService is part of.

image

  1. Rather than passing just the raw int (which would defeat the purpose of avoiding hardcoded values), using VeniceProperties props as a param and calling props.getInt(REFRESH_ATTEMPTS_FOR_ZK_RECONNECT, 9) inside the constructor does indeed seem to provide the better implementation in this case.

I wanted to share this issue I've encountered in other to:

  • Ensure my reasoning/understanding were correct
  • Better understand why veniceServiceConfig isn't available as a dependency, but values taken from it are sometimes passed in as params, such as with veniceOfflinePushMonitorAccessor in HelixParticipationService:
    veniceOfflinePushMonitorAccessor = new VeniceOfflinePushMonitorAccessor(
        clusterName,
        zkClient,
        new HelixAdapterSerializer(),
        veniceServerConfig.getRegionName(),
        veniceConfigLoader.getCombinedProperties());

@gabrieldrouin
Copy link
Contributor Author

gabrieldrouin commented May 21, 2025

In my next commit, I plan on addressing HelixReadOnlySchemaRepository and other similar classes which, unlike HelixParticipationService, do not contain a field that returns a config from higher in the hierarchy.

It also has a constructor that currently accepts hardcoded values for refreshAttemptsForZkReconnect and refreshIntervalForZkReconnectInMs that will be removed.

I believe that I will have to further my understanding of how modules interact with each other to determine the best approach for propagating configs to classes that don't have direct access to higher-level VeniceProperties instances at the moment.

@kvargha
Copy link
Contributor

kvargha commented May 23, 2025

In 79b224b, as discussed in the May 19th contributor sync, I've replaced uses of VeniceProperties.empty() with veniceConfigLoader.getCombinedProperties() in HelixParticipationService to ensure configs are properly propagated.

I've pushed a commit with this change only, because I encountered an architecture constraint that I would like to discuss to further my understanding before continuing to implement my solution:

  1. I considered passing veniceServiceConfig directly as a param since it has a getRefreshAttemptsForZkReconnect() method.
  2. However, veniceServiceConfig is not available as a dependency in the da-vinci-client.main module (as shown in the screenshot), which HelixParticipationService is part of.

image

  1. Rather than passing just the raw int (which would defeat the purpose of avoiding hardcoded values), using VeniceProperties props as a param and calling props.getInt(REFRESH_ATTEMPTS_FOR_ZK_RECONNECT, 9) inside the constructor does indeed seem to provide the better implementation in this case.

I wanted to share this issue I've encountered in other to:

  • Ensure my reasoning/understanding were correct
  • Better understand why veniceServiceConfig isn't available as a dependency, but values taken from it are sometimes passed in as params, such as with veniceOfflinePushMonitorAccessor in HelixParticipationService:
    veniceOfflinePushMonitorAccessor = new VeniceOfflinePushMonitorAccessor(
        clusterName,
        zkClient,
        new HelixAdapterSerializer(),
        veniceServerConfig.getRegionName(),
        veniceConfigLoader.getCombinedProperties());

So you don't need to pass the raw props object to every constructor if you don't have access to it. HelixVeniceClusterResources has access to VeniceControllerClusterConfig which has access to VeniceProperties. You can create a getter for the config you would like to use. Please see the examples inside VeniceControllerClusterConfig on how to do that.

@gabrieldrouin gabrieldrouin force-pushed the exp-backoff-other-helix-utils branch from b4cf68d to 287fcb3 Compare May 23, 2025 23:42
@gabrieldrouin
Copy link
Contributor Author

gabrieldrouin commented May 23, 2025

*I cumulated multiple commits after fixing some bugs, which lead to me accidently removing some commits from my branch. Will be more cautious in the future, sorry for that.

287fcb3 adds exponential backoff to several HelixUtils methods, but the scope might be expanding beyond the original intent.

Key issues encountered:

  1. Many objects in RouterServer are created using raw config values (e.g., config.getClusterName()), which gets passed down through multiple layers. I've added the same with config.getRefreshAttemptsForZkReconnect().

As such, classes like CachedResourceZkStateListener and HelixReadOnlySchemaRepository can't access higher-level config objects due to package boundaries. While we can continue this pattern, it doesn't eliminate the risk of users passing hardcoded values, which was a primary goal.

  1. CachedResourceZkStateListener uses a hardcoded DEFAULT_RETRY_LOAD_ATTEMPTS = 1 class field in its constructor. There are 3 callers that could be refactored to pass config values, but require deeper architectural changes that I plan on working on it another commit.

Additionally, there is a linear retry logic in this class that could be replaced with exponential backoff (either by using the new method from HelixUtils or implementing locally). Added TODO for future implementation.

  1. Added TODO in HelixSchemaAccessor due to a hardcoded refreshAttemptsForZkReconnect value where the caller currently lacks access to config/VeniceProperties object to pass proper configured values. Will seek solution in another commit.

Copy link
Contributor

@kvargha kvargha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another pass. Let's try to scope it down to HelixUtils as much as possible.

Comment on lines 60 to 64
// Sleep a random time(no more than retryLoadIntervalInMs) to avoid thunderstorm issue that all nodes are
// trying to refresh resource at the same time if there is a network issue in that DC.
// TODO: refactor to use exponential backoff like implemented in HelixUtils
long retryLoadIntervalInMs = TimeUnit.SECONDS.toMillis(2);
Utils.sleep((long) (Math.random() * retryLoadIntervalInMs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should modify the sleep internval here. Please read the above comment as to why it's setup like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted in 84a0c2d

Comment on lines -121 to +124
private final IZkChildListener zkStoreRepositoryListener = new IZkChildListener() {
@Override
public void handleChildChange(String path, List<String> children) {
if (!path.equals(clusterStoreRepositoryPath)) {
LOGGER.warn("Notification path mismatch, path={}, expected={}.", path, clusterStoreRepositoryPath);
return;
}
onRepositoryChanged(children);
private final IZkChildListener zkStoreRepositoryListener = (path, children) -> {
if (!path.equals(clusterStoreRepositoryPath)) {
LOGGER.warn("Notification path mismatch, path={}, expected={}.", path, clusterStoreRepositoryPath);
return;
}
onRepositoryChanged(children);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we change this to lambda function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention it in my comment:

In the docs, I found no mentions in the style guide regarding lambda functions, but it was suggested to me as a refactoring by Intellij, and thought I could suggest it as a change. Although, in retrospect, I should've proposed this in another PR as this outside the scope of this current PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a style guide against lambda functions. I just wanted to confirm what prompted this change, and to verify if the functionality is the same.

@@ -28,7 +28,7 @@ public HelixReadOnlyZKSharedSystemStoreRepository(
ZkClient zkClient,
HelixAdapterSerializer compositeSerializer,
String systemStoreClusterName) {
super(zkClient, compositeSerializer, systemStoreClusterName, 0, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What were these 0 values for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For refreshAttemptsForZkReconnect and refreshIntervalForZkReconnectInMs, but weren't used in the HelixReadOnlyStoreRepository's constructor:

public HelixReadOnlyStoreRepository(
      ZkClient zkClient,
      HelixAdapterSerializer compositeSerializer,
      String clusterName,
      int refreshAttemptsForZkReconnect,
      long refreshIntervalForZkReconnectInMs) {
    /**
     * HelixReadOnlyStoreRepository is used in router, server, fast-client, da-vinci and system store.
     * Its centralized locking should NOT be shared with other classes. Create a new instance.
     */
    super(zkClient, clusterName, compositeSerializer, new ClusterLockManager(clusterName));
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I wonder why it was setup like this in the first place.

@@ -23,7 +23,7 @@ public SubscriptionBasedStoreRepository(
ZkClient zkClient,
HelixAdapterSerializer compositeSerializer,
String clusterName) {
super(zkClient, compositeSerializer, clusterName, 0, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason

1000,
LogContext.EMPTY);
LogContext.EMPTY,
VeniceProperties.empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're creating a VeniceProperties on line 98. We can pass that in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 84a0c2d

1000,
cluster);
cluster,
9);
Copy link
Contributor Author

@gabrieldrouin gabrieldrouin May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-hard-coded 3 and wouldn't bother with passing in a value from a config, since in any case, this method is deprecated.

@gabrieldrouin
Copy link
Contributor Author

gabrieldrouin commented May 27, 2025

To summarize the changes made in this PR (as of efa526a):

  • HelixUtils methods getChildren, handleFailedHelixOperation and connectHelixManager now use exponential backoff retry logic from handleFailedHelixOperation
  • Standardized default REFRESH_ATTEMPTS_FOR_ZK_RECONNECT from 3 to 9
  • refreshIntervalForZkReconnectInMs param was removed from getChildren because of exponential backoff (which affected HelixSchemaAccessor). Callers will be able to use a flag for setting either linear or exponential backoff in a future PR.
  • Hard-coded refreshAttemptsForZkReconnect and refreshIntervalForZkReconnectInMs values in VeniceOfflinePushMonitorAccessor were removed and now use props.getInt(..., 9); whenever possible (which affected HelixVeniceClusterResources).
  • Standardized var names in CachedRessourceZkStateListener
  • Removed unused params in HelixReadOnlyStoreRepository and HelixReadOnlySchemaRepository
  • Updated tests to match the new implementations

I believe to have scoped down to HelixUtils as much as possible, and kept hard-coded params whenever the callers couldn't pass in config/props, replacing my previous usage of VeniceProperties.empty().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants