[dvc][server] Add exponential backoff to other HelixUtils methods with retries #1784

gabrieldrouin · 2025-05-11T20:32:46Z

Problem Statement

PR 1734 updated the HelixUtils class to add exponential backoff through a new handleFailedHelixOperation method, improving resiliency against temporary ZK connection issues. The previous implementation retried immediately instead of waiting exponentially longer between retries.

This PR extends the new exponential backoff implementation to the other HelixUtils methods that have retry logic, namely:

getChildren
connectHelixManager
checkClusterSetup

Additionally, these 3 methods had inconsistent implementation patterns for retry logic, which have been refactored to more closely match the implementation in PR 1734.

Solution

The 3 methods now use exponential backoff retry logic through the handleFailedHelixOperation method.

Code changes

handleFailedHelixOperation now handles the case where the caller doesn't need to specify a path to be added to the logger (for connectHelixManager and checkClusterSetup).
retryInterval params are removed, as the interval is handled by handleFailedHelixOperation.
Refactored the implementation of loops and conditionals for retry logic to more closely match the new implementation in PR 1734.
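
For readers who haven't seen PR 1734, the general shape of an exponential backoff helper can be sketched as below. This is only an illustration: the class name, the 1 second base delay, the 60 second cap and the use of IllegalStateException are all assumptions made for the sketch, not the actual HelixUtils code (which, as noted later in this thread, throws ZkDataAccessException).

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of an exponential backoff helper; not the real HelixUtils code. */
final class BackoffSketch {
  private BackoffSketch() {}

  /** Delay before the next try: baseDelayMs * 2^(attempt - 1), capped at maxDelayMs. attempt is 1-based. */
  static long backoffDelayMs(int attempt, long baseDelayMs, long maxDelayMs) {
    if (attempt < 1) {
      throw new IllegalArgumentException("attempt must be >= 1");
    }
    // Cap the shift so the product stays far below Long.MAX_VALUE for reasonable base delays.
    int shift = Math.min(attempt - 1, 30);
    return Math.min(baseDelayMs * (1L << shift), maxDelayMs);
  }

  /** Logs the failure and sleeps, or throws once the attempts are exhausted. */
  static void handleFailedOperation(String operation, int attempt, int retryCount) {
    if (attempt >= retryCount) {
      // Stand-in exception type, chosen only for this sketch.
      throw new IllegalStateException(operation + " failed after " + attempt + " attempts");
    }
    long delayMs = backoffDelayMs(attempt, 1000L, 60_000L); // assumed base and cap
    System.err.printf("%s failed on attempt %d/%d; retrying in %d ms%n", operation, attempt, retryCount, delayMs);
    try {
      TimeUnit.MILLISECONDS.sleep(delayMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
      throw new IllegalStateException("Interrupted while waiting to retry " + operation, e);
    }
  }
}
```

With these assumed values the waits would be 1 s, 2 s, 4 s and so on, up to the cap.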

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.

gabrieldrouin · 2025-05-11T20:35:14Z

The initial commit f780377 adds exponential backoff to 3 methods using handleFailedHelixOperation. I've identified three key issues to address before merging:

Inconsistent exception handling:

handleFailedHelixOperation throws ZkDataAccessException when max retries are reached
The 3 modified methods handle max retries themselves by throwing a VeniceException

For example:

public static void checkClusterSetup(HelixAdmin admin, String cluster, int retryCount) {
  int attempt = 0;
  while (attempt < retryCount) {
    if (admin.getClusters().contains(cluster)) {
      break;
    } else {
      attempt++;
      // This condition ensures ZkDataAccessException is never thrown in handleFailedHelixOperation
      if (attempt < retryCount) {
        handleFailedHelixOperation("", "checkClusterSetup", attempt, retryCount);
      } else {
        throw new VeniceException("Cluster has not been initialized by controller after attempted: " + attempt);
      }
    }
  }
}

This could be improved, for example, by:

Modifying handleFailedHelixOperation to support different exception types depending on the caller
Moving all exception throwing outside handleFailedHelixOperation such that any possible case can be handled (ZK-related or not)
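
To make the two options more concrete, here is a rough sketch in which the caller supplies the exception to throw once retries run out. Everything in it (the class, the method name, the injectable sleeper, the doubling delay) is hypothetical and only meant to illustrate the idea, not the code in this PR.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntFunction;
import java.util.function.LongConsumer;

/** Hypothetical sketch: retry loop where the caller decides which exception signals exhaustion. */
final class RetrySketch {
  private RetrySketch() {}

  /**
   * Runs check up to retryCount times. After each failed attempt except the last, waits via
   * sleeper with a delay that doubles every time. Once all attempts have failed (or if
   * retryCount is not positive), throws the exception built by onExhausted.
   */
  static void retryUntilTrue(
      BooleanSupplier check,
      int retryCount,
      long baseDelayMs,
      LongConsumer sleeper,
      IntFunction<RuntimeException> onExhausted) {
    for (int attempt = 1; attempt <= retryCount; attempt++) {
      if (check.getAsBoolean()) {
        return;
      }
      if (attempt < retryCount) {
        // Cap the shift so very large retry counts cannot overflow the delay.
        int shift = Math.min(attempt - 1, 20);
        sleeper.accept(baseDelayMs << shift);
      }
    }
    throw onExhausted.apply(retryCount);
  }
}
```

Under this shape, a caller such as checkClusterSetup could pass something like attempts -> new VeniceException(...), while ZK-specific callers could keep ZkDataAccessException. Injecting the sleeper also lets tests run without real waits.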

Context-specific logging:

getChildren logs specific information about expected vs. actual elements
This context isn't captured in the current handleFailedHelixOperation function signature

The current suggested implementation logs this info before calling handleFailedHelixOperation, like so:

      if (children.size() != expectedCount) {
        // Data is inconsistent
        attempt++;
        LOGGER.info("Expected number of elements: {}, but got {}.", expectedCount, children.size());
        handleFailedHelixOperation(path, "getChildren", attempt, retryCount);
      } else {
        return children;
      }

This seems cleaner than overloading the method with additional context parameters, which would complicate the logging logic (most notably, by preventing direct use of the logger's formatting). For example:

String logMessage = String.format(
    "%s failed with path %s on attempt %d/%d.",
    helixOperation,
    path,
    attempt,
    retryCount);

if (!additionalContext.isEmpty()) {
  logMessage += " " + additionalContext;
}

logMessage += String.format(" Will retry in %d seconds.", retryIntervalSec);

LOGGER.error(logMessage);

Parameter removal and API changes:

The modified methods (connectHelixManager, checkClusterSetup) no longer accept retryInterval parameters
This is a breaking change that might create backward compatibility issues
There may be call sites passing this param that now contain dead code
As such, all call sites should be audited in a future commit

internal/venice-common/src/main/java/com/linkedin/venice/utils/HelixUtils.java

internal/venice-common/src/main/java/com/linkedin/venice/helix/HelixSchemaAccessor.java

gabrieldrouin · 2025-05-21T11:18:59Z

In 79b224b, as discussed in the May 19th contributor sync, I've replaced uses of VeniceProperties.empty() with veniceConfigLoader.getCombinedProperties() in HelixParticipationService to ensure configs are properly propagated.

I've pushed a commit with this change only, because I encountered an architecture constraint that I would like to discuss to further my understanding before continuing to implement my solution:

I considered passing veniceServiceConfig directly as a param since it has a getRefreshAttemptsForZkReconnect() method.
However, veniceServiceConfig is not available as a dependency in the da-vinci-client.main module (as shown in the screenshot), which HelixParticipationService is part of.

Rather than passing just the raw int (which would defeat the purpose of avoiding hardcoded values), using VeniceProperties props as a param and calling props.getInt(REFRESH_ATTEMPTS_FOR_ZK_RECONNECT, 9) inside the constructor does indeed seem to provide the better implementation in this case.
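
As a rough illustration of the props-based approach described above, the sketch below uses java.util.Properties as a stand-in for VeniceProperties. The class name and the literal key string are assumptions made for the example; only the default of 9 comes from this discussion.

```java
import java.util.Properties;

/** Hypothetical sketch: a component that reads its retry count from properties in its constructor. */
final class PropsBackedRetrySettings {
  // Assumed key string, used here only for illustration.
  static final String REFRESH_ATTEMPTS_FOR_ZK_RECONNECT = "refresh.attempts.for.zk.reconnect";
  static final int DEFAULT_REFRESH_ATTEMPTS = 9;

  private final int refreshAttemptsForZkReconnect;

  PropsBackedRetrySettings(Properties props) {
    // Fall back to the default when the key is absent, mirroring props.getInt(KEY, 9).
    String raw = props.getProperty(REFRESH_ATTEMPTS_FOR_ZK_RECONNECT);
    this.refreshAttemptsForZkReconnect =
        raw == null ? DEFAULT_REFRESH_ATTEMPTS : Integer.parseInt(raw.trim());
  }

  int getRefreshAttemptsForZkReconnect() {
    return refreshAttemptsForZkReconnect;
  }
}
```

The point of this shape is that callers hand over the whole properties object, so there is no int literal at the call site to hardcode.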

I wanted to share this issue I've encountered in order to:

Ensure my reasoning/understanding were correct
Better understand why veniceServiceConfig isn't available as a dependency, but values taken from it are sometimes passed in as params, such as with veniceOfflinePushMonitorAccessor in HelixParticipationService:

    veniceOfflinePushMonitorAccessor = new VeniceOfflinePushMonitorAccessor(
        clusterName,
        zkClient,
        new HelixAdapterSerializer(),
        veniceServerConfig.getRegionName(),
        veniceConfigLoader.getCombinedProperties());

gabrieldrouin · 2025-05-21T11:34:36Z

In my next commit, I plan on addressing HelixReadOnlySchemaRepository and other similar classes which, unlike HelixParticipationService, do not contain a field that returns a config from higher in the hierarchy.

It also has a constructor that currently accepts hardcoded values for refreshAttemptsForZkReconnect and refreshIntervalForZkReconnectInMs that will be removed.

I believe that I will have to further my understanding of how modules interact with each other to determine the best approach for propagating configs to classes that don't have direct access to higher-level VeniceProperties instances at the moment.

kvargha · 2025-05-23T17:25:35Z

So you don't need to pass the raw props object to every constructor if you don't have access to it. HelixVeniceClusterResources has access to VeniceControllerClusterConfig which has access to VeniceProperties. You can create a getter for the config you would like to use. Please see the examples inside VeniceControllerClusterConfig on how to do that.
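
A minimal sketch of the getter approach suggested here, again with java.util.Properties standing in for VeniceProperties. ClusterConfig and SchemaRepository are hypothetical stand-ins, not the real VeniceControllerClusterConfig or repository classes, and the key string is an assumption.

```java
import java.util.Properties;

/** Hypothetical sketch: the config object parses the value once and exposes a typed getter. */
final class ConfigGetterSketch {
  private ConfigGetterSketch() {}

  /** Stand-in for a cluster-level config that owns the raw properties. */
  static final class ClusterConfig {
    private final int refreshAttemptsForZkReconnect;

    ClusterConfig(Properties props) {
      // Assumed key string; 9 is the default mentioned in this thread.
      this.refreshAttemptsForZkReconnect =
          Integer.parseInt(props.getProperty("refresh.attempts.for.zk.reconnect", "9"));
    }

    int getRefreshAttemptsForZkReconnect() {
      return refreshAttemptsForZkReconnect;
    }
  }

  /** Stand-in for a downstream class that has no access to the raw properties. */
  static final class SchemaRepository {
    private final int refreshAttempts;

    SchemaRepository(int refreshAttempts) {
      this.refreshAttempts = refreshAttempts;
    }

    int getRefreshAttempts() {
      return refreshAttempts;
    }
  }

  /** The class holding the config wires the parsed value into the downstream component. */
  static SchemaRepository wire(ClusterConfig config) {
    return new SchemaRepository(config.getRefreshAttemptsForZkReconnect());
  }
}
```

The downstream class still receives an int, but the value originates from a single getter on the config rather than from a literal at the call site.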

gabrieldrouin · 2025-05-23T23:49:43Z

*I accumulated multiple commits after fixing some bugs, which led to me accidentally removing some commits from my branch. Will be more cautious in the future, sorry for that.

287fcb3 adds exponential backoff to several HelixUtils methods, but the scope might be expanding beyond the original intent.

Key issues encountered:

Many objects in RouterServer are created using raw config values (e.g., config.getClusterName()), which get passed down through multiple layers. I've done the same with config.getRefreshAttemptsForZkReconnect().

As such, classes like CachedResourceZkStateListener and HelixReadOnlySchemaRepository can't access higher-level config objects due to package boundaries. While we can continue this pattern, it doesn't eliminate the risk of users passing hardcoded values, which was a primary goal.

CachedResourceZkStateListener uses a hardcoded DEFAULT_RETRY_LOAD_ATTEMPTS = 1 class field in its constructor. There are 3 callers that could be refactored to pass config values, but they require deeper architectural changes that I plan to work on in another commit.

Additionally, there is a linear retry logic in this class that could be replaced with exponential backoff (either by using the new method from HelixUtils or implementing locally). Added TODO for future implementation.

Added TODO in HelixSchemaAccessor due to a hardcoded refreshAttemptsForZkReconnect value, where the caller currently lacks access to a config/VeniceProperties object from which to pass properly configured values. Will seek a solution in another commit.

kvargha

Did another pass. Let's try to scope it down to HelixUtils as much as possible.

kvargha · 2025-05-27T17:06:32Z

...nal/venice-common/src/main/java/com/linkedin/venice/helix/CachedResourceZkStateListener.java

          // Sleep a random time(no more than retryLoadIntervalInMs) to avoid thunderstorm issue that all nodes are
          // trying to refresh resource at the same time if there is a network issue in that DC.
+          // TODO: refactor to use exponential backoff like implemented in HelixUtils
+          long retryLoadIntervalInMs = TimeUnit.SECONDS.toMillis(2);
          Utils.sleep((long) (Math.random() * retryLoadIntervalInMs));


I don't think we should modify the sleep interval here. Please read the above comment as to why it's set up like this.

Reverted in 84a0c2d
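
For context on the code comment quoted above: sleeping for a random fraction of the interval spreads retries out, so nodes that lost their connection at the same moment don't all refresh at once. A minimal sketch of that idea follows. It uses ThreadLocalRandom rather than the Math.random() call in the original snippet, and the class and method names are made up for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of a randomized (jittered) retry delay. */
final class JitterSketch {
  private JitterSketch() {}

  /** Returns a uniformly random delay in [0, maxDelayMs), or 0 when maxDelayMs is not positive. */
  static long randomDelayMs(long maxDelayMs) {
    if (maxDelayMs <= 0) {
      return 0L;
    }
    // nextLong(bound) requires a positive bound and excludes the bound itself.
    return ThreadLocalRandom.current().nextLong(maxDelayMs);
  }
}
```

With a 2 second maximum, each node would wait somewhere between 0 and just under 2000 ms before retrying.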

kvargha · 2025-05-27T17:08:02Z

...rnal/venice-common/src/main/java/com/linkedin/venice/helix/HelixReadOnlyStoreRepository.java

-  private final IZkChildListener zkStoreRepositoryListener = new IZkChildListener() {
-    @Override
-    public void handleChildChange(String path, List<String> children) {
-      if (!path.equals(clusterStoreRepositoryPath)) {
-        LOGGER.warn("Notification path mismatch, path={}, expected={}.", path, clusterStoreRepositoryPath);
-        return;
-      }
-      onRepositoryChanged(children);
+  private final IZkChildListener zkStoreRepositoryListener = (path, children) -> {
+    if (!path.equals(clusterStoreRepositoryPath)) {
+      LOGGER.warn("Notification path mismatch, path={}, expected={}.", path, clusterStoreRepositoryPath);
+      return;
    }
+    onRepositoryChanged(children);


Why did we change this to lambda function?

Forgot to mention it in my comment:

In the docs, I found no mention of lambda functions in the style guide, but IntelliJ suggested it to me as a refactoring, and I thought I could propose it as a change. In retrospect, though, I should have proposed this in another PR, as it is outside the scope of this one.

We don't have a style guide against lambda functions. I just wanted to confirm what prompted this change, and to verify if the functionality is the same.
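
On the question of whether the functionality is the same, the equivalence can be sketched with a stand-in interface. ChildListener, EXPECTED_PATH and the sink list below are hypothetical and only mirror the shape of the listener in the diff.

```java
import java.util.List;

/** Hypothetical sketch comparing an anonymous class with the equivalent lambda. */
final class ListenerSketch {
  private ListenerSketch() {}

  /** Stand-in for a listener interface with a single abstract method. */
  @FunctionalInterface
  interface ChildListener {
    void handleChildChange(String path, List<String> children);
  }

  static final String EXPECTED_PATH = "/cluster/stores";

  /** Anonymous-class form, as in the removed lines of the diff. */
  static ChildListener anonymous(List<String> sink) {
    return new ChildListener() {
      @Override
      public void handleChildChange(String path, List<String> children) {
        if (!path.equals(EXPECTED_PATH)) {
          return; // ignore notifications for other paths
        }
        sink.addAll(children);
      }
    };
  }

  /** Lambda form, as in the added lines of the diff. */
  static ChildListener lambda(List<String> sink) {
    return (path, children) -> {
      if (!path.equals(EXPECTED_PATH)) {
        return; // ignore notifications for other paths
      }
      sink.addAll(children);
    };
  }
}
```

For a body like this, the two forms behave the same. The main difference to keep in mind is that inside a lambda, this refers to the enclosing instance rather than to the listener object, which only matters if the body uses this.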

kvargha · 2025-05-27T17:09:46Z

...mmon/src/main/java/com/linkedin/venice/helix/HelixReadOnlyZKSharedSystemStoreRepository.java

@@ -28,7 +28,7 @@ public HelixReadOnlyZKSharedSystemStoreRepository(
      ZkClient zkClient,
      HelixAdapterSerializer compositeSerializer,
      String systemStoreClusterName) {
-    super(zkClient, compositeSerializer, systemStoreClusterName, 0, 0);


What were these 0 values for?

For refreshAttemptsForZkReconnect and refreshIntervalForZkReconnectInMs, but weren't used in the HelixReadOnlyStoreRepository's constructor:

public HelixReadOnlyStoreRepository(
    ZkClient zkClient,
    HelixAdapterSerializer compositeSerializer,
    String clusterName,
    int refreshAttemptsForZkReconnect,
    long refreshIntervalForZkReconnectInMs) {
  /**
   * HelixReadOnlyStoreRepository is used in router, server, fast-client, da-vinci and system store.
   * Its centralized locking should NOT be shared with other classes. Create a new instance.
   */
  super(zkClient, clusterName, compositeSerializer, new ClusterLockManager(clusterName));
}

Interesting. I wonder why it was set up like this in the first place.

kvargha · 2025-05-27T17:09:54Z

.../venice-common/src/main/java/com/linkedin/venice/helix/SubscriptionBasedStoreRepository.java

@@ -23,7 +23,7 @@ public SubscriptionBasedStoreRepository(
      ZkClient zkClient,
      HelixAdapterSerializer compositeSerializer,
      String clusterName) {
-    super(zkClient, compositeSerializer, clusterName, 0, 0);


Same reason

internal/venice-common/src/main/java/com/linkedin/venice/utils/HelixUtils.java

kvargha · 2025-05-27T17:14:17Z

...on/src/integrationTest/java/com/linkedin/venice/controller/AbstractTestVeniceHelixAdmin.java

-        1000,
-        LogContext.EMPTY);
+        LogContext.EMPTY,
+        VeniceProperties.empty());


We're creating a VeniceProperties on line 98. We can pass that in here.

Added in 84a0c2d

gabrieldrouin · 2025-05-27T20:35:45Z

internal/venice-test-common/src/main/java/com/linkedin/venice/utils/TestUtils.java

-        1000,
-        cluster);
+        cluster,
+        9);


I've re-hard-coded 3 and wouldn't bother with passing in a value from a config, since in any case, this method is deprecated.

gabrieldrouin · 2025-05-27T21:50:49Z

To summarize the changes made in this PR (as of efa526a):

HelixUtils methods getChildren, checkClusterSetup and connectHelixManager now use exponential backoff retry logic from handleFailedHelixOperation
Standardized default REFRESH_ATTEMPTS_FOR_ZK_RECONNECT from 3 to 9
refreshIntervalForZkReconnectInMs param was removed from getChildren because of exponential backoff (which affected HelixSchemaAccessor). Callers will be able to use a flag for setting either linear or exponential backoff in a future PR.
Hard-coded refreshAttemptsForZkReconnect and refreshIntervalForZkReconnectInMs values in VeniceOfflinePushMonitorAccessor were removed and now use props.getInt(..., 9); whenever possible (which affected HelixVeniceClusterResources).
Standardized var names in CachedResourceZkStateListener
Removed unused params in HelixReadOnlyStoreRepository and HelixReadOnlySchemaRepository
Updated tests to match the new implementations

I believe I have scoped this down to HelixUtils as much as possible, and kept hard-coded params wherever the callers couldn't pass in config/props, replacing my previous usage of VeniceProperties.empty().

init

f780377

kvargha reviewed May 16, 2025

internal/venice-common/src/main/java/com/linkedin/venice/utils/HelixUtils.java

internal/venice-common/src/main/java/com/linkedin/venice/helix/HelixSchemaAccessor.java

wip for standardizing refreshAttemptForZkReconnect

287fcb3

gabrieldrouin force-pushed the exp-backoff-other-helix-utils branch from b4cf68d to 287fcb3 on May 23, 2025 23:42

kvargha reviewed May 27, 2025

gabrieldrouin added 3 commits May 27, 2025 15:29

code review wip

84a0c2d

fix bugs + refactor connectHelixManager

c727af7

revert some usage of VeniceProperties.empty()

5f0db90

gabrieldrouin commented May 27, 2025

gabrieldrouin added 4 commits May 27, 2025 16:37

cleanup

000b7d7

remove other VeniceProperties.empty()

205c0ee

revert refresh attempts val

2ed6743

use props + revert test change

efa526a
