-
Notifications
You must be signed in to change notification settings - Fork 97
[dvc][server] Add exponential backoff to other HelixUtils methods with retries #1784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[dvc][server] Add exponential backoff to other HelixUtils methods with retries #1784
Conversation
The initial commit f780377 adds exponential backoff to 3 methods using
For example: public static void checkClusterSetup(HelixAdmin admin, String cluster, int retryCount) {
int attempt = 0;
while (attempt < retryCount) {
if (admin.getClusters().contains(cluster)) {
break;
} else {
attempt++;
// This condition insures ZkDataAccessException is never thrown in handleFailedHelixOperation
if (attempt < retryCount) {
handleFailedHelixOperation("", "checkClusterSetup", attempt, retryCount);
} else {
throw new VeniceException("Cluster has not been initialized by controller after attempted: " + attempt);
}
}
}
} This could be improved, for example, by:
The current suggested implementation logs this info before calling if (children.size() != expectedCount) {
// Data is inconsistent
attempt++;
LOGGER.info("Expected number of elements: {}, but got {}.", expectedCount, children.size());
handleFailedHelixOperation(path, "getChildren", attempt, retryCount);
} else {
return children;
} Which seems cleaner than overloading the method with additional context parameters that would complicate the logging logic (most notably, by not being able to directly use the logger's formatting). For example: String logMessage = String.format(
"%s failed with path %s on attempt %d/%d.",
helixOperation,
path,
attempt,
retryCount);
if (!additionalContext.isEmpty()) {
logMessage += " " + additionalContext;
}
logMessage += String.format(" Will retry in %d seconds.", retryIntervalSec);
LOGGER.error(logMessage);
|
internal/venice-common/src/main/java/com/linkedin/venice/utils/HelixUtils.java
Show resolved
Hide resolved
internal/venice-common/src/main/java/com/linkedin/venice/helix/HelixSchemaAccessor.java
Show resolved
Hide resolved
In 79b224b, as discussed in the May 19th contributor sync, I've replaced uses of I've pushed a commit with this change only, because I encountered an architecture constraint that I would like to discuss to further my understanding before continuing to implement my solution:
I wanted to share this issue I've encountered in other to:
veniceOfflinePushMonitorAccessor = new VeniceOfflinePushMonitorAccessor(
clusterName,
zkClient,
new HelixAdapterSerializer(),
veniceServerConfig.getRegionName(),
veniceConfigLoader.getCombinedProperties()); |
In my next commit, I plan on addressing It also has a constructor that currently accepts hardcoded values for I believe that I will have to further my understanding of how modules interact with each other to determine the best approach for propagating configs to classes that don't have direct access to higher-level |
So you don't need to pass the raw props object to every constructor if you don't have access to it. |
b4cf68d
to
287fcb3
Compare
*I cumulated multiple commits after fixing some bugs, which lead to me accidently removing some commits from my branch. Will be more cautious in the future, sorry for that. 287fcb3 adds exponential backoff to several HelixUtils methods, but the scope might be expanding beyond the original intent. Key issues encountered:
As such, classes like
Additionally, there is a linear retry logic in this class that could be replaced with exponential backoff (either by using the new method from
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another pass. Let's try to scope it down to HelixUtils as much as possible.
// Sleep a random time(no more than retryLoadIntervalInMs) to avoid thunderstorm issue that all nodes are | ||
// trying to refresh resource at the same time if there is a network issue in that DC. | ||
// TODO: refactor to use exponential backoff like implemented in HelixUtils | ||
long retryLoadIntervalInMs = TimeUnit.SECONDS.toMillis(2); | ||
Utils.sleep((long) (Math.random() * retryLoadIntervalInMs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should modify the sleep internval here. Please read the above comment as to why it's setup like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted in 84a0c2d
private final IZkChildListener zkStoreRepositoryListener = new IZkChildListener() { | ||
@Override | ||
public void handleChildChange(String path, List<String> children) { | ||
if (!path.equals(clusterStoreRepositoryPath)) { | ||
LOGGER.warn("Notification path mismatch, path={}, expected={}.", path, clusterStoreRepositoryPath); | ||
return; | ||
} | ||
onRepositoryChanged(children); | ||
private final IZkChildListener zkStoreRepositoryListener = (path, children) -> { | ||
if (!path.equals(clusterStoreRepositoryPath)) { | ||
LOGGER.warn("Notification path mismatch, path={}, expected={}.", path, clusterStoreRepositoryPath); | ||
return; | ||
} | ||
onRepositoryChanged(children); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we change this to lambda function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to mention it in my comment:
In the docs, I found no mentions in the style guide regarding lambda functions, but it was suggested to me as a refactoring by Intellij, and thought I could suggest it as a change. Although, in retrospect, I should've proposed this in another PR as this outside the scope of this current PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a style guide against lambda functions. I just wanted to confirm what prompted this change, and to verify if the functionality is the same.
@@ -28,7 +28,7 @@ public HelixReadOnlyZKSharedSystemStoreRepository( | |||
ZkClient zkClient, | |||
HelixAdapterSerializer compositeSerializer, | |||
String systemStoreClusterName) { | |||
super(zkClient, compositeSerializer, systemStoreClusterName, 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What were these 0 values for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For refreshAttemptsForZkReconnect
and refreshIntervalForZkReconnectInMs
, but weren't used in the HelixReadOnlyStoreRepository
's constructor:
public HelixReadOnlyStoreRepository(
ZkClient zkClient,
HelixAdapterSerializer compositeSerializer,
String clusterName,
int refreshAttemptsForZkReconnect,
long refreshIntervalForZkReconnectInMs) {
/**
* HelixReadOnlyStoreRepository is used in router, server, fast-client, da-vinci and system store.
* Its centralized locking should NOT be shared with other classes. Create a new instance.
*/
super(zkClient, clusterName, compositeSerializer, new ClusterLockManager(clusterName));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I wonder why it was setup like this in the first place.
@@ -23,7 +23,7 @@ public SubscriptionBasedStoreRepository( | |||
ZkClient zkClient, | |||
HelixAdapterSerializer compositeSerializer, | |||
String clusterName) { | |||
super(zkClient, compositeSerializer, clusterName, 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason
internal/venice-common/src/main/java/com/linkedin/venice/utils/HelixUtils.java
Show resolved
Hide resolved
1000, | ||
LogContext.EMPTY); | ||
LogContext.EMPTY, | ||
VeniceProperties.empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're creating a VeniceProperties on line 98. We can pass that in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 84a0c2d
1000, | ||
cluster); | ||
cluster, | ||
9); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've re-hard-coded 3
and wouldn't bother with passing in a value from a config, since in any case, this method is deprecated.
To summarize the changes made in this PR (as of efa526a):
I believe to have scoped down to HelixUtils as much as possible, and kept hard-coded params whenever the callers couldn't pass in config/props, replacing my previous usage of |
Problem Statement
The
HelixUtils
class has been updated in PR 1734 in order to add exponential backoff to improve resiliency against temporary ZK connection issues through a newhandleFailedHelixOperation
method, compared with the previous implementation which would make immediate retries instead of waiting exponentially between retries.This PR aims to integrate this new exponential backoff implementation to other methods in
HelixUtils
with retry logic, that is:getChildren
connectHelixManager
checkClusterSetup
Additionally, these 3 methods had inconsistent implementation patterns for retry logic, which have been refactored to more closely match the implementation in PR 1734.
Solution
The 3 methods now use exponential backoff retry logic through the
handleFailedHelixOperation
method.Code changes
handleFailedHelixOperation
now handles the case where the caller doesn't need to specify a path to be added to the logger (forconnectHelixManger
andcheckClusterSetup
).retryInterval
params are removed, as the interval is handled byhandleFailedHelixOperation
.Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized
,RWLock
) are used where needed.ConcurrentHashMap
,CopyOnWriteArrayList
).How was this PR tested?
Does this PR introduce any user-facing or breaking changes?