-
Notifications
You must be signed in to change notification settings - Fork 97
[controller] Cleanup acls from store deletion if admin consumption times out #1804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
timeoutException); | ||
deleteAclsForStore(store, storeName); | ||
} catch (Exception e) { | ||
LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the LHS code propagate the Exception to the caller, but RHS does not, though it logs it which is a good thing.
Do we want the caller not receive the exception??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are you referring to, by LHS and RHS? I don't see any equals sign here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both catch exception branches, we should throw the exception and bubble it up to the callers too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch - updated the change to throw the exception so it gets bubbled up to the callers too
LOGGER.warn("Store object for {} is missing! Skipping acl deletion!", storeName); | ||
} | ||
deleteAclsForStore(store, storeName); | ||
} catch (AdminMessageConsumptionTimeoutException timeoutException) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exceptions are being passed to the caller, so why do we need AdminMessageConsumptionTimeoutException? Exceptions are being set in the vencieResponse, it is not silent... it is just not logging in the controller logs, but logging to the user who has issued the deleteStore command.
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need AdminMessageConsumptionTimeoutException
to delete the acls if we timeout waiting for the admin message to be consumed. This is needed because once we send the delete store admin message, it will be processed in the child regions eventually even if the parent times out waiting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. But another point to think about - should we add a try-catch around only deleteAclsForStore
? This will make it clear to the reader that AdminMessageConsumptionTimeoutException
can be thrown only by deleteAclsForStore? If we do that, we do not have to catch and call deleteAclsForStore
again. We just catch it and log it. and then have a common deleteAclsForStore
do the acl deletion. It will work if we do not propogate the AdminMessageConsumptionTimeoutException in the catch block.
It will also add nested try blocks, which some find not neat. This is just an idea - not an ask for change. cc @huangminchn wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try {
LOGGER.info("Deleting store: {} from cluster: {}", storeName, clusterName);
try {
store = getVeniceHelixAdmin().checkPreConditionForDeletion(clusterName, storeName);
} catch (VeniceNoStoreException e) {
// It's possible for a store to partially exist due to partial delete/creation failures.
LOGGER.warn("Store object is missing for store: {} will proceed with the rest of store deletion", storeName);
}
...
sendAdminMessageAndWaitForConsumed(clusterName, storeName, message);
...
try {
sendAdminMessageAndWaitForConsumed(clusterName, storeName, message);
} catch (AdminMessageConsumptionTimeoutException timeoutException) {
LOGGER.warn("log the error and continue deleting the acl if waiting for admin message consumption was the only problem");
}
deleteAclsForStore(store, storeName);
} catch (Exception e) {
LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized there is already a try inside the outer try, so adding one more will not make it worse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add the try catch around sendAdminMessageAndWaitForConsumed
, we will not propagate the exception up to the caller. Is this expected behavior?
storeName, | ||
clusterName, | ||
timeoutException); | ||
deleteAclsForStore(store, storeName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so basically you want to try deleteAclsForStore one more time?? There does not seem to be much use of it.
Should you move deleteAclsForStore to finally block instead??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will we decide that acls must be cleanup?
If the admin message is not consumed, should we still clean the ACLs ??
The LHS code only clean up the ACL when admin message has been consumed, was it not intentional? In other words,
"After we send the delete store admin message, it is possible that we timeout while waiting for the message to be consumed. When this happens, we never cleanup the acls and have leaked acls" BUT should we do the cleanup? should we move the cleanup to child admins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try-finally
sounds to me like the best way to ensure that acls will always get cleaned up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we end up in the timeout exception branch, the acl deletion in the happy path wouldn't happen, so we are not deleting the ACL again; but in theory, this operation should be idempotent too.
Let's add more context to help understand the current solution and see if it makes senses:
- Once the admin command is persisted into the admin topic, it will be processed eventually in all regions, including parent and children; so once we confirm the delete-store message is persisted, it is safe to go ahead and delete the ACL.
- The current message waiting logic doesn't consider children regions, so if we want to only delete ACL when everything is cleanup, parent controllers need to monitor the progress in child regions too, which is not the current case and probably not necessary based on point 1 above
- There could be other errors inside "sendAdminMessageAndWaitForConsumed", like failing to produce to admin topic; in that case, we shouldn't try to delete ACL, but the store would still be around; however, "AdminMessageConsumptionTimeoutException" would only happen at the last line of "sendAdminMessageAndWaitForConsumed", at that time, the admin message is persisted.
- Store deletion is a time consuming process; we have timeout set to 150s in prod, while it's only 30s in staging environment; as a result, almost every store deletion in staging ends up as a leak. We will also match EI's timeout to prod, which should stop most of the leaking cases. The fix in the PR is a complement to the remaining cases.
As shared in point 3 above, thus we can't do the deletion in finally, because it's possible that the admin message is not persisted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In point 3, it is still possible for the store to not be deleted and we delete the acls if the admin message gets skipped, but that should be a very rare case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that's why we will want to add monitoring later.
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Outdated
Show resolved
Hide resolved
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Show resolved
Hide resolved
storeName, | ||
clusterName, | ||
timeoutException); | ||
deleteAclsForStore(store, storeName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try-finally
sounds to me like the best way to ensure that acls will always get cleaned up.
timeoutException); | ||
deleteAclsForStore(store, storeName); | ||
} catch (Exception e) { | ||
LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are you referring to, by LHS and RHS? I don't see any equals sign here?
...nice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceParentHelixAdmin.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot Michelle! Let's have another review from @arjun4084346 and see if there are other concerns.
Problem Statement
After we send the delete store admin message, it is possible that we timeout while waiting for the message to be consumed. When this happens, we never cleanup the acls and have leaked acls
Solution
AdminMessageConsumptionTimeoutException
, to identify when we timeout from waiting for an admin message to be consumedAdminMessageConsumptionTimeoutException
and delete the acls for the storeCode changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized
,RWLock
) are used where needed.ConcurrentHashMap
,CopyOnWriteArrayList
).How was this PR tested?
Does this PR introduce any user-facing or breaking changes?