[controller] Cleanup acls from store deletion if admin consumption times out #1804

misyel · 2025-05-15T21:15:09Z

Problem Statement

After we send the delete store admin message, it is possible that we timeout while waiting for the message to be consumed. When this happens, we never cleanup the acls and have leaked acls

Solution

Add in a new VeniceException type, AdminMessageConsumptionTimeoutException, to identify when we timeout from waiting for an admin message to be consumed
Catch the new AdminMessageConsumptionTimeoutException and delete the acls for the store
Catch all other Exceptions and log it so they do not fail silently

Code changes

Added new code behind a config. If so list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.

arjun4084346 · 2025-05-16T20:30:31Z

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

+          timeoutException);
+      deleteAclsForStore(store, storeName);
+    } catch (Exception e) {
+      LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e);


So the LHS code propagate the Exception to the caller, but RHS does not, though it logs it which is a good thing.
Do we want the caller not receive the exception??

What are you referring to, by LHS and RHS? I don't see any equals sign here?

In both catch exception branches, we should throw the exception and bubble it up to the callers too.

Thanks for the catch - updated the change to throw the exception so it gets bubbled up to the callers too

arjun4084346 · 2025-05-16T20:34:04Z

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

-        LOGGER.warn("Store object for {} is missing! Skipping acl deletion!", storeName);
-      }
+      deleteAclsForStore(store, storeName);
+    } catch (AdminMessageConsumptionTimeoutException timeoutException) {


Exceptions are being passed to the caller, so why do we need AdminMessageConsumptionTimeoutException? Exceptions are being set in the vencieResponse, it is not silent... it is just not logging in the controller logs, but logging to the user who has issued the deleteStore command.
Am I missing something?

We need AdminMessageConsumptionTimeoutException to delete the acls if we timeout waiting for the admin message to be consumed. This is needed because once we send the delete store admin message, it will be processed in the child regions eventually even if the parent times out waiting

Got it. But another point to think about - should we add a try-catch around only deleteAclsForStore ? This will make it clear to the reader that AdminMessageConsumptionTimeoutException can be thrown only by deleteAclsForStore? If we do that, we do not have to catch and call deleteAclsForStore again. We just catch it and log it. and then have a common deleteAclsForStore do the acl deletion. It will work if we do not propogate the AdminMessageConsumptionTimeoutException in the catch block.
It will also add nested try blocks, which some find not neat. This is just an idea - not an ask for change. cc @huangminchn wdyt

try { LOGGER.info("Deleting store: {} from cluster: {}", storeName, clusterName); try { store = getVeniceHelixAdmin().checkPreConditionForDeletion(clusterName, storeName); } catch (VeniceNoStoreException e) { // It's possible for a store to partially exist due to partial delete/creation failures. LOGGER.warn("Store object is missing for store: {} will proceed with the rest of store deletion", storeName); } ... sendAdminMessageAndWaitForConsumed(clusterName, storeName, message); ... try { sendAdminMessageAndWaitForConsumed(clusterName, storeName, message); } catch (AdminMessageConsumptionTimeoutException timeoutException) { LOGGER.warn("log the error and continue deleting the acl if waiting for admin message consumption was the only problem"); } deleteAclsForStore(store, storeName); } catch (Exception e) { LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e); }

I just realized there is already a try inside the outer try, so adding one more will not make it worse

If we add the try catch around sendAdminMessageAndWaitForConsumed, we will not propagate the exception up to the caller. Is this expected behavior?

arjun4084346 · 2025-05-16T20:34:53Z

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

+          storeName,
+          clusterName,
+          timeoutException);
+      deleteAclsForStore(store, storeName);


so basically you want to try deleteAclsForStore one more time?? There does not seem to be much use of it.
Should you move deleteAclsForStore to finally block instead??

When will we decide that acls must be cleanup?
If the admin message is not consumed, should we still clean the ACLs ??
The LHS code only clean up the ACL when admin message has been consumed, was it not intentional? In other words,
"After we send the delete store admin message, it is possible that we timeout while waiting for the message to be consumed. When this happens, we never cleanup the acls and have leaked acls" BUT should we do the cleanup? should we move the cleanup to child admins?

try-finally sounds to me like the best way to ensure that acls will always get cleaned up.

If we end up in the timeout exception branch, the acl deletion in the happy path wouldn't happen, so we are not deleting the ACL again; but in theory, this operation should be idempotent too.

Let's add more context to help understand the current solution and see if it makes senses:

Once the admin command is persisted into the admin topic, it will be processed eventually in all regions, including parent and children; so once we confirm the delete-store message is persisted, it is safe to go ahead and delete the ACL.

The current message waiting logic doesn't consider children regions, so if we want to only delete ACL when everything is cleanup, parent controllers need to monitor the progress in child regions too, which is not the current case and probably not necessary based on point 1 above

There could be other errors inside "sendAdminMessageAndWaitForConsumed", like failing to produce to admin topic; in that case, we shouldn't try to delete ACL, but the store would still be around; however, "AdminMessageConsumptionTimeoutException" would only happen at the last line of "sendAdminMessageAndWaitForConsumed", at that time, the admin message is persisted.

Store deletion is a time consuming process; we have timeout set to 150s in prod, while it's only 30s in staging environment; as a result, almost every store deletion in staging ends up as a leak. We will also match EI's timeout to prod, which should stop most of the leaking cases. The fix in the PR is a complement to the remaining cases.

As shared in point 3 above, thus we can't do the deletion in finally, because it's possible that the admin message is not persisted.

In point 3, it is still possible for the store to not be deleted and we delete the acls if the admin message gets skipped, but that should be a very rare case

Right, that's why we will want to add monitoring later.

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

KaiSernLim · 2025-05-16T21:18:52Z

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

+          storeName,
+          clusterName,
+          timeoutException);
+      deleteAclsForStore(store, storeName);


try-finally sounds to me like the best way to ensure that acls will always get cleaned up.

KaiSernLim · 2025-05-16T21:20:13Z

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

+          timeoutException);
+      deleteAclsForStore(store, storeName);
+    } catch (Exception e) {
+      LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e);


What are you referring to, by LHS and RHS? I don't see any equals sign here?

...nice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceParentHelixAdmin.java

huangminchn

Thanks a lot Michelle! Let's have another review from @arjun4084346 and see if there are other concerns.

misyel requested a review from huangminchn May 15, 2025 21:15

arjun4084346 reviewed May 16, 2025

View reviewed changes

KaiSernLim reviewed May 16, 2025

View reviewed changes

huangminchn reviewed May 21, 2025

View reviewed changes

...nice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceParentHelixAdmin.java Show resolved Hide resolved

misyel added 5 commits May 21, 2025 11:29

catch timeout exception when deleting store & cleanup acls

b5b3c90

update comment

a785d64

bubble exception up to caller, refactor deleteAclsForStore

17509e4

add expectedExceptions to test

6f7047e

use try catch for test

abd0a48

misyel force-pushed the catch-timeout branch from c4c7920 to abd0a48 Compare May 21, 2025 18:49

huangminchn approved these changes May 21, 2025

View reviewed changes

[controller] Cleanup acls from store deletion if admin consumption times out #1804

Are you sure you want to change the base?

[controller] Cleanup acls from store deletion if admin consumption times out #1804

Uh oh!

Conversation

misyel commented May 15, 2025

Problem Statement

Solution

Code changes

Concurrency-Specific Checks

How was this PR tested?

Does this PR introduce any user-facing or breaking changes?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arjun4084346 May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arjun4084346 May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

huangminchn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

arjun4084346 May 22, 2025 •

edited

Loading

arjun4084346 May 16, 2025 •

edited

Loading