Skip to content

[controller] Cleanup acls from store deletion if admin consumption times out #1804

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

misyel
Copy link
Contributor

@misyel misyel commented May 15, 2025

Problem Statement

After we send the delete store admin message, it is possible that we timeout while waiting for the message to be consumed. When this happens, we never cleanup the acls and have leaked acls

Solution

  1. Add in a new VeniceException type, AdminMessageConsumptionTimeoutException, to identify when we timeout from waiting for an admin message to be consumed
  2. Catch the new AdminMessageConsumptionTimeoutException and delete the acls for the store
  3. Catch all other Exceptions and log it so they do not fail silently

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@misyel misyel requested a review from huangminchn May 15, 2025 21:15
timeoutException);
deleteAclsForStore(store, storeName);
} catch (Exception e) {
LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the LHS code propagate the Exception to the caller, but RHS does not, though it logs it which is a good thing.
Do we want the caller not receive the exception??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you referring to, by LHS and RHS? I don't see any equals sign here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In both catch exception branches, we should throw the exception and bubble it up to the callers too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch - updated the change to throw the exception so it gets bubbled up to the callers too

LOGGER.warn("Store object for {} is missing! Skipping acl deletion!", storeName);
}
deleteAclsForStore(store, storeName);
} catch (AdminMessageConsumptionTimeoutException timeoutException) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptions are being passed to the caller, so why do we need AdminMessageConsumptionTimeoutException? Exceptions are being set in the vencieResponse, it is not silent... it is just not logging in the controller logs, but logging to the user who has issued the deleteStore command.
Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need AdminMessageConsumptionTimeoutException to delete the acls if we timeout waiting for the admin message to be consumed. This is needed because once we send the delete store admin message, it will be processed in the child regions eventually even if the parent times out waiting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. But another point to think about - should we add a try-catch around only deleteAclsForStore ? This will make it clear to the reader that AdminMessageConsumptionTimeoutException can be thrown only by deleteAclsForStore? If we do that, we do not have to catch and call deleteAclsForStore again. We just catch it and log it. and then have a common deleteAclsForStore do the acl deletion. It will work if we do not propogate the AdminMessageConsumptionTimeoutException in the catch block.
It will also add nested try blocks, which some find not neat. This is just an idea - not an ask for change. cc @huangminchn wdyt

Copy link
Contributor

@arjun4084346 arjun4084346 May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

      try {
        LOGGER.info("Deleting store: {} from cluster: {}", storeName, clusterName);
        try {
          store = getVeniceHelixAdmin().checkPreConditionForDeletion(clusterName, storeName);
        } catch (VeniceNoStoreException e) {
          // It's possible for a store to partially exist due to partial delete/creation failures.
          LOGGER.warn("Store object is missing for store: {} will proceed with the rest of store deletion", storeName);
        }
        ...
        sendAdminMessageAndWaitForConsumed(clusterName, storeName, message);
        ...
        try {
          sendAdminMessageAndWaitForConsumed(clusterName, storeName, message);
        } catch (AdminMessageConsumptionTimeoutException timeoutException) {
          LOGGER.warn("log the error and continue deleting the acl if waiting for admin message consumption was the only problem");
        }
        deleteAclsForStore(store, storeName);
      } catch (Exception e) {
        LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e);
      }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized there is already a try inside the outer try, so adding one more will not make it worse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add the try catch around sendAdminMessageAndWaitForConsumed, we will not propagate the exception up to the caller. Is this expected behavior?

storeName,
clusterName,
timeoutException);
deleteAclsForStore(store, storeName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so basically you want to try deleteAclsForStore one more time?? There does not seem to be much use of it.
Should you move deleteAclsForStore to finally block instead??

Copy link
Contributor

@arjun4084346 arjun4084346 May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will we decide that acls must be cleanup?
If the admin message is not consumed, should we still clean the ACLs ??
The LHS code only clean up the ACL when admin message has been consumed, was it not intentional? In other words,
"After we send the delete store admin message, it is possible that we timeout while waiting for the message to be consumed. When this happens, we never cleanup the acls and have leaked acls" BUT should we do the cleanup? should we move the cleanup to child admins?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-finally sounds to me like the best way to ensure that acls will always get cleaned up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we end up in the timeout exception branch, the acl deletion in the happy path wouldn't happen, so we are not deleting the ACL again; but in theory, this operation should be idempotent too.

Let's add more context to help understand the current solution and see if it makes senses:

  1. Once the admin command is persisted into the admin topic, it will be processed eventually in all regions, including parent and children; so once we confirm the delete-store message is persisted, it is safe to go ahead and delete the ACL.
  2. The current message waiting logic doesn't consider children regions, so if we want to only delete ACL when everything is cleanup, parent controllers need to monitor the progress in child regions too, which is not the current case and probably not necessary based on point 1 above
  3. There could be other errors inside "sendAdminMessageAndWaitForConsumed", like failing to produce to admin topic; in that case, we shouldn't try to delete ACL, but the store would still be around; however, "AdminMessageConsumptionTimeoutException" would only happen at the last line of "sendAdminMessageAndWaitForConsumed", at that time, the admin message is persisted.
  4. Store deletion is a time consuming process; we have timeout set to 150s in prod, while it's only 30s in staging environment; as a result, almost every store deletion in staging ends up as a leak. We will also match EI's timeout to prod, which should stop most of the leaking cases. The fix in the PR is a complement to the remaining cases.

As shared in point 3 above, thus we can't do the deletion in finally, because it's possible that the admin message is not persisted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In point 3, it is still possible for the store to not be deleted and we delete the acls if the admin message gets skipped, but that should be a very rare case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's why we will want to add monitoring later.

storeName,
clusterName,
timeoutException);
deleteAclsForStore(store, storeName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-finally sounds to me like the best way to ensure that acls will always get cleaned up.

timeoutException);
deleteAclsForStore(store, storeName);
} catch (Exception e) {
LOGGER.info("Caught an exception when deleting store {} in cluster {}", storeName, clusterName, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you referring to, by LHS and RHS? I don't see any equals sign here?

Copy link
Contributor

@huangminchn huangminchn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot Michelle! Let's have another review from @arjun4084346 and see if there are other concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants