Skip to content

Update pauser to handle exception and run unpause as long as possible #33

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
May 20, 2025
Merged
Changes from 1 commit
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
92ebb3a
Update pauser to handle exception and run unpause as long as possible
kota2and3kan Apr 25, 2025
07d8325
[WIP] Add several unit tests
kota2and3kan Apr 25, 2025
08312b2
[WIP] Add several unit tests
kota2and3kan Apr 28, 2025
34a03fc
Fix minor things
kota2and3kan Apr 28, 2025
25c10f6
Merge remote-tracking branch 'origin/main' into add-error-handling-wh…
kota2and3kan Apr 30, 2025
4d22a34
Update exception handling based on review feedback
kota2and3kan May 1, 2025
42f1a30
Update pauser logic based on feedback and discussion
kota2and3kan May 2, 2025
1d17af9
Update unit tests
kota2and3kan May 7, 2025
f087077
Update dummy object name variable to static constant
kota2and3kan May 7, 2025
d3b54c0
Add exception handling for getTarget() after pause
kota2and3kan May 8, 2025
55ba29c
Apply suggestions from code review
kota2and3kan May 8, 2025
891beb8
Change variable name from backupOk to isPauseOk
kota2and3kan May 8, 2025
c233242
Update name of target status variable and method
kota2and3kan May 8, 2025
744da27
Return PausedDuration from pauseInternal() method
kota2and3kan May 8, 2025
c413386
Move variable definition in catch block
kota2and3kan May 8, 2025
600e7b1
Make the logic a bit simpler
kota2and3kan May 9, 2025
9e9132d
Update error message generation and definition based on feedback
kota2and3kan May 9, 2025
909e596
Add code comments
kota2and3kan May 9, 2025
f9dc445
Implement buildException() based on feedback
kota2and3kan May 13, 2025
5cca44c
Run targetStatusEquals() only if targetAfterPause is not null
kota2and3kan May 14, 2025
639e132
Apply suggestions from code review
kota2and3kan May 15, 2025
b93c054
Fix code comments based on feedback
kota2and3kan May 16, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 38 additions & 12 deletions lib/src/main/java/com/scalar/admin/kubernetes/Pauser.java
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
package com.scalar.admin.kubernetes;

import com.google.common.annotations.VisibleForTesting;
import com.google.common.util.concurrent.Uninterruptibles;
import com.scalar.admin.RequestCoordinator;
import io.kubernetes.client.openapi.Configuration;
Expand Down Expand Up @@ -34,7 +35,7 @@
@NotThreadSafe
public class Pauser {

private static final int MAX_UNPAUSE_RETRY_COUNT = 3;
@VisibleForTesting static final int MAX_UNPAUSE_RETRY_COUNT = 3;

private final Logger logger = LoggerFactory.getLogger(Pauser.class);
private final TargetSelector targetSelector;
Expand Down Expand Up @@ -84,17 +85,30 @@ public PausedDuration pause(int pauseDuration, @Nullable Long maxPauseWaitTime)
throw new PauserException("Failed to find the target pods to pause.", e);
}

RequestCoordinator coordinator = getRequestCoordinator(target);
RequestCoordinator coordinator;
try {
coordinator = getRequestCoordinator(target);
} catch (Exception e) {
throw new PauserException("Failed to initialize the coordinator.", e);
}
Copy link
Collaborator Author

@kota2and3kan kota2and3kan Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added exception handling for the initialization process of RequestCoodinator.


Instant startTime;
Instant endTime;
try {
coordinator.pause(true, maxPauseWaitTime);

coordinator.pause(true, maxPauseWaitTime);
startTime = Instant.now();

Instant startTime = Instant.now();
Uninterruptibles.sleepUninterruptibly(pauseDuration, TimeUnit.MILLISECONDS);

Uninterruptibles.sleepUninterruptibly(pauseDuration, TimeUnit.MILLISECONDS);
endTime = Instant.now();

Instant endTime = Instant.now();
unpauseWithRetry(coordinator, MAX_UNPAUSE_RETRY_COUNT, target);

unpauseWithRetry(coordinator, MAX_UNPAUSE_RETRY_COUNT, target);
} catch (Exception e) {
unpauseWithRetry(coordinator, MAX_UNPAUSE_RETRY_COUNT, target);
throw e;
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main update of this PR.

Previously, there was no exception handling. So, there is a possibility that the Scalar products pod will keep paused status forever. In such a case, the Scalar product does not accept any request. It causes the service down.

To avoid such a case as long as possible, we want to run unpause() if any errors occur in any parts of the process.


TargetSnapshot targetAfterPause;
try {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about executing this block only when targetAfterPause != null? The method seems to return StatusUnmatchedException when the argument after is null, but in this case comparing before and after itself isn't conducted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. That's right. I will fix it. Thank you for the suggestion!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5cca44c.

Expand All @@ -113,8 +127,9 @@ public PausedDuration pause(int pauseDuration, @Nullable Long maxPauseWaitTime)
return new PausedDuration(startTime, endTime);
}

private void unpauseWithRetry(
RequestCoordinator coordinator, int maxRetryCount, TargetSnapshot target) {
@VisibleForTesting
void unpauseWithRetry(RequestCoordinator coordinator, int maxRetryCount, TargetSnapshot target)
throws PauserException {
int retryCounter = 0;

while (true) {
Expand All @@ -123,12 +138,23 @@ private void unpauseWithRetry(
return;
} catch (Exception e) {
if (++retryCounter >= maxRetryCount) {
logger.warn(
// If someone uses this library directly instead of using our CLI, users should handle the
// exception properly. However, this case is a very critical issue. Therefore, we output
// the error message here despite whether the exception is handled or not on the caller
// side.
logger.error(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error messages are same? If so, how about sharing it like this?

        if (++retryCounter >= maxRetryCount) {
          String errorMessage = String.format(
              "Failed to unpause Scalar product. They are still in paused. You must restart"
                  + " related pods by using the `kubectl rollout restart deployment %s` command"
                  + " to unpause all pods.",
              target.getDeployment().getMetadata().getName());
          logger.error(errorMessage, e);
          throw new PauserException(errorMessage, e);
        }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, passing the exception to logger.error() might be helpful for investigation?

"Failed to unpause Scalar product. They are still in paused. You must restart related"
+ " pods by using the `kubectl rollout restart deployment {}`"
+ " command to unpase all pods.",
+ " command to unpause all pods.",
target.getDeployment().getMetadata().getName());
return;
// In our CLI, we catch this exception and output the message as an error on the CLI side.
throw new PauserException(
String.format(
"Failed to unpause Scalar product. They are still in paused. You must restart"
+ " related pods by using the `kubectl rollout restart deployment %s` command"
+ " to unpause all pods.",
target.getDeployment().getMetadata().getName()),
e);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If unpause() failed three times, throw the PauserException. The caller of the pause() method should handle this exception.

I will update the message after I implement the pauser.unpause() method. In other words, I will add a new option (or subcommand) that allows users to run the unpause operation solely, in another PR.

After I implement such a new feature, users can use it to unpause Scalar products instead of using the kubectl rollout restart deployment command.

}
}
}
Expand Down
Loading