Skip to content

Update pauser to handle exception and run unpause as long as possible #33

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
May 20, 2025
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
92ebb3a
Update pauser to handle exception and run unpause as long as possible
kota2and3kan Apr 25, 2025
07d8325
[WIP] Add several unit tests
kota2and3kan Apr 25, 2025
08312b2
[WIP] Add several unit tests
kota2and3kan Apr 28, 2025
34a03fc
Fix minor things
kota2and3kan Apr 28, 2025
25c10f6
Merge remote-tracking branch 'origin/main' into add-error-handling-wh…
kota2and3kan Apr 30, 2025
4d22a34
Update exception handling based on review feedback
kota2and3kan May 1, 2025
42f1a30
Update pauser logic based on feedback and discussion
kota2and3kan May 2, 2025
1d17af9
Update unit tests
kota2and3kan May 7, 2025
f087077
Update dummy object name variable to static constant
kota2and3kan May 7, 2025
d3b54c0
Add exception handling for getTarget() after pause
kota2and3kan May 8, 2025
55ba29c
Apply suggestions from code review
kota2and3kan May 8, 2025
891beb8
Change variable name from backupOk to isPauseOk
kota2and3kan May 8, 2025
c233242
Update name of target status variable and method
kota2and3kan May 8, 2025
744da27
Return PausedDuration from pauseInternal() method
kota2and3kan May 8, 2025
c413386
Move variable definition in catch block
kota2and3kan May 8, 2025
600e7b1
Make the logic a bit simpler
kota2and3kan May 9, 2025
9e9132d
Update error message generation and definition based on feedback
kota2and3kan May 9, 2025
909e596
Add code comments
kota2and3kan May 9, 2025
f9dc445
Implement buildException() based on feedback
kota2and3kan May 13, 2025
5cca44c
Run targetStatusEquals() only if targetAfterPause is not null
kota2and3kan May 14, 2025
639e132
Apply suggestions from code review
kota2and3kan May 15, 2025
b93c054
Fix code comments based on feedback
kota2and3kan May 16, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions lib/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ dependencies {
testImplementation(platform("org.junit:junit-bom:${junitVersion}"))
testImplementation 'org.junit.jupiter:junit-jupiter'
testImplementation "org.mockito:mockito-core:${mockitoVersion}"
testImplementation "org.mockito:mockito-inline:${mockitoVersion}"
}

test {
Expand Down
71 changes: 57 additions & 14 deletions lib/src/main/java/com/scalar/admin/kubernetes/Pauser.java
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
package com.scalar.admin.kubernetes;

import com.google.common.annotations.VisibleForTesting;
import com.google.common.util.concurrent.Uninterruptibles;
import com.scalar.admin.RequestCoordinator;
import io.kubernetes.client.openapi.Configuration;
Expand Down Expand Up @@ -34,7 +35,7 @@
@NotThreadSafe
public class Pauser {

private static final int MAX_UNPAUSE_RETRY_COUNT = 3;
@VisibleForTesting static final int MAX_UNPAUSE_RETRY_COUNT = 3;

private final Logger logger = LoggerFactory.getLogger(Pauser.class);
private final TargetSelector targetSelector;
Expand Down Expand Up @@ -79,26 +80,50 @@ public PausedDuration pause(int pauseDuration, @Nullable Long maxPauseWaitTime)

TargetSnapshot target;
try {
target = targetSelector.select();
target = getTarget();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separated this targetSelector.select() as one method getTarget() for testing. This targetSelector is initialized in the constructor of Pauser. So, controlling the behavior of targetSelector by using Mockito is a bit complex and difficult. That's why I separate this function.

} catch (Exception e) {
throw new PauserException("Failed to find the target pods to pause.", e);
}

RequestCoordinator coordinator = getRequestCoordinator(target);
RequestCoordinator coordinator;
try {
coordinator = getRequestCoordinator(target);
} catch (Exception e) {
throw new PauserException("Failed to initialize the coordinator.", e);
}
Copy link
Collaborator Author

@kota2and3kan kota2and3kan Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added exception handling for the initialization process of RequestCoodinator.


coordinator.pause(true, maxPauseWaitTime);
Instant startTime;
Instant endTime;
try {
coordinator.pause(true, maxPauseWaitTime);

Instant startTime = Instant.now();
startTime = Instant.now();

Uninterruptibles.sleepUninterruptibly(pauseDuration, TimeUnit.MILLISECONDS);
Uninterruptibles.sleepUninterruptibly(pauseDuration, TimeUnit.MILLISECONDS);

Instant endTime = Instant.now();
endTime = Instant.now();

unpauseWithRetry(coordinator, MAX_UNPAUSE_RETRY_COUNT, target);
unpauseWithRetry(coordinator, MAX_UNPAUSE_RETRY_COUNT, target);

} catch (Exception e) {
try {
unpauseWithRetry(coordinator, MAX_UNPAUSE_RETRY_COUNT, target);
} catch (PauserException ex) {
throw new PauserException("unpauseWithRetry() method failed twice.", e);
} catch (Exception ex) {
throw new PauserException(
"unpauseWithRetry() method failed twice due to unexpected exception.", e);
}
throw new PauserException(
"The pause operation failed for some reason. However, the unpause operation succeeded"
+ " afterward. Currently, the scalar products are running with the unpause status."
+ " You should retry the pause operation to ensure proper backup.",
e);
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main update of this PR.

Previously, there was no exception handling. So, if some error occurred, this library didn't try to run unpause(). This might cause the Scalar products to keep the pause state. In other words, it might cause like system down.

To address such issue, I added the exception handling here. In this implementation, if some errors occur, this library tries to run the unpause() operation.


TargetSnapshot targetAfterPause;
try {
targetAfterPause = targetSelector.select();
targetAfterPause = getTarget();
} catch (Exception e) {
throw new PauserException(
"Failed to find the target pods to examine if the targets pods were updated during"
Expand All @@ -113,8 +138,9 @@ public PausedDuration pause(int pauseDuration, @Nullable Long maxPauseWaitTime)
return new PausedDuration(startTime, endTime);
}

private void unpauseWithRetry(
RequestCoordinator coordinator, int maxRetryCount, TargetSnapshot target) {
@VisibleForTesting
void unpauseWithRetry(RequestCoordinator coordinator, int maxRetryCount, TargetSnapshot target)
throws PauserException {
int retryCounter = 0;

while (true) {
Expand All @@ -123,17 +149,34 @@ private void unpauseWithRetry(
return;
} catch (Exception e) {
if (++retryCounter >= maxRetryCount) {
logger.warn(
// Users who directly utilize this library, bypassing our CLI, are responsible for proper
// exception handling. However, this scenario represents a critical issue. Consequently,
// we output the error message here regardless of whether the calling code handles the
// exception.
logger.error(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error messages are same? If so, how about sharing it like this?

        if (++retryCounter >= maxRetryCount) {
          String errorMessage = String.format(
              "Failed to unpause Scalar product. They are still in paused. You must restart"
                  + " related pods by using the `kubectl rollout restart deployment %s` command"
                  + " to unpause all pods.",
              target.getDeployment().getMetadata().getName());
          logger.error(errorMessage, e);
          throw new PauserException(errorMessage, e);
        }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, passing the exception to logger.error() might be helpful for investigation?

"Failed to unpause Scalar product. They are still in paused. You must restart related"
+ " pods by using the `kubectl rollout restart deployment {}`"
+ " command to unpase all pods.",
+ " command to unpause all pods.",
target.getDeployment().getMetadata().getName());
return;
// In our CLI, we catch this exception and output the message as an error on the CLI side.
throw new PauserException(
String.format(
"Failed to unpause Scalar product. They are still in paused. You must restart"
+ " related pods by using the `kubectl rollout restart deployment %s` command"
+ " to unpause all pods.",
target.getDeployment().getMetadata().getName()),
e);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If unpause() failed three times, throw the PauserException. The caller of the pause() method should handle this exception.

I will update the message after I implement the pauser.unpause() method. In other words, I will add a new option (or subcommand) that allows users to run the unpause operation solely, in another PR.

After I implement such a new feature, users can use it to unpause Scalar products instead of using the kubectl rollout restart deployment command.

}
}
}
}

@VisibleForTesting
TargetSnapshot getTarget() throws PauserException {
return targetSelector.select();
}
Comment on lines +202 to +205
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in another comment, I separated this targetSelector.select() as one method getTarget() for testing.


@VisibleForTesting
RequestCoordinator getRequestCoordinator(TargetSnapshot target) {
return new RequestCoordinator(
target.getPods().stream()
Expand Down
Loading
Loading