Skip to content

Update pauser to handle exception and run unpause as long as possible #33

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
May 20, 2025
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
92ebb3a
Update pauser to handle exception and run unpause as long as possible
kota2and3kan Apr 25, 2025
07d8325
[WIP] Add several unit tests
kota2and3kan Apr 25, 2025
08312b2
[WIP] Add several unit tests
kota2and3kan Apr 28, 2025
34a03fc
Fix minor things
kota2and3kan Apr 28, 2025
25c10f6
Merge remote-tracking branch 'origin/main' into add-error-handling-wh…
kota2and3kan Apr 30, 2025
4d22a34
Update exception handling based on review feedback
kota2and3kan May 1, 2025
42f1a30
Update pauser logic based on feedback and discussion
kota2and3kan May 2, 2025
1d17af9
Update unit tests
kota2and3kan May 7, 2025
f087077
Update dummy object name variable to static constant
kota2and3kan May 7, 2025
d3b54c0
Add exception handling for getTarget() after pause
kota2and3kan May 8, 2025
55ba29c
Apply suggestions from code review
kota2and3kan May 8, 2025
891beb8
Change variable name from backupOk to isPauseOk
kota2and3kan May 8, 2025
c233242
Update name of target status variable and method
kota2and3kan May 8, 2025
744da27
Return PausedDuration from pauseInternal() method
kota2and3kan May 8, 2025
c413386
Move variable definition in catch block
kota2and3kan May 8, 2025
600e7b1
Make the logic a bit simpler
kota2and3kan May 9, 2025
9e9132d
Update error message generation and definition based on feedback
kota2and3kan May 9, 2025
909e596
Add code comments
kota2and3kan May 9, 2025
f9dc445
Implement buildException() based on feedback
kota2and3kan May 13, 2025
5cca44c
Run targetStatusEquals() only if targetAfterPause is not null
kota2and3kan May 14, 2025
639e132
Apply suggestions from code review
kota2and3kan May 15, 2025
b93c054
Fix code comments based on feedback
kota2and3kan May 16, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions lib/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ dependencies {
testImplementation(platform("org.junit:junit-bom:${junitVersion}"))
testImplementation 'org.junit.jupiter:junit-jupiter'
testImplementation "org.mockito:mockito-core:${mockitoVersion}"
testImplementation "org.mockito:mockito-inline:${mockitoVersion}"
}

test {
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
package com.scalar.admin.kubernetes;

public class PauseFailedException extends PauserException {
public PauseFailedException(String message) {
super(message);
}

public PauseFailedException(String message, Throwable cause) {
super(message, cause);
}
}
166 changes: 134 additions & 32 deletions lib/src/main/java/com/scalar/admin/kubernetes/Pauser.java
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
package com.scalar.admin.kubernetes;

import com.google.common.annotations.VisibleForTesting;
import com.google.common.util.concurrent.Uninterruptibles;
import com.scalar.admin.RequestCoordinator;
import io.kubernetes.client.openapi.Configuration;
Expand All @@ -13,8 +14,6 @@
import java.util.stream.Collectors;
import javax.annotation.Nullable;
import javax.annotation.concurrent.NotThreadSafe;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
* This class implements a pause operation for Scalar product pods in a Kubernetes cluster. The
Expand All @@ -34,10 +33,11 @@
@NotThreadSafe
public class Pauser {

private static final int MAX_UNPAUSE_RETRY_COUNT = 3;
@VisibleForTesting static final int MAX_UNPAUSE_RETRY_COUNT = 3;

private final Logger logger = LoggerFactory.getLogger(Pauser.class);
private final TargetSelector targetSelector;
private Instant startTime;
private Instant endTime;

/**
* @param namespace The namespace where the pods are deployed.
Expand Down Expand Up @@ -77,67 +77,169 @@ public PausedDuration pause(int pauseDuration, @Nullable Long maxPauseWaitTime)
"pauseDuration is required to be greater than 0 millisecond.");
}

TargetSnapshot target;
// Get pods and deployment information before pause.
TargetSnapshot targetBeforePause;
try {
target = targetSelector.select();
targetBeforePause = getTarget();
} catch (Exception e) {
throw new PauserException("Failed to find the target pods to pause.", e);
}

RequestCoordinator coordinator = getRequestCoordinator(target);

coordinator.pause(true, maxPauseWaitTime);

Instant startTime = Instant.now();

Uninterruptibles.sleepUninterruptibly(pauseDuration, TimeUnit.MILLISECONDS);
// Get RequestCoordinator of Scalar Admin to pause.
RequestCoordinator requestCoordinator;
try {
requestCoordinator = getRequestCoordinator(targetBeforePause);
} catch (Exception e) {
throw new PauserException("Failed to initialize the request coordinator.", e);
}

Instant endTime = Instant.now();
// Run pause operation.
PauseFailedException pauseFailedException = null;
try {
pauseInternal(requestCoordinator, pauseDuration, maxPauseWaitTime);
} catch (Exception e) {
pauseFailedException = new PauseFailedException("Pause operation failed.", e);
}

unpauseWithRetry(coordinator, MAX_UNPAUSE_RETRY_COUNT, target);
// Run unpause operation.
UnpauseFailedException unpauseFailedException = null;
try {
unpauseWithRetry(requestCoordinator, MAX_UNPAUSE_RETRY_COUNT);
} catch (Exception e) {
unpauseFailedException = new UnpauseFailedException("Unpause operation failed.", e);
}

// Prepare error messages for each process.
String unpauseErrorMessage =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably be defined as member variables since they don't need to be instantiated every execution.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. That's right. They are fixed values.
I will update it. Thank you!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9e9132d.

String.format(
"Unpause operation failed. Scalar products might still be in a paused state. You"
+ " must restart related pods by using the `kubectl rollout restart deployment"
+ " %s` command to unpause all pods. ",
targetBeforePause.getDeployment().getMetadata().getName());
String pauseErrorMessage =
"Pause operation failed. You cannot use the backup that was taken during this pause"
+ " duration. You need to retry the pause operation from the beginning to"
+ " take a backup. ";
String getTargetAfterPauseErrorMessage =
"Failed to find the target pods to examine if the targets pods were updated during"
+ " paused. ";
String statusCheckErrorMessage =
"Status check failed. You cannot use the backup that was taken during this pause"
+ " duration. You need to retry the pause operation from the beginning to"
+ " take a backup. ";
String statusDifferentErrorMessage =
"The target pods were updated during the pause duration. You cannot use the backup that"
+ " was taken during this pause duration. ";
String unpauseFailedButBackupOkErrorMessage =
String.format(
"Note that the pause operations for taking backup succeeded. You can use a backup that"
+ " was taken during this pause duration: Start Time = %s, End Time = %s. ",
startTime, endTime);

// Get pods and deployment information after pause.
TargetSnapshot targetAfterPause;
PauserException pauserException;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this can be moved to the catch block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. I can move it to the catch block.
I will update it. Thank you for pointing it out!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c413386.

try {
targetAfterPause = targetSelector.select();
targetAfterPause = getTarget();
} catch (Exception e) {
throw new PauserException(
"Failed to find the target pods to examine if the targets pods were updated during"
+ " paused.",
e);
pauserException = new PauserException(getTargetAfterPauseErrorMessage, e);
if (unpauseFailedException == null) {
throw pauserException;
} else {
throw new UnpauseFailedException(unpauseErrorMessage, pauserException);
}
}

if (!target.getStatus().equals(targetAfterPause.getStatus())) {
throw new PauserException("The target pods were updated during paused. Please retry.");
// Check if pods and deployment information are the same between before pause and after pause.
boolean compareTargetSuccessful;
StatusCheckFailedException statusCheckFailedException;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved to the catch block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We can move it to the catch block.
I will do it. Thank you!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c413386.

try {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about executing this block only when targetAfterPause != null? The method seems to return StatusUnmatchedException when the argument after is null, but in this case comparing before and after itself isn't conducted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. That's right. I will fix it. Thank you for the suggestion!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5cca44c.

compareTargetSuccessful = compareTargetStatus(targetBeforePause, targetAfterPause);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For checking if two objects are the same, I think it's common to use xxxEquals.
compareXXX usually returns int to express non-equal cases.

Suggested change
compareTargetSuccessful = compareTargetStatus(targetBeforePause, targetAfterPause);
isTargetStatusEqual = targetStatusEquals(targetBeforePause, targetAfterPause);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I didn't know that.
I will update it. Thank you!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c233242.

} catch (Exception e) {
statusCheckFailedException = new StatusCheckFailedException(statusCheckErrorMessage, e);
if (unpauseFailedException == null) {
throw statusCheckFailedException;
} else {
throw new UnpauseFailedException(unpauseErrorMessage, statusCheckFailedException);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like compareTargetStatus() doesn't throw any exceptions. Do we still need this exception handling?

Maybe we need the exception handling for targetAfterPause here?
https://github.com/scalar-labs/scalar-admin-for-kubernetes/pull/33/files#diff-529e2e818d31d239079b031928623cab070acd5743e317b3d709c34e6e52b394R114-R121

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like compareTargetStatus() doesn't throw any exceptions. Do we still need this exception handling?

For safety, I think it would be better to handle exceptions here. There is a possibility that some unexpected exceptions might occur here, and it could cause the issue keeping unpause state.

Maybe we need the exception handling for targetAfterPause here? https://github.com/scalar-labs/scalar-admin-for-kubernetes/pull/33/files#diff-529e2e818d31d239079b031928623cab070acd5743e317b3d709c34e6e52b394R114-R121

As I mentioned in another comment, you're right. I will add the exception handling there. I overlooked it. Thank you for pointing it out!

However, as mentioned above, I think we can keep this exception handling here. What do you think?


return new PausedDuration(startTime, endTime);
// If both the pause operation and status check succeeded, you can use the backup that was taken
// during the pause duration.
boolean backupOk = (pauseFailedException == null) && compareTargetSuccessful;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
boolean backupOk = (pauseFailedException == null) && compareTargetSuccessful;
boolean isPauseOk = (pauseFailedException == null) && compareTargetSuccessful;

It's minor, but it's common to use isXXX for methods returning a boolean.
Also, it only pauses, and it doesn't mean the actual backup is successful, so it's probably misleading to call backup.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Indeed, it does not mean Backup succeeded. I will update the variable name.
Thank you for your suggestion!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 891beb8.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more readable if we identify the cases here like the following and create error messages for each case.

boolean isAllOK = ...
boolean isPauseOkUnpauseNotOk = ...
boolean isPauseNotOkUnpauseOk = ...
boolean isPauseNotOkUnpauseNotOk = ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Indeed, your suggestion is more readable.

However, in the previous meeting, we (@komamitsu and I) discussed using boolean flags, and the result was Don't use flags (boolean variables) as long as possible to reduce status/variables so that we can make code simpler. So, I used the exceptions in the condition of the if statement.

Honestly, it's a bit difficult for me to decide which is better because I think both (more readable and simpler) make sense to me...

@komamitsu
What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boolean isPauseOkUnpauseNotOk = ...

This may improve the readability. But it'd also introduce many states. So, I think it's a trade-off (or maybe matter of taste?).

The current code uses unpauseFailedException, pauseFailedException and compareTargetSuccessful as more informative alternative to simple boolean flags. If we introduces new flags flags like isPauseNotOkUnpauseOkCompareTargetNotOk (or state enum like enum PauseResult { SUCCESS, PAUSE_SUCCESS_AND_UNPAUSE_FAILED_AND_COMPARE_TARGET_SUCCESS, ... }?), we need to manage 8 (2 ** 3) states and it would make the method complex. So, I suggested to separately manage the 3 states as a simple way.

@kota2and3kan I think either is fine with us. I'm not insisting on my suggestion. You can go with any option you feel most confident in and comfortable being responsible for.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I got it.
I will consider which is better.
Thank you for your comment!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the discussion in the meeting!
Based on the result of the discussion, I updated as follows:

  • Make the pause logic a bit simpler in 600e7b1 to reduce many conditions and branches.
  • Add code comments about our current decision in 909e596 for future development.

// Create error message if any of the operations failed.
StringBuilder errorMessageBuilder = new StringBuilder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cleaner and more consistent to create the error message in another method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I agree with your suggestion.
I will update it. Thank you!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9e9132d.

if (unpauseFailedException != null) {
errorMessageBuilder.append(unpauseErrorMessage);
if (backupOk) {
errorMessageBuilder.append(unpauseFailedButBackupOkErrorMessage);
}
}
if (pauseFailedException != null) {
errorMessageBuilder.append(pauseErrorMessage);
}
if (!compareTargetSuccessful) {
errorMessageBuilder.append(statusDifferentErrorMessage);
}
String errorMessage = errorMessageBuilder.toString();

// Return the final result based on each process.
if (unpauseFailedException
!= null) { // Unpause issue is the most critical because it might cause system down.
throw new UnpauseFailedException(errorMessage, unpauseFailedException);
} else if (pauseFailedException
!= null) { // Pause Failed is second priority because pause issue might be caused by
// configuration error.
throw new PauseFailedException(errorMessage, pauseFailedException);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first PauseFailedException instance contains a PauseFailedException instance as a cause. Is it expected? How about making pauseFailedException just Exception not PauseFailedException?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@komamitsu
Ah, yes. You're right. I think we can throw the pauseFailedException as is instead of new PauseFailedException().

I will fix it. I think we can fix this issue on the buildException() side that we are discussing in another comment.

Thank you for pointing it out!

} else if (!compareTargetSuccessful) { // Status check failed is third priority because this
// issue might be caused by temporary issue, for example, pod restarts.
throw new PauseFailedException(errorMessage);
} else { // All operations succeeded.
return new PausedDuration(startTime, endTime);
}
}

private void unpauseWithRetry(
RequestCoordinator coordinator, int maxRetryCount, TargetSnapshot target) {
@VisibleForTesting
void unpauseWithRetry(RequestCoordinator coordinator, int maxRetryCount) {
int retryCounter = 0;

while (true) {
try {
coordinator.unpause();
return;
} catch (Exception e) {
if (++retryCounter >= maxRetryCount) {
logger.warn(
"Failed to unpause Scalar product. They are still in paused. You must restart related"
+ " pods by using the `kubectl rollout restart deployment {}`"
+ " command to unpase all pods.",
target.getDeployment().getMetadata().getName());
return;
throw e;
}
}
}
}

@VisibleForTesting
TargetSnapshot getTarget() throws PauserException {
return targetSelector.select();
}
Comment on lines +202 to +205
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in another comment, I separated this targetSelector.select() as one method getTarget() for testing.


@VisibleForTesting
RequestCoordinator getRequestCoordinator(TargetSnapshot target) {
return new RequestCoordinator(
target.getPods().stream()
.map(p -> new InetSocketAddress(p.getStatus().getPodIP(), target.getAdminPort()))
.collect(Collectors.toList()));
}

@VisibleForTesting
void pauseInternal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better (more modular) to return PauseDuration as a result instead of updating member variables.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It looks good!
I will update it. Thank you for your suggestion!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 744da27.

RequestCoordinator requestCoordinator, int pauseDuration, @Nullable Long maxPauseWaitTime) {

requestCoordinator.pause(true, maxPauseWaitTime);
startTime = Instant.now();
Uninterruptibles.sleepUninterruptibly(pauseDuration, TimeUnit.MILLISECONDS);
endTime = Instant.now();
}

@VisibleForTesting
boolean compareTargetStatus(TargetSnapshot before, TargetSnapshot after) {
return before.getStatus().equals(after.getStatus());
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
package com.scalar.admin.kubernetes;

public class StatusCheckFailedException extends PauserException {
public StatusCheckFailedException(String message) {
super(message);
}

public StatusCheckFailedException(String message, Throwable cause) {
super(message, cause);
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
package com.scalar.admin.kubernetes;

public class UnpauseFailedException extends PauserException {
public UnpauseFailedException(String message) {
super(message);
}

public UnpauseFailedException(String message, Throwable cause) {
super(message, cause);
}
}
Loading