[telemetry] Detect service reachability issues #3426

Open
wants to merge 2 commits into
base: main

muhamadazmy
Contributor

@muhamadazmy muhamadazmy commented Jun 19, 2025

[telemetry] Detect service reachability issues

Summary:
Introduce a counter for the number of "reachability" issues for a service,
which can detect that a service is down or unresponsive.

By visualizing `rate(restate.invoker.deployment_unreachable_errors.total)` coupled with alerts, an operator
can know when a service is facing connectivity problems.
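
As a rough illustration of what charting a rate over this counter gives an operator, here is a minimal sketch that derives a per-second rate from two counter samples and compares it to a threshold. The `Sample` type, the function names, and the threshold are assumptions made for this example. The calculation is a simplification and not the exact algorithm a monitoring backend uses for `rate(...)`.

```rust
/// One scrape of a monotonically increasing counter (hypothetical type).
#[derive(Clone, Copy, Debug)]
struct Sample {
    /// Seconds since some fixed reference point.
    timestamp_secs: f64,
    /// Counter value observed at that time.
    value: f64,
}

/// Per-second increase between two samples. Simplified: no extrapolation,
/// and a decrease is treated as a counter reset.
fn per_second_rate(older: Sample, newer: Sample) -> Option<f64> {
    let elapsed = newer.timestamp_secs - older.timestamp_secs;
    if elapsed <= 0.0 {
        return None;
    }
    // If the counter went down, assume the process restarted and the
    // counter began again from zero.
    let increase = if newer.value >= older.value {
        newer.value - older.value
    } else {
        newer.value
    };
    Some(increase / elapsed)
}

/// Hypothetical alert condition: fire when the rate exceeds a threshold.
fn should_alert(rate: f64, threshold_per_sec: f64) -> bool {
    rate > threshold_per_sec
}

fn main() {
    let older = Sample { timestamp_secs: 0.0, value: 10.0 };
    let newer = Sample { timestamp_secs: 60.0, value: 40.0 };
    let rate = per_second_rate(older, newer).expect("samples are ordered in time");
    println!("unreachable errors per second: {rate}");
    println!("alert: {}", should_alert(rate, 0.1));
}
```

A sustained non-zero rate for one service is the signal of interest here. A single isolated increment is more likely a transient network blip.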


Stack created with Sapling. Best reviewed with ReviewStack.

@muhamadazmy
Contributor Author

image

Contributor

@pcholakov pcholakov left a comment

This is a great observability addition, thank you @muhamadazmy! Left one minor naming comment which you should feel free to ignore.

pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
pub const INVOKER_SERVICE_DOWN_ERRORS: &str = "restate.invoker.service_down_errors.total";
Contributor

Nitpicky naming observation: Wondering if "unavailable" or "unreachable" might not be more accurate than "down" - since we can't tell authoritatively that it's really down, just that it's not available from our point of view.

Contributor Author

Ah very nice. Thank you. Will apply :)

@@ -21,6 +21,8 @@ pub const INVOKER_TASK_DURATION: &str = "restate.invoker.task_duration.seconds";
pub const INVOKER_SERVICE_RESPONSE_TIME: &str = "restate.invoker.service_response_time.seconds";
pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
pub const INVOKER_SERVICE_UNREACHABLE_ERRORS: &str =
"restate.invoker.service_unreachable_errors.total";
Contributor

I guess you mean deployment

@@ -1090,6 +1090,11 @@ where
.remove_invocation_with_epoch(partition, &invocation_id, invocation_epoch)
{
debug_assert_eq!(invocation_epoch, ism.invocation_epoch);

if self.is_service_down_error(&error) {
counter!(INVOKER_SERVICE_UNREACHABLE_ERRORS, "service" => ism.invocation_target.service_name().to_string()).increment(1);
Contributor

I think we need to have both the deployment id and the service name. The risk is that this will be a high-cardinality metric.
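
To make the cardinality concern concrete, here is a minimal sketch of a labeled counter in which every distinct (service, deployment) pair becomes its own time series. `LabeledCounter` is a toy stand-in written for this example and is not the API of the `metrics` crate. The service names and deployment ids are made up.

```rust
use std::collections::HashMap;

/// Toy stand-in for a labeled counter: one time series per distinct
/// (service, deployment) label pair. Not the `metrics` crate API.
#[derive(Default)]
struct LabeledCounter {
    series: HashMap<(String, String), u64>,
}

impl LabeledCounter {
    /// Increment the series identified by this label pair, creating it
    /// on first use.
    fn increment(&mut self, service: &str, deployment: &str) {
        *self
            .series
            .entry((service.to_owned(), deployment.to_owned()))
            .or_insert(0) += 1;
    }

    /// Cardinality: how many separate series the metric currently has.
    fn cardinality(&self) -> usize {
        self.series.len()
    }
}

fn main() {
    let mut counter = LabeledCounter::default();
    // Hypothetical service names and deployment ids.
    counter.increment("greeter", "dp_1");
    counter.increment("greeter", "dp_1");
    counter.increment("greeter", "dp_2");
    counter.increment("cart", "dp_3");
    println!("series: {}", counter.cardinality());
}
```

In this model the number of series grows with every new deployment id that ever reports an error, which is why adding the deployment label raises the cardinality question.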

@muhamadazmy muhamadazmy changed the title [telemetry] Detect service rechability issues [telemetry] Detect service reachability issues Jun 23, 2025
This is used in the next PR to associate the deployment unreachable
failure with the deployment id
Summary:
Introduce a counter for the number of "reachability" issues for a service,
which can detect that a service is down or unresponsive.

By visualizing `rate(restate.invoker.deployment_unreachable_errors.total)` coupled with alerts, an operator
can know when a service is facing connectivity problems.
@muhamadazmy
Contributor Author

@AhmedSoliman Thank you so much for your review. I applied the requested changes, mainly to include the deployment id in the metric. The id was not available on the error itself, which is why I created the change in #3439 to make this possible. It would be great if you could also review it.

While this metric might have high cardinality, I don't believe it will be too crazy, especially once older deployments are completely gone.

@muhamadazmy muhamadazmy requested a review from AhmedSoliman June 23, 2025 08:03
Contributor

@pcholakov pcholakov left a comment

Nice one, thanks @muhamadazmy!

describe_counter!(
INVOKER_DEPLOYMENT_UNREACHABLE_ERRORS,
Unit::Count,
"Number of deployment down errors"
Contributor

Suggested change
"Number of deployment down errors"
"Number of service deployment unreachable errors"

.deployment_id
.map(|id| Cow::Owned(id.to_string()))
.unwrap_or_else(|| Cow::Borrowed("unknown"));
counter!(INVOKER_DEPLOYMENT_UNREACHABLE_ERRORS, "service" => ism.invocation_target.service_name().to_string(), "deployment" => deployment_id).increment(1);
Contributor

Is this something we can already detect with the new ttfb metric or the invocation task status metric, or does it require a new metric?

Contributor Author

Hmm, good idea. I will check if it can be done as part of the ttfb
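
For readers following the diff in this thread, the deployment label fallback can be sketched in isolation as below. `DeploymentId` and `deployment_label` are hypothetical names introduced for this example and are not the types used in the PR. The point is only the `Cow` pattern: allocate a `String` when an id is present, and borrow a static `"unknown"` when it is not.

```rust
use std::borrow::Cow;
use std::fmt;

/// Hypothetical deployment id type, used only for this sketch.
struct DeploymentId(u64);

impl fmt::Display for DeploymentId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "dp_{}", self.0)
    }
}

/// Build the label value: an owned string when the id is known,
/// a borrowed static string otherwise (no allocation in that case).
fn deployment_label(id: Option<&DeploymentId>) -> Cow<'static, str> {
    id.map(|id| Cow::Owned(id.to_string()))
        .unwrap_or_else(|| Cow::Borrowed("unknown"))
}

fn main() {
    let known = DeploymentId(7);
    println!("{}", deployment_label(Some(&known)));
    println!("{}", deployment_label(None));
}
```

One consequence of this fallback is that all errors without a known deployment id are grouped under a single `"unknown"` series per service.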
