[telemetry] Detect service reachability issues #3426

muhamadazmy · 2025-06-19T14:07:10Z

[telemetry] Detect service reachability issues

Summary:
Introduce a counter for the number of "reachability" issues for a service,
which can detect that a service is down or unresponsive.

By visualizing rate(restate.invoker.deployment_unreachable_errors.total), coupled with alerts, an operator
can know when a service is facing connectivity problems.
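
As a rough illustration of what a `rate(...)` over this counter reports, here is a minimal sketch that computes the average per-second increase between the first and last sample of a counter. The `Sample` type and `simple_rate` function are hypothetical names made up for this example. Real query engines typically also handle counter resets and window extrapolation, which this sketch ignores.

```rust
/// One scrape of a monotonically increasing counter (hypothetical type).
#[derive(Clone, Copy, Debug)]
pub struct Sample {
    /// Scrape time in seconds.
    pub timestamp_secs: f64,
    /// Counter value observed at that time.
    pub value: f64,
}

/// Average per-second increase between the first and last sample.
///
/// Returns `None` when there are fewer than two samples or when no time
/// has elapsed between them. Counter resets are NOT handled here.
pub fn simple_rate(samples: &[Sample]) -> Option<f64> {
    let first = samples.first()?;
    let last = samples.last()?;
    let elapsed = last.timestamp_secs - first.timestamp_secs;
    if elapsed <= 0.0 {
        return None;
    }
    Some((last.value - first.value) / elapsed)
}
```

A sustained non-zero result for one service is the kind of signal an alert could be attached to.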

Stack created with Sapling. Best reviewed with ReviewStack.

muhamadazmy · 2025-06-19T14:07:44Z

pcholakov

This is a great observability addition, thank you @muhamadazmy! Left one minor naming comment which you should feel free to ignore.

pcholakov · 2025-06-19T21:38:14Z

crates/invoker-impl/src/metric_definitions.rs

 pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
+pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
+pub const INVOKER_SERVICE_DOWN_ERRORS: &str = "restate.invoker.service_down_errors.total";


Nitpicky naming observation: Wondering if "unavailable" or "unreachable" might not be more accurate than "down" - since we can't tell authoritatively that it's really down, just that it's not available from our point of view.

Ah very nice. Thank you. Will apply :)

AhmedSoliman · 2025-06-20T15:39:23Z

crates/invoker-impl/src/metric_definitions.rs

@@ -21,6 +21,8 @@ pub const INVOKER_TASK_DURATION: &str = "restate.invoker.task_duration.seconds";
 pub const INVOKER_SERVICE_RESPONSE_TIME: &str = "restate.invoker.service_response_time.seconds";
 pub const INVOKER_TASKS_IN_FLIGHT: &str = "restate.invoker.inflight_tasks";
 pub const INVOKER_JOURNAL_REPLAY_TIME: &str = "restate.invoker.journal_replay_time.seconds";
+pub const INVOKER_SERVICE_UNREACHABLE_ERRORS: &str =
+    "restate.invoker.service_unreachable_errors.total";


I guess you mean deployment

AhmedSoliman · 2025-06-20T15:40:08Z

crates/invoker-impl/src/lib.rs

@@ -1090,6 +1090,11 @@ where
            .remove_invocation_with_epoch(partition, &invocation_id, invocation_epoch)
        {
            debug_assert_eq!(invocation_epoch, ism.invocation_epoch);
+
+            if self.is_service_down_error(&error) {
+                counter!(INVOKER_SERVICE_UNREACHABLE_ERRORS, "service" => ism.invocation_target.service_name().to_string()).increment(1);


I think we need to have both the deployment id and the service name. The risk is that this will be a high-cardinality metric.

This is used in the next PR to associate the deployment unreachable failure with the deployment id


muhamadazmy · 2025-06-23T08:03:20Z

@AhmedSoliman Thank you so much for your review. I applied the requested changes, mainly to include the deployment id in the metric. The id was not available on the error itself, which is why I created the change at #3439 to make this possible. It would be great if you could also review it.

While this metric might have high cardinality, I don't believe it will be too crazy, especially once older deployments are completely gone.
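
To make the cardinality concern concrete, here is a minimal sketch of a counter keyed by a (service, deployment) label pair. `LabeledCounter` is a hypothetical stand-in written for this example, not the `metrics` crate or the invoker's actual code. It only shows that every distinct label combination becomes its own time series.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a labeled counter: one time series per
/// distinct (service, deployment) label pair.
#[derive(Default)]
pub struct LabeledCounter {
    series: HashMap<(String, String), u64>,
}

impl LabeledCounter {
    /// Increment the series identified by the given label values.
    pub fn increment(&mut self, service: &str, deployment: &str) {
        *self
            .series
            .entry((service.to_owned(), deployment.to_owned()))
            .or_insert(0) += 1;
    }

    /// Number of distinct series, i.e. the cardinality of the metric.
    pub fn cardinality(&self) -> usize {
        self.series.len()
    }

    /// Current value of one series (0 if it was never incremented).
    pub fn value(&self, service: &str, deployment: &str) -> u64 {
        self.series
            .get(&(service.to_owned(), deployment.to_owned()))
            .copied()
            .unwrap_or(0)
    }
}
```

Under this model the series count grows with the number of (service, deployment) pairs that have ever reported an error, so it stays bounded as long as old deployments stop producing errors.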

pcholakov

Nice one, thanks @muhamadazmy!

pcholakov · 2025-06-23T09:53:13Z

crates/invoker-impl/src/metric_definitions.rs

+    describe_counter!(
+        INVOKER_DEPLOYMENT_UNREACHABLE_ERRORS,
+        Unit::Count,
+        "Number of deployment down errors"


Suggested change

-        "Number of deployment down errors"
+        "Number of service deployment unreachable errors"

AhmedSoliman · 2025-06-23T10:05:04Z

crates/invoker-impl/src/lib.rs

+                    .deployment_id
+                    .map(|id| Cow::Owned(id.to_string()))
+                    .unwrap_or_else(|| Cow::Borrowed("unknown"));
+                counter!(INVOKER_DEPLOYMENT_UNREACHABLE_ERRORS, "service" => ism.invocation_target.service_name().to_string(), "deployment" => deployment_id).increment(1);


Is this something we can already detect with the new TTFB metric or the invocation task status metric, or does it require a new metric?

Hmm, good idea. I will check if it can be done as part of the TTFB metric.
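
For readers following the diff in this thread, the `deployment` label fallback can be sketched in isolation. `DeploymentId` below is a hypothetical stand-in for the real id type, and its `dp_` display format is made up for the example. Only the `Option` to `Cow` conversion mirrors the snippet under review.

```rust
use std::borrow::Cow;
use std::fmt;

/// Hypothetical stand-in for the real deployment id type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DeploymentId(pub u64);

impl fmt::Display for DeploymentId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Illustrative format only; the real id renders differently.
        write!(f, "dp_{}", self.0)
    }
}

/// Build the `deployment` label value.
///
/// A known id is rendered into an owned string; a missing id falls back to
/// the borrowed static string "unknown", which avoids an allocation.
pub fn deployment_label(id: Option<DeploymentId>) -> Cow<'static, str> {
    id.map(|id| Cow::Owned(id.to_string()))
        .unwrap_or_else(|| Cow::Borrowed("unknown"))
}
```

Using `Cow` here means the allocation only happens when an id is actually present.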

This was referenced Jun 19, 2025

Set histogram quantiles to [50, 90, 99, and 100%] #3425

Merged

[telemetry] TTFB (time to first byte) metric for deployments #3424

Open

muhamadazmy requested review from AhmedSoliman and pcholakov June 19, 2025 14:07

pcholakov approved these changes Jun 19, 2025


muhamadazmy force-pushed the pr3426 branch from 22d67c4 to d4be614 Compare June 20, 2025 07:59

AhmedSoliman reviewed Jun 20, 2025


muhamadazmy force-pushed the pr3426 branch from d4be614 to f40d1dc Compare June 23, 2025 07:42

muhamadazmy changed the title ~~[telemetry] Detect service rechability issues~~ [telemetry] Detect service reachability issues Jun 23, 2025

muhamadazmy mentioned this pull request Jun 23, 2025

[invoker] Attach deployment id to the invoker error #3439

Open

[invoker] Attach deployment id to the invoker error

3fbe382

This is used in the next PR to associate the deployment unreachable failure with the deployment id

muhamadazmy force-pushed the pr3426 branch from f40d1dc to c17ddd1 Compare June 23, 2025 07:57

muhamadazmy force-pushed the pr3426 branch from c17ddd1 to f18b19e Compare June 23, 2025 07:58

muhamadazmy requested a review from AhmedSoliman June 23, 2025 08:03

pcholakov approved these changes Jun 23, 2025


AhmedSoliman reviewed Jun 23, 2025

