
Conversation

@joaofbantunes
Member

Closes #44, closes #35

This PR improves error handling for polling message production by identifying the stage at which errors happen, then applying different logic depending on that stage.

Before

Before, any error occurring during message production (other than completion errors, recently addressed in #45) would simply be caught and logged, and polling would then proceed normally (with the typical delay between polling attempts).
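
In code, the old loop amounted to something like the following (a minimal sketch, assuming illustrative names such as ProducePendingAsync and WaitForNextPollOrTriggerAsync; not the actual implementation):

```csharp
// Rough sketch of the previous behavior; names are illustrative, not the actual code.
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        try
        {
            // Attempt to produce any pending messages.
            await _producer.ProducePendingAsync(stoppingToken);
        }
        catch (Exception ex)
        {
            // Any error was simply caught and logged...
            _logger.LogError(ex, "Error while producing pending messages");
        }

        // ...then polling proceeded normally: wait for the polling interval,
        // returning early if production is triggered by new messages.
        await WaitForNextPollOrTriggerAsync(stoppingToken);
    }
}
```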

Depending on the service's usage patterns, this could have different, but in any case non-ideal, results:

  • high throughput - new messages available for production at short regular intervals, using the polling trigger optimization for reduced latency
    • in this scenario, the producer would be constantly triggered, effectively never applying the polling interval, which could put additional stress on potentially struggling infrastructure
  • low throughput - new messages available for production occasionally, at larger intervals
    • in this scenario, upon failure, a retry would potentially only occur after the polling interval elapsed, significantly increasing production latency

The first scenario is the most problematic, as it could contribute to further issues in the infrastructure, including making recovery harder. Still, the second scenario is also not ideal, so both are worth taking into account.

With this PR

This PR addresses the aforementioned issues by better identifying the result of the production attempt, including the stage at which potential errors occur, then applying different logic depending on that result:

  • if message production runs without issues, the polling interval/wait for trigger logic is applied as usual
  • if message production fails at the completion stage, then completion keeps being retried with increasing backoff intervals* until it succeeds (implemented in #45, Introduce mechanism to collect and retry completing successfully produced messages)
  • if message production fails at any other stage, then the typical polling interval/wait for trigger logic is replaced with suspending polling for an increasing interval*, which resets upon successful production, as well as when the stage at which the error occurred changes

* Note that the interval increase is capped. With this PR, the intervals are set at 1s, 5s, 30s, 1m and 5m, with the last one being used until reset. This might change in the future, or even be made configurable, but for now it should be good enough.
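
As a rough sketch of the capped backoff described above (the type name matches the test file listed in the review below, but the actual implementation and API may differ):

```csharp
// Illustrative sketch only; the PR's actual ProduceIssueBackoffCalculator may differ.
public sealed class ProduceIssueBackoffCalculator
{
    private static readonly TimeSpan[] Intervals =
    {
        TimeSpan.FromSeconds(1),
        TimeSpan.FromSeconds(5),
        TimeSpan.FromSeconds(30),
        TimeSpan.FromMinutes(1),
        TimeSpan.FromMinutes(5)
    };

    private int _failureCount;

    // Each consecutive failure moves one step up the interval table,
    // capping at the last entry (5m), which is then used until reset.
    public TimeSpan CalculateBackoff()
    {
        var index = Math.Min(_failureCount, Intervals.Length - 1);
        _failureCount++;
        return Intervals[index];
    }

    // Called upon successful production, as well as when the stage
    // at which the error occurred changes.
    public void Reset() => _failureCount = 0;
}
```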

@joaofbantunes joaofbantunes requested a review from Copilot June 29, 2025 15:24

Copilot AI left a comment


Pull Request Overview

This PR improves error handling in the polling message production flow by distinguishing the error stage and applying different retry or backoff strategies accordingly. Key changes include:

  • Enhancements to the polling producer and background service logic with a switch-case based approach for error handling (sketched after this list).
  • Updates to metrics reporting and registration using new metrics types (e.g., PollingProducerMetrics, PollingBackgroundServiceMetrics).
  • Adjustments in test cases to validate the new backoff calculator and updated error handling logic.
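
As a rough sketch of what this result-driven handling could look like (only the ProducePendingResult type is named in the PR; its members and the surrounding code are assumptions):

```csharp
// Hypothetical shape of the handling; member names and surrounding code are
// assumptions based on the PR description, not the actual implementation.
public enum ProducePendingResult
{
    Ok,
    CompleteError,
    FetchError,
    ProduceError
}

// In the polling background service loop:
switch (await _producer.ProducePendingAsync(stoppingToken))
{
    case ProducePendingResult.Ok:
        _backoffCalculator.Reset();
        await WaitForNextPollOrTriggerAsync(stoppingToken);
        break;

    case ProducePendingResult.CompleteError:
        // Completion is retried separately, with its own backoff (see #45).
        await WaitForNextPollOrTriggerAsync(stoppingToken);
        break;

    default:
        // Any other failure suspends polling for an increasing, capped interval.
        await Task.Delay(_backoffCalculator.CalculateBackoff(), stoppingToken);
        break;
}
```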

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated no comments.

Summary per file:

  • tests/Core.Tests/Polling/ProduceIssueBackoffCalculatorTests.cs: Added tests for the new backoff calculation logic.
  • tests/Core.Tests/Polling/PollingProducerTests.cs: Updated tests to use the new PollingProducerMetrics type.
  • tests/Core.Tests/Polling/PollingBackgroundServiceTests.cs: Refactored instantiation to use CreateSut and verified producer invocation timing.
  • src/Core/ServiceCollectionExtensions.cs: Registered updated metrics services for polling and cleanup functionality.
  • src/Core/Polling/PollingProducer.cs: Reworked batch production flow with explicit error handling using an enum-based switch.
  • src/Core/Polling/PollingBackgroundService.cs: Introduced switch-case based handling for the different ProducePendingResult cases, with corresponding backoff logic.
  • src/Core/CompletionRetrier.cs: Changed log level for retry failures from Warning to Error.
  • src/Core/OpenTelemetry/*: Updated metrics reporting classes to reflect new naming and tagging conventions.
  • samples/mysql/* & docs/observability/*: Updated dashboard and documentation to reflect new metrics naming conventions.
Comments suppressed due to low confidence (2)

src/Core/Polling/CompletionRetrier.cs:83

  • The logging level for retry failures was changed from Warning to Error. Confirm that this change is intentional as it may affect alerting behavior in production.
    [LoggerMessage(LogLevel.Error,

src/Core/Polling/PollingProducer.cs:118

  • [nitpick] The explicit use of CancellationToken.None here is intentional to allow batch completion even during shutdown. Ensure this behavior aligns with overall application shutdown requirements.
                await batchContext.CompleteAsync(batchProduceResult.Ok, CancellationToken.None);
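
For context, the surrounding flow presumably looks something like this (the lines around the quoted call are assumptions):

```csharp
// Production itself honors the service's stopping token...
var batchProduceResult = await _producer.ProduceAsync(batchContext.Messages, stoppingToken);

// ...but completing the batch deliberately does not, so that messages already
// produced are still marked as completed during shutdown, avoiding duplicate
// production on restart.
await batchContext.CompleteAsync(batchProduceResult.Ok, CancellationToken.None);
```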

@joaofbantunes joaofbantunes changed the title Improve polling error handling Improve polling producer error handling Jun 29, 2025
@joaofbantunes joaofbantunes merged commit d209c1e into main Jul 7, 2025
1 check passed
@joaofbantunes joaofbantunes deleted the improve-polling-error-handling branch July 7, 2025 12:17