
Conversation

@joaofbantunes
Member

Closes #44, closes #35

This PR improves error handling for polling message production by identifying the stage at which errors happen, then applying different logic depending on that stage.

Before

Before, any error occurring during message production (other than completion errors, recently addressed in #45) would simply be caught and logged, and polling would then proceed normally (with the typical delay between polling attempts).
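
In code, the old loop amounted to something like the following (a minimal sketch, assuming illustrative names such as ProducePendingAsync and WaitForNextPollOrTriggerAsync; not the actual implementation):

```csharp
// Rough sketch of the previous behavior; names are illustrative, not the actual code.
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        try
        {
            // Attempt to produce any pending messages.
            await _producer.ProducePendingAsync(stoppingToken);
        }
        catch (Exception ex)
        {
            // Any error was simply caught and logged...
            _logger.LogError(ex, "Error while producing pending messages");
        }

        // ...then polling proceeded normally: wait for the polling interval,
        // returning early if production is triggered by new messages.
        await WaitForNextPollOrTriggerAsync(stoppingToken);
    }
}
```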

Depending on the service's usage patterns, this could have different, but in any case non-ideal, results:

  • high throughput - new messages available for production at short regular intervals, using the polling trigger optimization for reduced latency
    • in this scenario, the producer would be constantly triggered, effectively never applying the polling interval, which could put additional stress on potentially struggling infrastructure
  • low throughput - new messages available for production occasionally, at larger intervals
    • in this scenario, upon failure, a retry would potentially only occur after the polling interval elapsed, significantly increasing production latency

The first scenario is the most problematic, as it could contribute to further issues in the infrastructure, including making recovery harder. Still, the second scenario is also not ideal, so both are worth taking into account.

With this PR

This PR addresses the aforementioned issues by better identifying the result of the production attempt, including the stage at which potential errors occur, then applying different logic depending on that result:

  • if message production runs without issues, the polling interval/wait for trigger logic is applied as usual
  • if message production fails at the completion stage, then completion keeps being retried with increasing backoff intervals* until it succeeds (implemented in #45, Introduce mechanism to collect and retry completing successfully produced messages)
  • if message production fails at any other stage, then the typical polling interval/wait for trigger logic is replaced with suspending polling for an increasing interval*, which resets upon successful production, as well as when the stage at which the error occurred changes

* Note that the interval increase is capped. With this PR, the intervals are set at 1s, 5s, 30s, 1m and 5m, with the last one being used until reset. This might change in the future, or even be made configurable, but for now it should be good enough.
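
As a rough sketch of the capped backoff described above (the type name matches the test file listed in the review below, but the actual implementation and API may differ):

```csharp
// Illustrative sketch only; the PR's actual ProduceIssueBackoffCalculator may differ.
public sealed class ProduceIssueBackoffCalculator
{
    private static readonly TimeSpan[] Intervals =
    {
        TimeSpan.FromSeconds(1),
        TimeSpan.FromSeconds(5),
        TimeSpan.FromSeconds(30),
        TimeSpan.FromMinutes(1),
        TimeSpan.FromMinutes(5)
    };

    private int _failureCount;

    // Each consecutive failure moves one step up the interval table,
    // capping at the last entry (5m), which is then used until reset.
    public TimeSpan CalculateBackoff()
    {
        var index = Math.Min(_failureCount, Intervals.Length - 1);
        _failureCount++;
        return Intervals[index];
    }

    // Called upon successful production, as well as when the stage
    // at which the error occurred changes.
    public void Reset() => _failureCount = 0;
}
```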

@joaofbantunes joaofbantunes requested a review from Copilot June 29, 2025 15:24

Copilot AI left a comment


Pull Request Overview

This PR improves error handling in the polling message production flow by distinguishing the error stage and applying different retry or backoff strategies accordingly. Key changes include:

  • Enhancements to the polling producer and background service logic with a switch-case based approach for error handling (sketched after this list).
  • Updates to metrics reporting and registration using new metrics types (e.g., PollingProducerMetrics, PollingBackgroundServiceMetrics).
  • Adjustments in test cases to validate the new backoff calculator and updated error handling logic.
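
As a rough sketch of what this result-driven handling could look like (only the ProducePendingResult type is named in the PR; its members and the surrounding code are assumptions):

```csharp
// Hypothetical shape of the handling; member names and surrounding code are
// assumptions based on the PR description, not the actual implementation.
public enum ProducePendingResult
{
    Ok,
    CompleteError,
    FetchError,
    ProduceError
}

// In the polling background service loop:
switch (await _producer.ProducePendingAsync(stoppingToken))
{
    case ProducePendingResult.Ok:
        _backoffCalculator.Reset();
        await WaitForNextPollOrTriggerAsync(stoppingToken);
        break;

    case ProducePendingResult.CompleteError:
        // Completion is retried separately, with its own backoff (see #45).
        await WaitForNextPollOrTriggerAsync(stoppingToken);
        break;

    default:
        // Any other failure suspends polling for an increasing, capped interval.
        await Task.Delay(_backoffCalculator.CalculateBackoff(), stoppingToken);
        break;
}
```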

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated no comments.

Summary per file:

  • tests/Core.Tests/Polling/ProduceIssueBackoffCalculatorTests.cs: Added tests for the new backoff calculation logic.
  • tests/Core.Tests/Polling/PollingProducerTests.cs: Updated tests to use the new PollingProducerMetrics type.
  • tests/Core.Tests/Polling/PollingBackgroundServiceTests.cs: Refactored instantiation to use CreateSut and verified producer invocation timing.
  • src/Core/ServiceCollectionExtensions.cs: Registered updated metrics services for polling and cleanup functionality.
  • src/Core/Polling/PollingProducer.cs: Reworked batch production flow with explicit error handling using an enum-based switch.
  • src/Core/Polling/PollingBackgroundService.cs: Introduced switch-case based handling for the different ProducePendingResult cases, with corresponding backoff logic.
  • src/Core/CompletionRetrier.cs: Changed log level for retry failures from Warning to Error.
  • src/Core/OpenTelemetry/*: Updated metrics reporting classes to reflect new naming and tagging conventions.
  • samples/mysql/* & docs/observability/*: Updated dashboard and documentation to reflect new metrics naming conventions.
Comments suppressed due to low confidence (2)

src/Core/Polling/CompletionRetrier.cs:83

  • The logging level for retry failures was changed from Warning to Error. Confirm that this change is intentional as it may affect alerting behavior in production.
    [LoggerMessage(LogLevel.Error,

src/Core/Polling/PollingProducer.cs:118

  • [nitpick] The explicit use of CancellationToken.None here is intentional to allow batch completion even during shutdown. Ensure this behavior aligns with overall application shutdown requirements.
                await batchContext.CompleteAsync(batchProduceResult.Ok, CancellationToken.None);
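
For context, the surrounding flow presumably looks something like this (the lines around the quoted call are assumptions):

```csharp
// Production itself honors the service's stopping token...
var batchProduceResult = await _producer.ProduceAsync(batchContext.Messages, stoppingToken);

// ...but completing the batch deliberately does not, so that messages already
// produced are still marked as completed during shutdown, avoiding duplicate
// production on restart.
await batchContext.CompleteAsync(batchProduceResult.Ok, CancellationToken.None);
```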

@joaofbantunes joaofbantunes changed the title Improve polling error handling Improve polling producer error handling Jun 29, 2025
@joaofbantunes joaofbantunes merged commit d209c1e into main Jul 7, 2025
1 check passed
@joaofbantunes joaofbantunes deleted the improve-polling-error-handling branch July 7, 2025 12:17