feat(logs): add Buffering and Batching #4310


Draft
wants to merge 12 commits into base: feat/logs

Conversation

Flash0ver
Member

Add minimal support for Buffering / Batching for Sentry Logs.
Planned follow-up to avoid having high-frequency logging drop other, potentially more important, envelopes: #4306

Changes:

  • SentryStructuredLogger is now IDisposable
    • and is Disposed when Hub is Disposed
  • the user-facing SentryLog is no longer ISentryJsonSerializable
  • but the new internal StructuredLog is ISentryJsonSerializable
    • this new internal type groups one or many SentryLog instances for serialization
  • the DefaultSentryStructuredLogger no longer directly captures and enqueues Log-Envelopes
    • but indirectly via the new BatchProcessor instead
  • new type BatchProcessor
    • buffers Logs
    • sends Logs when
      • Count-Threshold is reached
      • Timeout is reached
    • utilizes a Timer that is abstracted via BatchProcessorTimer to enable testing
  • using a new helper type, BatchBuffer, which is a slim wrapper over an Array
  • 2 new temporary Experimental SentryOptions
    • SentryOptions.InternalBatchSize controls the buffer's item count threshold
    • SentryOptions.InternalBatchTimeout controls the buffer's interval
    • both options are meant to be temporary configuration for applications to play with different values during dogfooding
    • these options are not in the specification and will be removed (or made internal) before removing [Experimental] of the Sentry Logs feature
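The count-threshold / timeout flow described in the bullets above can be sketched as follows. This is a simplified illustration in the spirit of the PR, with the timer omitted; `Enqueue` and `Flush` mirror names used in the PR, but the bodies here are assumptions, not the actual implementation:

```csharp
using System;
using System.Collections.Generic;

// Simplified sketch of the count-threshold flush (the timeout-driven flush
// would additionally arm a timer that calls Flush() when the interval elapses).
internal sealed class BatchProcessorSketch<T>
{
    private readonly int _batchSize;
    private readonly List<T> _buffer;
    private readonly object _lock = new object();
    private readonly Action<IReadOnlyList<T>> _send;

    public BatchProcessorSketch(int batchSize, Action<IReadOnlyList<T>> send)
    {
        _batchSize = batchSize;
        _buffer = new List<T>(batchSize);
        _send = send;
    }

    public void Enqueue(T item)
    {
        lock (_lock)
        {
            _buffer.Add(item);
            if (_buffer.Count >= _batchSize)
            {
                Flush(); // count threshold reached: send immediately
            }
        }
    }

    public void Flush()
    {
        lock (_lock)
        {
            if (_buffer.Count == 0)
            {
                return;
            }
            var batch = _buffer.ToArray();
            _buffer.Clear();
            _send(batch); // in the SDK this becomes one Log-Envelope via IHub
        }
    }
}
```

In the SDK, the `_send` callback corresponds to capturing an envelope containing the internal `StructuredLog` that groups the batched `SentryLog` instances.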

#skip-changelog

@Flash0ver Flash0ver self-assigned this Jun 26, 2025
@Flash0ver Flash0ver requested review from bruno-garcia and jamescrosswell and removed request for jamescrosswell June 26, 2025 21:17
Contributor

github-actions bot commented Jun 26, 2025

Messages
📖 Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Generated by 🚫 dangerJS against 0774709

/// Sentry Specification: <see href="https://develop.sentry.dev/sdk/telemetry/spans/batch-processor/"/>.
/// OpenTelemetry spec: <see href="https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md"/>.
/// </remarks>
internal sealed class BatchProcessor : IDisposable
Collaborator

We could think about giving this a more specific/descriptive name. BatchProcessor is very generic and we could have other batch processors (for metrics or other things) in the future.

If we weren't worried about long names, I guess it would be a StructuredLogBatchProcessor... Or a more concise alternative might be LogBatch (so you'd do something like _logBatch.Enqueue(log))?

Alternatively, it could be made generic (like the BatchBuffer class that it uses), but I'd recommend we do that if/when we need a batch processor for a second category of Sentry events (it's hard to make something properly generic until you have 2 or 3 different concrete implementations).

{
private readonly IHub _hub;
private readonly BatchProcessorTimer _timer;
private readonly BatchBuffer<SentryLog> _logs;
Collaborator

It looks like we have a _lock and a _logs buffer here.

Any reason not to use a ConcurrentQueue or a ConcurrentQueueLite? It'd be good to have fewer classes to maintain (unless they're truly necessary).
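For comparison, the `ConcurrentQueue`-based approach this comment raises might look roughly like this. This is a hypothetical sketch, not code from the PR, and it glosses over the race between concurrent flushers:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical alternative: ConcurrentQueue instead of a lock + BatchBuffer.
internal sealed class QueueBasedBatcher<T>
{
    private readonly ConcurrentQueue<T> _queue = new ConcurrentQueue<T>();
    private int _count; // tracked via Interlocked; ConcurrentQueue<T>.Count takes a snapshot

    public void Enqueue(T item, int batchSize, Action<T[]> send)
    {
        _queue.Enqueue(item);
        if (Interlocked.Increment(ref _count) >= batchSize)
        {
            var batch = new List<T>(batchSize);
            while (batch.Count < batchSize && _queue.TryDequeue(out var dequeued))
            {
                Interlocked.Decrement(ref _count);
                batch.Add(dequeued);
            }
            if (batch.Count > 0)
            {
                send(batch.ToArray());
            }
        }
    }
}
```

The trade-off: no lock on the hot path, but batch boundaries become approximate when several threads cross the threshold at once.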

private readonly IHub _hub;
private readonly BatchProcessorTimer _timer;
private readonly BatchBuffer<SentryLog> _logs;
private readonly Lock _lock;
Collaborator

If we do want to control concurrency in this class (rather than using one of the build in data structures supporting concurrency), consider using something like ReaderWriterLockSlim... this is mainly useful when you have lots of reads and a few writes. However if most of the operations are writes then it doesn't buy us much/anything.

Member Author

I did indeed start with a more elaborate locking-strategy with ReaderWriterLockSlim.
And multi-targeted with `ReaderWriterLock` on `NETFRAMEWORK`, where `Thread.Abort` is allowed, since this code path may be invoked from user code.

Since we do expect more writes than reads, I then simplified it to just a "full-lock-all" approach for now, with the option to optimize later.

I just did a super quick benchmark: A "flushing" call of EnqueueCore(SentryLog) with a capacity of 100 takes between 2 and 3 milliseconds.

I'll do a bit of experimenting and benchmarking over the weekend with different locking strategies.

_batchProcessor = new BatchProcessor(hub, ClampBatchCount(options.Experimental.InternalBatchSize), ClampBatchInterval(options.Experimental.InternalBatchTimeout));
}

private static int ClampBatchCount(int batchCount)
Collaborator

Could this go in the setter for the InternalBatchSize on the options instead?

If an invalid value has been configured, should we either throw an exception (we'd only do that if it happened when initialising the options - we don't generally want the SDK to throw exceptions otherwise) or at the very least log a message letting people know they're not configuring things correctly?
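Moving the validation into the options setter, as suggested, could look something like this (a hypothetical sketch; `OptionsSketch` and the choice to throw `ArgumentOutOfRangeException` are assumptions, not the PR's code):

```csharp
using System;

// Hypothetical: validate in the option's setter instead of clamping at the call site.
public class OptionsSketch
{
    private int _internalBatchSize = 100;

    public int InternalBatchSize
    {
        get => _internalBatchSize;
        set
        {
            if (value < 1 || value > 1_000_000)
            {
                throw new ArgumentOutOfRangeException(nameof(value),
                    "InternalBatchSize must be between 1 and 1,000,000.");
            }
            _internalBatchSize = value;
        }
    }
}
```

Throwing at options-initialization time surfaces misconfiguration immediately, in line with the comment's point that the SDK should otherwise avoid throwing.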

: batchCount;
}

private static TimeSpan ClampBatchInterval(TimeSpan batchInterval)
Collaborator

Again, could this be moved to the setter?

/// <remarks>
/// Threshold of items in the buffer when sending all items, regardless of <see cref="InternalBatchTimeout"/>.
/// </remarks>
public int InternalBatchSize { get; set; } = 100;
Collaborator

If we'll be removing this in future, presumably we need to decide what an appropriate "final" or "permanent" value for this will be. How are we going to determine that (and so what is the value in exposing this publicly, temporarily)?

Member Author

Our docs/spec suggest, as a starting point, using 100 items and 5 seconds: https://develop.sentry.dev/sdk/telemetry/logs/#buffering
But they recommend a more elaborate batching strategy later on, via the Batch Processor (currently a work in progress): https://develop.sentry.dev/sdk/telemetry/spans/batch-processor/
The Batch Processor should also be used for span streaming.
The current guideline is: start with 5 seconds, with the option to go as high as 30 seconds, plus an auto-flush once the batch reaches 1 MiB in size.
For the 1MiB, we could utilize Utf8JsonWriter.BytesCommitted + Utf8JsonWriter.BytesPending, but that would require a bit of rewriting / redesigning our Background-Worker / Transport layer.
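Such a size measurement could in principle be taken with `Utf8JsonWriter`, e.g. as below. This is a sketch: `PayloadSize.Measure` is a made-up helper, and the SDK's actual serialization path through the Background-Worker / Transport layer differs:

```csharp
using System.Buffers;
using System.Text.Json;

// Sketch: measure the serialized size of a payload -- the kind of number
// a 1 MiB auto-flush trigger would track cumulatively per batch.
internal static class PayloadSize
{
    public static long Measure(object payload)
    {
        var buffer = new ArrayBufferWriter<byte>();
        using var writer = new Utf8JsonWriter(buffer);
        JsonSerializer.Serialize(writer, payload);
        // Everything written so far: flushed bytes plus bytes not yet flushed.
        return writer.BytesCommitted + writer.BytesPending;
    }
}
```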

We could also keep these options, but make them internal, to be able to choose different values depending on the SDK / Integration.

I was thinking about - during the [Experimental] phase - trying to explore the best initial values during dogfooding.

What do you think?

Collaborator

It depends where we expect to collect the feedback that will inform ultimately how these values are determined.

If we're expecting SDK users to play around with different values and get back to us to let us know what the ideal values for these things are, then it makes sense for these options to be public.

From the docs, it doesn't look like that's how we'll decide the values to be used here... so I reckon we either bake in the "simple approach to buffer" (described here) or implement the two triggers defined in the specification (5 s or 1 MiB). The 1 MiB trigger would require that we calculate the size of each SentryLog and track the cumulative size of the logs in the StructuredLog.

Either way, I don't think we should surface this in the options to the user. We could hold it in the state of the BatchProcessor instead.

@bruno-garcia
Member

@sentry review


On it! We are reviewing the PR and will provide feedback shortly.


PR Description

This pull request introduces a batch processing mechanism for Sentry logs. The goal is to improve performance and reduce overhead by collecting log messages and sending them to Sentry in batches, rather than individually.


Key Technical Changes

  1. BatchBuffer Class: A new BatchBuffer<T> class is introduced to act as a fixed-size buffer for collecting log messages. It provides methods for adding items, checking if the buffer is full or empty, and converting the buffer to an array.
  2. BatchProcessor Class: A BatchProcessor class is implemented to manage the batching process. It uses a timer to periodically flush the buffer and send the logs to Sentry. It also flushes the buffer when it reaches its maximum capacity.
  3. BatchProcessorTimer Abstraction: An abstract BatchProcessorTimer class and its TimersBatchProcessorTimer implementation are added to abstract the timer functionality, allowing for different timer implementations.
  4. StructuredLog Class: A StructuredLog class is introduced to encapsulate an array of SentryLog items, representing a batch of logs to be sent to Sentry.
  5. Modified SentryLog Class: The SentryLog class is modified to remove the ISentryJsonSerializable interface implementation and the WriteTo method is made internal.
  6. Envelope Modifications: The Envelope and EnvelopeItem classes are modified to handle the new StructuredLog class, allowing for the creation of envelopes containing batches of logs.
  7. SentryOptions Configuration: The SentryOptions class is extended with InternalBatchSize and InternalBatchTimeout properties to configure the batch processing behavior.
  8. DefaultSentryStructuredLogger Modifications: The DefaultSentryStructuredLogger is modified to use the BatchProcessor for enqueueing logs, and the Dispose method is updated to dispose of the BatchProcessor.

Architecture Decisions

  1. Batching Mechanism: A timer-based batching approach was chosen to balance latency and throughput. Logs are sent either when the buffer is full or when a specified time interval has elapsed.
  2. Locking Strategy: A lock is used to synchronize access to the buffer and prevent race conditions. This ensures that log messages are added and flushed safely.
  3. Timer Abstraction: The BatchProcessorTimer abstraction allows for flexibility in choosing the timer implementation. The default implementation uses System.Timers.Timer, but other implementations could be used in different environments.
  4. StructuredLog Encapsulation: The StructuredLog class encapsulates the array of SentryLog items, providing a clear separation of concerns and simplifying the envelope creation process.

Dependencies and Interactions

  1. IHub: The BatchProcessor depends on the IHub interface to capture envelopes and send them to Sentry.
  2. SentryOptions: The DefaultSentryStructuredLogger relies on SentryOptions to configure the batch processing behavior.
  3. System.Timers.Timer: The TimersBatchProcessorTimer implementation uses the System.Timers.Timer class for scheduling the batch flushing.
  4. SentryLog: The StructuredLog class depends on the SentryLog class to represent individual log messages.

Risk Considerations

  1. Lock Contention: The lock used to synchronize access to the buffer could become a bottleneck under high load. Performance testing should be conducted to ensure that the locking strategy is efficient.
  2. Timer Accuracy: The accuracy of the System.Timers.Timer class can be affected by system load and other factors. This could lead to variations in the batch flushing interval.
  3. Unhandled Exceptions: Exceptions thrown during the flush operation or in event handlers could crash the timer thread. Exception handling should be implemented to prevent this.
  4. ObjectDisposedExceptions: The SentryStructuredLogger needs to be properly disposed to prevent ObjectDisposedExceptions when logging after disposal.

Notable Implementation Details

  1. The Debug.Assert statement in BatchProcessor.EnqueueCore is used to check for unexpected conditions. This assertion should be replaced with a more robust error handling mechanism in production code.
  2. The ClampBatchCount and ClampBatchInterval methods in DefaultSentryStructuredLogger are used to ensure that the batch size and interval are within reasonable bounds. These methods should be carefully reviewed to ensure that the limits are appropriate.
  3. The FakeBatchProcessorTimer class in the test suite uses reflection to create ElapsedEventArgs instances for .NET Framework. This approach is brittle and could break if the internal constructor changes. A more robust approach should be used for testing.

Comment on lines 49 to 53
internal void Clear()
{
    if (_count == 0)
    {
        return;


The Clear() method has an early return for empty buffers, but Array.Clear() is safe to call even on empty arrays. Consider removing the early return to simplify the code. However, if the early return is for performance reasons, add a comment explaining this optimization.

Suggested change
- internal void Clear()
- {
-     if (_count == 0)
-     {
-         return;
+ internal void Clear()
+ {
+     var count = _count;
+     _count = 0;
+     Array.Clear(_array, 0, count);
+ }

Comment on lines 58 to 59
var added = _logs.TryAdd(log);
Debug.Assert(added, $"Since we currently have no lock-free scenario, it's unexpected to exceed the {nameof(BatchBuffer<SentryLog>)}'s capacity.");


The Debug.Assert statement on line 59 could be problematic in release builds where assertions are disabled. If this is a critical invariant that must be maintained, consider throwing an exception instead or using a different approach to ensure the condition is always checked.

Suggested change
  var added = _logs.TryAdd(log);
- Debug.Assert(added, $"Since we currently have no lock-free scenario, it's unexpected to exceed the {nameof(BatchBuffer<SentryLog>)}'s capacity.");
+ if (!added)
+ {
+     throw new InvalidOperationException($"Failed to add item to {nameof(BatchBuffer<SentryLog>)}. This indicates a potential threading issue or buffer overflow.");
+ }

Comment on lines 79 to 90

private void IntervalElapsed(object? sender, ElapsedEventArgs e)
{
    _timer.Enabled = false;

    lock (_lock)
    {
        if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
        {
            Flush();
        }
    }


The IntervalElapsed method should handle potential exceptions from the Flush() method to prevent unhandled exceptions from crashing the timer thread. Consider wrapping the flush operation in a try-catch block.

Suggested change
  private void IntervalElapsed(object? sender, ElapsedEventArgs e)
  {
      _timer.Enabled = false;
      lock (_lock)
      {
          if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
          {
-             Flush();
+             try
+             {
+                 Flush();
+             }
+             catch (Exception ex)
+             {
+                 // Log the exception or handle it appropriately
+                 // Consider using a diagnostic logger here
+             }
          }
      }
  }

Comment on lines +24 to +31

private static int ClampBatchCount(int batchCount)
{
    return batchCount <= 0
        ? 1
        : batchCount > 1_000_000
            ? 1_000_000
            : batchCount;


The ClampBatchCount method uses magic numbers (1 and 1,000,000) without explanation. Consider defining these as named constants with clear documentation about why these specific limits were chosen.

Suggested change
- private static int ClampBatchCount(int batchCount)
- {
-     return batchCount <= 0
-         ? 1
-         : batchCount > 1_000_000
-             ? 1_000_000
-             : batchCount;
+ private const int MinBatchCount = 1;
+ private const int MaxBatchCount = 1_000_000;
+
+ private static int ClampBatchCount(int batchCount)
+ {
+     return batchCount <= 0
+         ? MinBatchCount
+         : batchCount > MaxBatchCount
+             ? MaxBatchCount
+             : batchCount;
+ }

Comment on lines +33 to 39

private static TimeSpan ClampBatchInterval(TimeSpan batchInterval)
{
    return batchInterval.TotalMilliseconds is <= 0 or > int.MaxValue
        ? TimeSpan.FromMilliseconds(int.MaxValue)
        : batchInterval;
}


The ClampBatchInterval method uses int.MaxValue as the maximum timeout, but this could result in very long timeouts that might not be practical. Consider using a more reasonable maximum timeout (e.g., 30 seconds as mentioned in the documentation).

Suggested change
- private static TimeSpan ClampBatchInterval(TimeSpan batchInterval)
- {
-     return batchInterval.TotalMilliseconds is <= 0 or > int.MaxValue
-         ? TimeSpan.FromMilliseconds(int.MaxValue)
-         : batchInterval;
- }
+ private static readonly TimeSpan MaxBatchInterval = TimeSpan.FromSeconds(30);
+
+ private static TimeSpan ClampBatchInterval(TimeSpan batchInterval)
+ {
+     return batchInterval.TotalMilliseconds <= 0
+         ? TimeSpan.FromMilliseconds(1)
+         : batchInterval > MaxBatchInterval
+             ? MaxBatchInterval
+             : batchInterval;
+ }

Comment on lines +1907 to +1916
public int InternalBatchSize { get; set; } = 100;

/// <summary>
/// This API will be removed in the future.
/// </summary>
/// <remarks>
/// Time after which all items in the buffer are sent, regardless of <see cref="InternalBatchSize"/>.
/// <para>Must not exceed 30 seconds.</para>
/// </remarks>
public TimeSpan InternalBatchTimeout { get; set; } = TimeSpan.FromSeconds(5);


The InternalBatchSize and InternalBatchTimeout properties are marked as experimental but lack clear validation. Consider adding validation to ensure these values are within reasonable bounds, similar to the clamping done in DefaultSentryStructuredLogger.

Suggested change
- public int InternalBatchSize { get; set; } = 100;
-
- /// <summary>
- /// This API will be removed in the future.
- /// </summary>
- /// <remarks>
- /// Time after which all items in the buffer are sent, regardless of <see cref="InternalBatchSize"/>.
- /// <para>Must not exceed 30 seconds.</para>
- /// </remarks>
- public TimeSpan InternalBatchTimeout { get; set; } = TimeSpan.FromSeconds(5);
+ /// <summary>
+ /// This API will be removed in the future.
+ /// </summary>
+ /// <remarks>
+ /// Threshold of items in the buffer when sending all items, regardless of <see cref="InternalBatchTimeout"/>.
+ /// Valid range: 1 to 1,000,000.
+ /// </remarks>
+ public int InternalBatchSize
+ {
+     get => _internalBatchSize;
+     set => _internalBatchSize = Math.Max(1, Math.Min(1_000_000, value));
+ }
+ private int _internalBatchSize = 100;

Comment on lines 84 to 89
lock (_lock)
{
    if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
    {
        Flush();
    }


The locking strategy could potentially cause performance issues under high load. Consider whether the lock needs to be held for the entire duration of the flush operation, or if it could be released earlier to allow more items to be enqueued while the flush is in progress.

Suggested change
- lock (_lock)
- {
-     if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
-     {
-         Flush();
-     }
+ lock (_lock)
+ {
+     if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
+     {
+         // Take a snapshot and release the lock quickly
+         var logsToFlush = _logs.ToArrayAndClear();
+         _lastFlush = DateTime.UtcNow;
+         // Release lock before expensive I/O operation
+         Task.Run(() =>
+         {
+             try
+             {
+                 _ = _hub.CaptureEnvelope(Envelope.FromLog(new StructuredLog(logsToFlush)));
+             }
+             catch (Exception ex)
+             {
+                 // Handle exception appropriately
+             }
+         });
+     }
+ }

Member

@bruno-garcia bruno-garcia left a comment

please add a benchmark for this too, we definitely want to get an idea of how this behaves under load. In the tests, even if you're not committing that consider writing something that writes in a tight loop using a few threads (like 1 per CPU core you have) and check how that behaves too.
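Such a tight-loop, one-thread-per-core check could be sketched as below. This is an illustrative harness; `Stress.Run` is a made-up helper, and the real test would call into `SentryStructuredLogger` / the `BatchProcessor`:

```csharp
using System;
using System.Threading.Tasks;

// Sketch of the suggested check: hammer an enqueue-like action
// from one writer thread per CPU core.
internal static class Stress
{
    public static void Run(Action enqueue, int iterationsPerThread)
    {
        var tasks = new Task[Environment.ProcessorCount];
        for (var t = 0; t < tasks.Length; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (var i = 0; i < iterationsPerThread; i++)
                {
                    enqueue();
                }
            });
        }
        Task.WaitAll(tasks);
    }
}
```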


var count = _count;
_count = 0;
Array.Clear(_array, 0, count);
Member

Might want to go with Buffer.BlockCopy(); not sure it's still relevant, but back in the day, if this was on the critical path, there was a perf advantage. Since we call it in ToArrayAndClear, I imagine it might be, but it's worth profiling before making any perf improvement changes.

Member Author

Buffer.BlockCopy only works with primitive types.
There is Buffer.MemoryCopy, but I think this one is also intended for unmanaged structs ... not sure about reference types.

else if (_logs.IsFull)
{
    _timer.Enabled = false;
    Flush();
Member

We're doing way too much work in the critical path.

Think about what absolutely must happen inside the lock, and what we can move/copy around to have it happen async.

Once the reference that's mutated concurrently is swapped with a new one, new messages can start logging, and you can continue the capture work asynchronously: allocating the structured log and the envelope, calling Capture, etc.

Member

We might be better off using a couple of arrays and doing some reference swap.
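The reference-swap idea can be sketched like this (an illustration under assumptions, not the PR's implementation): fill an array under the lock, and when it's full, swap in a fresh one so the expensive capture happens outside the lock:

```csharp
using System;

// Sketch: swap buffers under the lock, do the expensive capture outside it.
internal sealed class SwappingBuffer<T>
{
    private readonly object _lock = new object();
    private readonly int _capacity;
    private T[] _active;
    private int _count;

    public SwappingBuffer(int capacity)
    {
        _capacity = capacity;
        _active = new T[capacity];
    }

    public void Add(T item, Action<T[]> capture)
    {
        T[]? full = null;
        lock (_lock)
        {
            _active[_count++] = item;
            if (_count == _capacity)
            {
                full = _active;
                _active = new T[_capacity]; // swap in a fresh buffer; writers proceed
                _count = 0;
            }
        }
        if (full is not null)
        {
            capture(full); // envelope creation / capture happens outside the lock
        }
    }
}
```

Allocating a new array per batch could be avoided by cycling through a small pool, which is roughly the "Ring-Buffer of Batch-Buffers" idea mentioned below.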

Member Author

I did start out with a lock-free solution for non-flushing logs.
And then realized a flushing log takes 2-3 ms.
But I was not considering the context of Unity where this is unacceptable.
I'm going back to the (mostly) lock-free solution I had initially, and will build up on that to be mostly lock-free with something like a Ring-Buffer of Batch-Buffers.

Thanks for the feedback ... also thanks @bitsandfoxes.

@Flash0ver Flash0ver marked this pull request as draft July 1, 2025 14:34

3 participants