feat(logs): add Buffering and Batching #4310
base: feat/logs
Conversation
/// Sentry Specification: <see href="https://develop.sentry.dev/sdk/telemetry/spans/batch-processor/"/>.
/// OpenTelemetry spec: <see href="https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md"/>.
/// </remarks>
internal sealed class BatchProcessor : IDisposable
We could think about giving this a more specific/descriptive name. `BatchProcessor` is very generic, and we could have other batch processors (for metrics or other things) in the future. If we weren't worried about long names, I guess it would be a `StructuredLogBatchProcessor`... Or a more concise alternative might be `LogBatch` (so you'd do something like `_logBatch.Enqueue(log)`)?

Alternatively, it could be made generic (like the `BatchBuffer` class that it uses), but I'd recommend we do that if/when we need a batch processor for a second category of Sentry events (it's hard to make something properly generic until you have 2 or 3 different concrete implementations).
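For illustration only, the generic shape could look roughly like this (a minimal sketch; `IBatchSink<T>` and all other names here are hypothetical, not part of this PR):

```csharp
using System.Collections.Generic;

// Hypothetical sketch: a batch processor generic over the event category.
// IBatchSink<T> stands in for whatever would ultimately capture the batch.
internal interface IBatchSink<T>
{
    void CaptureBatch(IReadOnlyList<T> batch);
}

internal sealed class BatchProcessor<T>
{
    private readonly IBatchSink<T> _sink;
    private readonly List<T> _items;
    private readonly int _capacity;
    private readonly object _gate = new();

    public BatchProcessor(IBatchSink<T> sink, int capacity)
    {
        _sink = sink;
        _capacity = capacity;
        _items = new List<T>(capacity);
    }

    public void Enqueue(T item)
    {
        T[]? batch = null;
        lock (_gate)
        {
            _items.Add(item);
            if (_items.Count >= _capacity)
            {
                // Snapshot under the lock; capture happens outside it.
                batch = _items.ToArray();
                _items.Clear();
            }
        }

        if (batch is not null)
        {
            _sink.CaptureBatch(batch);
        }
    }
}
```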
{
    private readonly IHub _hub;
    private readonly BatchProcessorTimer _timer;
    private readonly BatchBuffer<SentryLog> _logs;
It looks like we have a `_lock` and a `_logs` buffer here.

Any reason not to use a `ConcurrentQueue` or a `ConcurrentQueueLite`? It'd be good to have fewer classes to maintain (unless they're truly necessary).
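A rough sketch of that alternative, assuming a simple count threshold triggers the flush (the class name and the flush hand-off are illustrative, not part of the PR):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Sketch only: ConcurrentQueue replaces the BatchBuffer + Lock pair.
internal sealed class QueueBasedBatcher
{
    private readonly ConcurrentQueue<SentryLog> _queue = new();
    private readonly int _batchSize;
    private int _count;

    public QueueBasedBatcher(int batchSize) => _batchSize = batchSize;

    public void Enqueue(SentryLog log)
    {
        _queue.Enqueue(log);

        // Lock-free counter; the thread that crosses the threshold drains.
        if (Interlocked.Increment(ref _count) >= _batchSize)
        {
            Drain();
        }
    }

    private void Drain()
    {
        var batch = new List<SentryLog>(_batchSize);
        while (batch.Count < _batchSize && _queue.TryDequeue(out var log))
        {
            Interlocked.Decrement(ref _count);
            batch.Add(log);
        }
        // Hand the batch to the hub/transport here.
    }
}
```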
    private readonly IHub _hub;
    private readonly BatchProcessorTimer _timer;
    private readonly BatchBuffer<SentryLog> _logs;
    private readonly Lock _lock;
If we do want to control concurrency in this class (rather than using one of the built-in data structures supporting concurrency), consider using something like `ReaderWriterLockSlim`... this is mainly useful when you have lots of reads and few writes. However, if most of the operations are writes, then it doesn't buy us much/anything.
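For reference, the pattern would look roughly like this (a sketch with the buffer internals simplified; the capacity and member names are illustrative):

```csharp
using System.Threading;

// Sketch of a ReaderWriterLockSlim-guarded buffer: cheap concurrent reads
// (e.g. IsEmpty checks from the timer), exclusive writes on TryAdd/Flush.
internal sealed class RwLockedBuffer
{
    private readonly ReaderWriterLockSlim _rwLock = new();
    private readonly SentryLog[] _array = new SentryLog[100];
    private int _count;

    public bool IsEmpty
    {
        get
        {
            _rwLock.EnterReadLock(); // many readers may hold this concurrently
            try { return _count == 0; }
            finally { _rwLock.ExitReadLock(); }
        }
    }

    public bool TryAdd(SentryLog log)
    {
        _rwLock.EnterWriteLock(); // writers are exclusive
        try
        {
            if (_count >= _array.Length)
            {
                return false;
            }
            _array[_count++] = log;
            return true;
        }
        finally { _rwLock.ExitWriteLock(); }
    }
}
```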
I did indeed start with a more elaborate locking strategy with `ReaderWriterLockSlim`, multi-targeted with `ReaderWriterLock` on `NETFRAMEWORK`, where `Thread.Abort` is allowed, since this code path may be invoked from user code.

Since we do expect more writes than reads, I then simplified it to just a "full-lock-all" approach for now, with the option to optimize later.

I just did a super quick benchmark: a "flushing" call of `EnqueueCore(SentryLog)` with a capacity of 100 takes between 2 and 3 milliseconds.

I'll do a bit of experimenting and benchmarking over the weekend with different locking strategies.
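A BenchmarkDotNet harness for that comparison might look something like this (a sketch; the benchmark bodies stand in for the real enqueue paths rather than calling the actual `EnqueueCore`):

```csharp
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class LockingStrategyBenchmarks
{
    private readonly object _monitor = new();
    private readonly ReaderWriterLockSlim _rwLock = new();
    private int _count;

    [Benchmark(Baseline = true)]
    public void FullLock()
    {
        lock (_monitor) { _count++; } // stands in for enqueue under Monitor
    }

    [Benchmark]
    public void WriterLock()
    {
        _rwLock.EnterWriteLock(); // stands in for enqueue under a RW-lock
        try { _count++; }
        finally { _rwLock.ExitWriteLock(); }
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LockingStrategyBenchmarks>();
}
```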
_batchProcessor = new BatchProcessor(hub, ClampBatchCount(options.Experimental.InternalBatchSize), ClampBatchInterval(options.Experimental.InternalBatchTimeout));
}

private static int ClampBatchCount(int batchCount)
Could this go in the setter for the `InternalBatchSize` on the options instead?

If an invalid value has been configured, should we either throw an exception (we'd only do that if it happened when initialising the options - we don't generally want the SDK to throw exceptions otherwise) or at the very least log a message letting people know they're not configuring things correctly?
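A sketch of what that could look like on the options type (the bounds mirror `ClampBatchCount`; whether to throw or log is the open question above):

```csharp
using System;

// Sketch: validate at options-initialisation time instead of clamping later.
private int _internalBatchSize = 100;

public int InternalBatchSize
{
    get => _internalBatchSize;
    set
    {
        if (value <= 0 || value > 1_000_000)
        {
            // Throwing here happens while the user is configuring the SDK,
            // not on the logging hot path.
            throw new ArgumentOutOfRangeException(nameof(value),
                "InternalBatchSize must be between 1 and 1,000,000.");
        }
        _internalBatchSize = value;
    }
}
```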
        : batchCount;
}

private static TimeSpan ClampBatchInterval(TimeSpan batchInterval)
Again, could this be moved to the setter?
/// <remarks>
/// Threshold of items in the buffer when sending all items, regardless of <see cref="InternalBatchTimeout"/>.
/// </remarks>
public int InternalBatchSize { get; set; } = 100;
If we'll be removing this in future, presumably we need to decide what an appropriate "final" or "permanent" value for this will be. How are we going to determine that (and so what is the value in exposing this publicly, temporarily)?
Our docs/spec suggests, as a starting point, to use 100 items and 5 seconds: https://develop.sentry.dev/sdk/telemetry/logs/#buffering

But to use a more elaborate batching strategy later on, via the Batch Processor - currently work in progress - https://develop.sentry.dev/sdk/telemetry/spans/batch-processor/

The Batch Processor should also be used for span streaming. The current guideline is: start with 5 seconds, with the option to go as high as 30 seconds, with an auto-flush of 1MiB in size.

For the 1MiB, we could utilize `Utf8JsonWriter.BytesCommitted + Utf8JsonWriter.BytesPending`, but that would require a bit of rewriting / redesigning our Background-Worker / Transport layer.

We could also keep these options, but make them `internal`, to be able to choose different values depending on the SDK / Integration.

I was thinking about - during the `[Experimental]` phase - trying to explore the best initial values during dogfooding.

What do you think?
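To make the 1 MiB trigger concrete, here is a sketch of how `Utf8JsonWriter`'s counters could be consulted during serialization (`WriteLogTo` is a hypothetical stand-in for the actual log serialization):

```csharp
using System.Buffers;
using System.Collections.Generic;
using System.Text.Json;

internal static class BatchSizeEstimator
{
    private const long MaxBatchBytes = 1024 * 1024; // 1 MiB auto-flush threshold

    internal static bool ExceedsThreshold(IEnumerable<SentryLog> logs)
    {
        var buffer = new ArrayBufferWriter<byte>();
        using var writer = new Utf8JsonWriter(buffer);

        writer.WriteStartArray();
        foreach (var log in logs)
        {
            WriteLogTo(writer, log); // hypothetical: serializes one log as JSON

            // Committed + pending bytes approximate the payload size so far.
            if (writer.BytesCommitted + writer.BytesPending > MaxBatchBytes)
            {
                return true;
            }
        }
        writer.WriteEndArray();
        return false;
    }

    private static void WriteLogTo(Utf8JsonWriter writer, SentryLog log)
    {
        // Placeholder for the real SentryLog serialization.
        writer.WriteStartObject();
        writer.WriteEndObject();
    }
}
```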
It depends where we expect to collect the feedback that will ultimately inform how these values are determined.

If we're expecting SDK users to play around with different values and get back to us to let us know what the ideal values for these things are, then it makes sense for these options to be public.

From the docs, it doesn't look like that's how we'll decide the values to be used here... so I reckon we either bake in the "simple approach to buffer" (described here) or implement the two triggers defined in the specification (5s or 1 MiB). The 1MiB trigger would require we calculate the size of each `SentryLog` and track the cumulative size of the logs in the `StructuredLog`.

Either way, I don't think we surface this up in the options to the user. We could hold it in the state for the `BatchProcessor` instead.
@sentry review

On it! We are reviewing the PR and will provide feedback shortly.
PR Description

This pull request introduces a batch processing mechanism for Sentry logs. The goal is to improve performance and reduce overhead by collecting log messages and sending them to Sentry in batches, rather than individually.

The collapsed description also covers: Key Technical Changes, Architecture Decisions, Dependencies and Interactions, Risk Considerations, and Notable Implementation Details.
src/Sentry/Internal/BatchBuffer.cs (Outdated)

internal void Clear()
{
    if (_count == 0)
    {
        return;
The `Clear()` method has an early return for empty buffers, but `Array.Clear()` is safe to call even on empty arrays. Consider removing the early return to simplify the code. However, if the early return is for performance reasons, add a comment explaining this optimization.
Suggested change:

internal void Clear()
{
    var count = _count;
    _count = 0;
    Array.Clear(_array, 0, count);
}
var added = _logs.TryAdd(log);
Debug.Assert(added, $"Since we currently have no lock-free scenario, it's unexpected to exceed the {nameof(BatchBuffer<SentryLog>)}'s capacity.");
The `Debug.Assert` statement on line 59 could be problematic in release builds, where assertions are disabled. If this is a critical invariant that must be maintained, consider throwing an exception instead or using a different approach to ensure the condition is always checked.
Suggested change:

var added = _logs.TryAdd(log);
if (!added)
{
    throw new InvalidOperationException($"Failed to add item to {nameof(BatchBuffer<SentryLog>)}. This indicates a potential threading issue or buffer overflow.");
}
private void IntervalElapsed(object? sender, ElapsedEventArgs e)
{
    _timer.Enabled = false;

    lock (_lock)
    {
        if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
        {
            Flush();
        }
    }
The `IntervalElapsed` method should handle potential exceptions from the `Flush()` method to prevent unhandled exceptions from crashing the timer thread. Consider wrapping the flush operation in a try-catch block.
Suggested change:

private void IntervalElapsed(object? sender, ElapsedEventArgs e)
{
    _timer.Enabled = false;

    lock (_lock)
    {
        if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
        {
            try
            {
                Flush();
            }
            catch (Exception ex)
            {
                // Log the exception or handle it appropriately
                // Consider using a diagnostic logger here
            }
        }
    }
}
private static int ClampBatchCount(int batchCount)
{
    return batchCount <= 0
        ? 1
        : batchCount > 1_000_000
            ? 1_000_000
            : batchCount;
The `ClampBatchCount` method uses magic numbers (1 and 1,000,000) without explanation. Consider defining these as named constants with clear documentation about why these specific limits were chosen.
Suggested change:

private const int MinBatchCount = 1;
private const int MaxBatchCount = 1_000_000;

private static int ClampBatchCount(int batchCount)
{
    return batchCount <= 0
        ? MinBatchCount
        : batchCount > MaxBatchCount
            ? MaxBatchCount
            : batchCount;
}
private static TimeSpan ClampBatchInterval(TimeSpan batchInterval)
{
    return batchInterval.TotalMilliseconds is <= 0 or > int.MaxValue
        ? TimeSpan.FromMilliseconds(int.MaxValue)
        : batchInterval;
}
The `ClampBatchInterval` method uses `int.MaxValue` as the maximum timeout, but this could result in very long timeouts that might not be practical. Consider using a more reasonable maximum timeout (e.g., 30 seconds, as mentioned in the documentation).
Suggested change:

private static readonly TimeSpan MaxBatchInterval = TimeSpan.FromSeconds(30);

private static TimeSpan ClampBatchInterval(TimeSpan batchInterval)
{
    return batchInterval.TotalMilliseconds <= 0
        ? TimeSpan.FromMilliseconds(1)
        : batchInterval > MaxBatchInterval
            ? MaxBatchInterval
            : batchInterval;
}
public int InternalBatchSize { get; set; } = 100;

/// <summary>
/// This API will be removed in the future.
/// </summary>
/// <remarks>
/// Time after which all items in the buffer are sent, regardless of <see cref="InternalBatchSize"/>.
/// <para>Must not exceed 30 seconds.</para>
/// </remarks>
public TimeSpan InternalBatchTimeout { get; set; } = TimeSpan.FromSeconds(5);
The `InternalBatchSize` and `InternalBatchTimeout` properties are marked as experimental but lack clear validation. Consider adding validation to ensure these values are within reasonable bounds, similar to the clamping done in `DefaultSentryStructuredLogger`.
Suggested change:

/// <summary>
/// This API will be removed in the future.
/// </summary>
/// <remarks>
/// Threshold of items in the buffer when sending all items, regardless of <see cref="InternalBatchTimeout"/>.
/// Valid range: 1 to 1,000,000.
/// </remarks>
public int InternalBatchSize
{
    get => _internalBatchSize;
    set => _internalBatchSize = Math.Max(1, Math.Min(1_000_000, value));
}

private int _internalBatchSize = 100;
lock (_lock)
{
    if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
    {
        Flush();
    }
The locking strategy could potentially cause performance issues under high load. Consider whether the lock needs to be held for the entire duration of the flush operation, or if it could be released earlier to allow more items to be enqueued while the flush is in progress.
Suggested change:

lock (_lock)
{
    if (!_logs.IsEmpty && e.SignalTime > _lastFlush)
    {
        // Take a snapshot and release the lock quickly
        var logsToFlush = _logs.ToArrayAndClear();
        _lastFlush = DateTime.UtcNow;

        // Release lock before expensive I/O operation
        Task.Run(() =>
        {
            try
            {
                _ = _hub.CaptureEnvelope(Envelope.FromLog(new StructuredLog(logsToFlush)));
            }
            catch (Exception ex)
            {
                // Handle exception appropriately
            }
        });
    }
}
Please add a benchmark for this too; we definitely want to get an idea of how this behaves under load. In the tests, even if you're not committing it, consider writing something that writes in a tight loop using a few threads (like 1 per CPU core you have) and check how that behaves too.
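Something along these lines, as a sketch of that tight-loop test (xUnit assumed; `CreateProcessor` and `CreateLog` are hypothetical test helpers, not existing APIs):

```csharp
using System;
using System.Threading.Tasks;
using Xunit;

public class BatchProcessorLoadTests
{
    [Fact]
    public async Task Enqueue_TightLoopOnAllCores_DoesNotThrow()
    {
        var processor = CreateProcessor(); // hypothetical test factory
        const int iterationsPerWriter = 100_000;

        // One writer per CPU core, each enqueueing in a tight loop.
        var writers = new Task[Environment.ProcessorCount];
        for (var i = 0; i < writers.Length; i++)
        {
            writers[i] = Task.Run(() =>
            {
                for (var j = 0; j < iterationsPerWriter; j++)
                {
                    processor.Enqueue(CreateLog()); // hypothetical log factory
                }
            });
        }

        await Task.WhenAll(writers);
        // Assert on captured envelopes / total log count here.
    }
}
```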
src/Sentry/Internal/BatchBuffer.cs (Outdated)

var count = _count;
_count = 0;
Array.Clear(_array, 0, count);
Might want to go with `Buffer.BlockCopy()`; not sure it's still relevant, but back in the day, if this was on the critical path, there was a perf advantage. Since we call it in `ToArrayAndClear`, I imagine it might be, but worth profiling it before doing any perf improvement changes.
`Buffer.BlockCopy` only works with primitive types.

There is `Buffer.MemoryCopy`, but I think this one is also intended for `unmanaged` structs ... not sure about reference types.
else if (_logs.IsFull)
{
    _timer.Enabled = false;
    Flush();
We're doing way too much work in the critical path.

Think about what absolutely must happen inside the lock, and what we can move/copy around to have it happen async.

Once the reference that's mutated concurrently is swapped with a new one, new messages can start logging, and you can continue doing the capture work async: allocating the structured log, the envelope, calling Capture, etc.
We might be better off using a couple of arrays and doing some reference swap.
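A sketch of that reference swap as a fragment of the processor (`_lock` and `_hub` as in the PR; the `Envelope.FromLog`/`StructuredLog` calls follow the shapes used elsewhere in this thread, the rest is illustrative):

```csharp
using System;

// Sketch: swap the active array under the lock, capture outside it.
private SentryLog[] _active = new SentryLog[100];
private int _count;

private void FlushViaSwap()
{
    SentryLog[] full;
    int count;

    lock (_lock)
    {
        full = _active;
        count = _count;
        _active = new SentryLog[full.Length]; // new writers land here immediately
        _count = 0;
    }

    // Outside the lock: allocate the StructuredLog, build the envelope, capture.
    var batch = new SentryLog[count];
    Array.Copy(full, batch, count);
    _ = _hub.CaptureEnvelope(Envelope.FromLog(new StructuredLog(batch)));
}
```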
I did start out with a lock-free solution for non-flushing logs, and then realized a flushing log takes 2-3 ms. But I was not considering the context of Unity, where this is unacceptable.

I'm going back to the (mostly) lock-free solution I had initially, and will build up on that to be mostly lock-free with something like a Ring-Buffer of Batch-Buffers.

Thanks for the feedback ... also thanks @bitsandfoxes.
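Very roughly, that direction could look like this (a sketch under heavy assumptions: a `BatchBuffer<T>(int capacity)` constructor and a thread-safe `TryAdd` are taken as given, and the flush hand-off is elided):

```csharp
using System.Threading;

// Sketch: a ring of batch buffers; a full buffer advances the ring so
// writers rarely contend, and the full buffer can be flushed off-path.
internal sealed class BatchBufferRing
{
    private readonly BatchBuffer<SentryLog>[] _ring;
    private int _index;

    public BatchBufferRing(int buffers, int capacityPerBuffer)
    {
        _ring = new BatchBuffer<SentryLog>[buffers];
        for (var i = 0; i < buffers; i++)
        {
            _ring[i] = new BatchBuffer<SentryLog>(capacityPerBuffer);
        }
    }

    public void Enqueue(SentryLog log)
    {
        while (true)
        {
            var current = Volatile.Read(ref _index);
            if (_ring[current % _ring.Length].TryAdd(log))
            {
                return;
            }

            // Buffer is full: advance the ring; losing the race is fine,
            // the loop simply retries on the new buffer. The full buffer
            // would be handed to the flusher here.
            Interlocked.CompareExchange(ref _index, current + 1, current);
        }
    }
}
```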
Add minimal support for Buffering / Batching for Sentry Logs.

Planned follow-up to avoid having high-frequency logging drop other, potentially more important, envelopes: #4306

Changes:

- `SentryStructuredLogger` is now `IDisposable` and is `Dispose`d when the `Hub` is `Dispose`d
- `SentryLog` is no longer `ISentryJsonSerializable`
- new `StructuredLog`, which is `ISentryJsonSerializable` and wraps `SentryLog` instances for serialization
- `DefaultSentryStructuredLogger` no longer directly captures and enqueues Log-Envelopes; it uses the new `BatchProcessor` instead
- new `BatchProcessor`
  - uses `BatchProcessorTimer` to enable testing
  - uses `BatchBuffer`, which is a slim wrapper over an `Array`
- new `SentryOptions`
  - `SentryOptions.InternalBatchSize` controls the buffer's item count threshold
  - `SentryOptions.InternalBatchTimeout` controls the buffer's interval
  - to be changed (e.g. made `internal`) before removing `[Experimental]` of the Sentry Logs feature

#skip-changelog
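For reference, configuring the new experimental options would look something like this (assuming the SDK's usual `SentrySdk.Init` shape; the `EnableLogs` flag name is an assumption, not confirmed by this PR):

```csharp
using System;

SentrySdk.Init(options =>
{
    options.Dsn = "...";
    options.Experimental.EnableLogs = true; // assumption: logs opt-in flag
    options.Experimental.InternalBatchSize = 100; // flush after 100 buffered logs
    options.Experimental.InternalBatchTimeout = TimeSpan.FromSeconds(5); // or after 5 seconds
});
```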