perf(v2): static parquet page buffer size #4208


Merged
kolesnikovae merged 1 commit into main from perf/static-parquet-page-size on May 29, 2025

Conversation

kolesnikovae
Collaborator

@kolesnikovae kolesnikovae commented May 28, 2025

In this PR, I'm increasing the parquet buffers and making them static.

Currently, parquet buffer sizes are determined based on the table size, with lower and upper bounds of 64KB and 1MB, respectively. This approach helps prevent high memory consumption in the segment-writer, compaction-worker, and query-backend services, which typically handle many small tables concurrently. While this strategy works well in the common case – keeping memory usage low – it can negatively impact query performance in certain scenarios and is generally overly conservative.
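
For illustration, a minimal sketch of the sizing scheme this PR replaces; the function name and the scaling factor are hypothetical, only the 64KB/1MB clamping mirrors the described behavior:

const (
	minBufferSize = 64 << 10 // 64KB lower bound
	maxBufferSize = 1 << 20  // 1MB upper bound
)

// dynamicBufferSize derives a page buffer size from the table size and
// clamps it to [minBufferSize, maxBufferSize]. This PR replaces the
// computation with static constants (see the const block in the diff below).
func dynamicBufferSize(tableSize int64) int {
	size := int(tableSize / 16) // hypothetical scaling factor
	if size < minBufferSize {
		return minBufferSize
	}
	if size > maxBufferSize {
		return maxBufferSize
	}
	return size
}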


An example of what performance degradation looks like and how it can be identified.

Consider a query trace with a query-backend Invoke span (3ac41dbe32f4c111c67181c236628a7e): it spent 2.3s of CPU time but took 17s of wall-clock time. The span profile:
[image: span profile]

Here we see that CPU time was spent in object storage I/O and parquet decoding (the amd64-avx2 and arm64-naïve implementations demonstrate comparable performance; the profile was collected on arm64).

The bottleneck is RepeatedRowColumnIterator, where we actually read pages – 52 events (the last one is a summary):
[images: page read event log]

The event log suggests that:

  • Fetched around 50MB of data, ~4.5K profiles.
  • Fetched 50 adjacent pages => the data range is contiguous.

This matches expectations: 1MB pages, a contiguous data range, and a data chunk of manageable size (50MB). Although I find it weird that the number of rows per page varies this much.

The change alone won't resolve the problem: we're trading memory for I/O throughput.

@kolesnikovae kolesnikovae force-pushed the perf/static-parquet-page-size branch from aa1a4b3 to 7d8894e on May 28, 2025 04:45
@kolesnikovae
Collaborator Author

kolesnikovae commented May 28, 2025

  • The decoding performance might be suboptimal (read: the choice of encoding might be suboptimal). We should benchmark it.
  • We may run into OOM issues because of the increased read buffer.
  • We do not want to perform a remote call (Get) for every buffer read (refill).
  • It would be nice to have column-specific buffers: many of them are tiny, but we will retain a large buffer for each, utilizing memory inefficiently (see the sketch after this list).
  • We store and process way too much redundant data of low value.
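
A rough sketch of what column-specific sizing could look like; deriving the size from the column chunk's compressed footprint is an assumption for illustration, not a committed design:

// Hypothetical: size each column's read buffer from the column chunk's
// compressed footprint instead of using one static size for every column.
func columnBufferSize(compressedChunkSize int64) int {
	const (
		minSize = 64 << 10 // tiny dictionary/metadata columns
		maxSize = 2 << 20  // matches parquetReadBufferSize below
	)
	switch {
	case compressedChunkSize < minSize:
		return minSize
	case compressedChunkSize > maxSize:
		return maxSize
	default:
		return int(compressedChunkSize)
	}
}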

@kolesnikovae kolesnikovae force-pushed the perf/static-parquet-page-size branch from 7d8894e to a37a025 on May 28, 2025 05:02
Comment on lines +26 to +30
const (
	// Each 2MB translates to an I/O read op.
	parquetReadBufferSize      = 2 << 20
	parquetPageWriteBufferSize = 1 << 20
)
Collaborator Author


It does make sense to increase parquetReadBufferSize further, e.g., to 4MB. However, we need to estimate the actual impact on memory consumption.

I'm less confident about parquetPageWriteBufferSize.
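
As a back-of-the-envelope estimate (all numbers below are illustrative assumptions, not measurements), the read-buffer footprint scales with the number of concurrent column iterators:

// Rough upper bound on read-buffer memory; inputs are assumed values.
func readBufferMemory(concurrentQueries, columnsPerQuery, bufferSize int) int {
	return concurrentQueries * columnsPerQuery * bufferSize
}

// e.g. 32 concurrent queries × 8 column iterators × 4MB ≈ 1GB of buffers.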

@kolesnikovae
Collaborator Author

I recently discovered quite an interesting page access log:

[image: page access log]

25MB, ~40 page reads, 9/10 of fetched rows are discarded. For the given query ({service_name="x", profile_type="y"}), the access should be fully sequential, without gaps.

@kolesnikovae
Collaborator Author

kolesnikovae commented May 29, 2025

Regarding the previous comment on gaps in row ranges: this is expected due to the row ordering – first by series_id, then by timestamp. As a result, when the time range isn't fully covered, we must skip some rows for each series. Unfortunately, due to the way parquet-go is implemented, this almost always leads to a request to object storage – it's perfectly possible that a full scan is cheaper in this case. Also, see #2192: I would consider swapping the timestamp and series_id sort order: series_id, timestamp -> timestamp, series_id.
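
For reference, a sketch of how the proposed order swap might be expressed with parquet-go's sorting options (the import path and column names are assumptions):

import "github.com/parquet-go/parquet-go"

// Current order: series_id first, then timestamp.
var current = parquet.SortingWriterConfig(
	parquet.SortingColumns(
		parquet.Ascending("series_id"),
		parquet.Ascending("timestamp"),
	),
)

// Proposed order: timestamp first, so a time-range predicate selects a
// contiguous row range instead of per-series fragments with gaps.
var proposed = parquet.SortingWriterConfig(
	parquet.SortingColumns(
		parquet.Ascending("timestamp"),
		parquet.Ascending("series_id"),
	),
)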

While investigating the issue, I discovered that #4036 causes the parquet reader to perform SeekToRow in an extremely inefficient way: on every call, it scans the entire column chunk from the very beginning up to the specified row (even if advancing by just one row); #4036 is fixed in #4209 – enabling the index significantly improves performance in pathological cases with high read amplification.
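
To illustrate the access pattern, a minimal sketch against parquet-go's Pages API; whether SeekToRow can jump straight to the target page depends on the offset index being available:

// readFrom positions a column chunk's page stream at the given row and
// reads the page containing it. Without the offset index, the reader may
// scan the chunk from the beginning on every call – the pathological
// behavior described above.
func readFrom(chunk parquet.ColumnChunk, row int64) error {
	pages := chunk.Pages()
	defer pages.Close()
	if err := pages.SeekToRow(row); err != nil {
		return err
	}
	page, err := pages.ReadPage()
	if err != nil {
		return err
	}
	_ = page.NumRows() // decode/process the page here
	return nil
}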

Another interesting finding: parquet.List of structs colocates fields on the same page. This means that we should never create an individual iterator for each of the columns in such cases (example): in fact, we fetch the same pages repeatedly.
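
One way to avoid the duplicate fetches, sketched as a hypothetical cache shared by the column iterators and keyed by the page's first row index; this illustrates the idea rather than a planned fix:

// pageCache deduplicates page fetches when several column iterators end up
// reading the same physical page (e.g. colocated fields of a list of structs).
type pageCache struct {
	mu    sync.Mutex
	pages map[int64]parquet.Page
}

func newPageCache() *pageCache {
	return &pageCache{pages: make(map[int64]parquet.Page)}
}

// get returns a cached page for the given first-row index, or fetches and
// caches it so that subsequent iterators reuse the same read.
func (c *pageCache) get(firstRow int64, fetch func() (parquet.Page, error)) (parquet.Page, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.pages[firstRow]; ok {
		return p, nil
	}
	p, err := fetch()
	if err != nil {
		return nil, err
	}
	c.pages[firstRow] = p
	return p, nil
}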

I'll create issues.

@kolesnikovae kolesnikovae marked this pull request as ready for review May 29, 2025 04:09
@kolesnikovae kolesnikovae requested a review from a team as a code owner May 29, 2025 04:09
@kolesnikovae kolesnikovae force-pushed the perf/static-parquet-page-size branch from a37a025 to c0624d5 on May 29, 2025 04:37
@simonswine
Contributor

> Regarding the previous comment on gaps in row ranges: this is expected due to the row ordering – first by series_id, then by timestamp. As a result, when the time range isn't fully covered, we must skip some rows for each series. Unfortunately, due to the way parquet-go is implemented, this almost always leads to a request to object storage – it's perfectly possible that a full scan is cheaper in this case. Also, see #2192: I would consider swapping the timestamp and series_id sort order: series_id, timestamp -> timestamp, series_id.

Theoretically, we could also separate different profile_types into different row groups, each of which we could order by timestamp, so we might be able to skip rows fairly efficiently at the edges while still keeping the same profile types physically together (see the sketch below). I also think the full-scan idea sounds promising.
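
A sketch of that layout under the assumption that rows arrive pre-sorted by (profile_type, timestamp); the row type is hypothetical, and parquet-go's Flush is used to cut a row-group boundary:

import (
	"io"

	"github.com/parquet-go/parquet-go"
)

// profileRow is an illustrative row type, not the actual schema.
type profileRow struct {
	ProfileType string `parquet:"profile_type"`
	Timestamp   int64  `parquet:"timestamp"`
	SeriesID    uint32 `parquet:"series_id"`
}

// writeGrouped writes one row group per profile type: each Flush closes the
// current row group, so every profile type stays physically together and is
// ordered by timestamp within its group.
func writeGrouped(w io.Writer, rows []profileRow) error {
	pw := parquet.NewGenericWriter[profileRow](w)
	for i, row := range rows {
		if i > 0 && rows[i-1].ProfileType != row.ProfileType {
			if err := pw.Flush(); err != nil {
				return err
			}
		}
		if _, err := pw.Write([]profileRow{row}); err != nil {
			return err
		}
	}
	return pw.Close()
}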

Contributor

@simonswine simonswine left a comment


LGTM! Thanks for tweaking the buffer sizes and for the investigation and trialing around it.

I think we should make sure we don't miss the ideas raised here and create follow-up tasks.

@kolesnikovae kolesnikovae merged commit 77754da into main May 29, 2025
24 checks passed
@kolesnikovae kolesnikovae deleted the perf/static-parquet-page-size branch May 29, 2025 11:27