perf(v2): static parquet page buffer size #4208


Merged
kolesnikovae merged 1 commit into main from perf/static-parquet-page-size on May 29, 2025

Conversation

kolesnikovae
Collaborator

@kolesnikovae kolesnikovae commented May 28, 2025

In this PR, I'm increasing the parquet buffers and making them static.

Currently, parquet buffer sizes are determined based on the table size, with lower and upper bounds of 64KB and 1MB, respectively. This approach helps prevent high memory consumption in the segment-writer, compaction-worker, and query-backend services, which typically handle many small tables concurrently. While this strategy works well in the common case – keeping memory usage low – it can negatively impact query performance in certain scenarios and is generally overly conservative.
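
For illustration, a minimal sketch of the sizing scheme this PR replaces; the function name and the scaling factor are hypothetical, only the 64KB/1MB clamping mirrors the described behavior:

const (
	minBufferSize = 64 << 10 // 64KB lower bound
	maxBufferSize = 1 << 20  // 1MB upper bound
)

// dynamicBufferSize derives a page buffer size from the table size and
// clamps it to [minBufferSize, maxBufferSize]. This PR replaces the
// computation with static constants (see the const block in the diff below).
func dynamicBufferSize(tableSize int64) int {
	size := int(tableSize / 16) // hypothetical scaling factor
	if size < minBufferSize {
		return minBufferSize
	}
	if size > maxBufferSize {
		return maxBufferSize
	}
	return size
}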


An example of what performance degradation looks like and how it can be identified.

Consider a query trace with a query-backend Invoke span (3ac41dbe32f4c111c67181c236628a7e): it spent 2.3s of CPU time but took 17s of wall-clock time. The span profile:
[image: span profile]

Here we see that CPU time was spent in object storage I/O and parquet decoding (the amd64-avx2 and arm64-naïve implementations demonstrate comparable performance; the profile was collected on arm64).

The bottleneck is RepeatedRowColumnIterator, where we actually read pages – 52 events (the last one is a summary):
[images: page read event log]

The event log suggests that:

  • Fetched around 50MB of data, ~4.5K profiles.
  • Fetched 50 adjacent pages => the data range is contiguous.

This matches expectations: 1MB pages, a contiguous data range, and a data chunk of manageable size (50MB). Although I find it weird that the number of rows per page varies this much.

The change alone won't resolve the problem: we're trading memory for I/O throughput.

@kolesnikovae kolesnikovae force-pushed the perf/static-parquet-page-size branch from aa1a4b3 to 7d8894e on May 28, 2025 04:45
@kolesnikovae
Collaborator Author

kolesnikovae commented May 28, 2025

  • The decoding performance might be suboptimal (read: the choice of encoding might be suboptimal). We should benchmark it.
  • We may run into OOM issues because of the increased read buffer.
  • We do not want to perform a remote call (Get) for every buffer read (refill).
  • It would be nice to have column-specific buffers: many of them are tiny, but we will retain a large buffer for each, utilizing memory inefficiently (see the sketch after this list).
  • We store and process way too much redundant data of low value.
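
A rough sketch of what column-specific sizing could look like; deriving the size from the column chunk's compressed footprint is an assumption for illustration, not a committed design:

// Hypothetical: size each column's read buffer from the column chunk's
// compressed footprint instead of using one static size for every column.
func columnBufferSize(compressedChunkSize int64) int {
	const (
		minSize = 64 << 10 // tiny dictionary/metadata columns
		maxSize = 2 << 20  // matches parquetReadBufferSize below
	)
	switch {
	case compressedChunkSize < minSize:
		return minSize
	case compressedChunkSize > maxSize:
		return maxSize
	default:
		return int(compressedChunkSize)
	}
}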

@kolesnikovae kolesnikovae force-pushed the perf/static-parquet-page-size branch from 7d8894e to a37a025 on May 28, 2025 05:02
Comment on lines +26 to +30
const (
	// Each 2MB translates to an I/O read op.
	parquetReadBufferSize      = 2 << 20
	parquetPageWriteBufferSize = 1 << 20
)
Collaborator Author


It does make sense to increase parquetReadBufferSize further, e.g., to 4MB. However, we need to estimate the actual impact on memory consumption.

I'm less confident about parquetPageWriteBufferSize.
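
As a back-of-the-envelope estimate (all numbers below are illustrative assumptions, not measurements), the read-buffer footprint scales with the number of concurrent column iterators:

// Rough upper bound on read-buffer memory; inputs are assumed values.
func readBufferMemory(concurrentQueries, columnsPerQuery, bufferSize int) int {
	return concurrentQueries * columnsPerQuery * bufferSize
}

// e.g. 32 concurrent queries × 8 column iterators × 4MB ≈ 1GB of buffers.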

@kolesnikovae
Collaborator Author

I recently discovered quite an interesting page access log:

[image: page access log]

25MB, ~40 page reads, 9/10 of fetched rows are discarded. For the given query ({service_name="x", profile_type="y"}), the access should be fully sequential, without gaps.

@kolesnikovae
Collaborator Author

kolesnikovae commented May 29, 2025

Regarding the previous comment on gaps in row ranges: this is expected due to the row ordering – first by series_id, then by timestamp. As a result, when the time range isn't fully covered, we must skip some rows for each series. Unfortunately, due to the way parquet-go is implemented, this almost always leads to a request to object storage – it's perfectly possible that a full scan is cheaper in this case. Also, see #2192: I would consider swapping the timestamp and series_id sort order: series_id, timestamp -> timestamp, series_id.
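
For reference, a sketch of how the proposed order swap might be expressed with parquet-go's sorting options (the import path and column names are assumptions):

import "github.com/parquet-go/parquet-go"

// Current order: series_id first, then timestamp.
var current = parquet.SortingWriterConfig(
	parquet.SortingColumns(
		parquet.Ascending("series_id"),
		parquet.Ascending("timestamp"),
	),
)

// Proposed order: timestamp first, so a time-range predicate selects a
// contiguous row range instead of per-series fragments with gaps.
var proposed = parquet.SortingWriterConfig(
	parquet.SortingColumns(
		parquet.Ascending("timestamp"),
		parquet.Ascending("series_id"),
	),
)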

While investigating the issue, I discovered that #4036 causes the parquet reader to perform SeekToRow in an extremely inefficient way: on every call, it scans the entire column chunk from the very beginning up to the specified row (even if advancing by just one row); #4036 is fixed in #4209 – enabling the index significantly improves performance in pathological cases with high read amplification.
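
To illustrate the access pattern, a minimal sketch against parquet-go's Pages API; whether SeekToRow can jump straight to the target page depends on the offset index being available:

// readFrom positions a column chunk's page stream at the given row and
// reads the page containing it. Without the offset index, the reader may
// scan the chunk from the beginning on every call – the pathological
// behavior described above.
func readFrom(chunk parquet.ColumnChunk, row int64) error {
	pages := chunk.Pages()
	defer pages.Close()
	if err := pages.SeekToRow(row); err != nil {
		return err
	}
	page, err := pages.ReadPage()
	if err != nil {
		return err
	}
	_ = page.NumRows() // decode/process the page here
	return nil
}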

Another interesting finding: parquet.List of structs colocates fields on the same page. This means that we should never create an individual iterator for each of the columns in such cases (example): in fact, we fetch the same pages repeatedly.
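
One way to avoid the duplicate fetches, sketched as a hypothetical cache shared by the column iterators and keyed by the page's first row index; this illustrates the idea rather than a planned fix:

// pageCache deduplicates page fetches when several column iterators end up
// reading the same physical page (e.g. colocated fields of a list of structs).
type pageCache struct {
	mu    sync.Mutex
	pages map[int64]parquet.Page
}

func newPageCache() *pageCache {
	return &pageCache{pages: make(map[int64]parquet.Page)}
}

// get returns a cached page for the given first-row index, or fetches and
// caches it so that subsequent iterators reuse the same read.
func (c *pageCache) get(firstRow int64, fetch func() (parquet.Page, error)) (parquet.Page, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.pages[firstRow]; ok {
		return p, nil
	}
	p, err := fetch()
	if err != nil {
		return nil, err
	}
	c.pages[firstRow] = p
	return p, nil
}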

I'll create issues.

@kolesnikovae kolesnikovae marked this pull request as ready for review May 29, 2025 04:09
@kolesnikovae kolesnikovae requested a review from a team as a code owner May 29, 2025 04:09
@kolesnikovae kolesnikovae force-pushed the perf/static-parquet-page-size branch from a37a025 to c0624d5 on May 29, 2025 04:37
@simonswine
Contributor

> Regarding the previous comment on gaps in row ranges: this is expected due to the row ordering – first by series_id, then by timestamp. As a result, when the time range isn't fully covered, we must skip some rows for each series. Unfortunately, due to the way parquet-go is implemented, this almost always leads to a request to object storage – it's perfectly possible that a full scan is cheaper in this case. Also, see #2192: I would consider swapping the timestamp and series_id sort order: series_id, timestamp -> timestamp, series_id.

Theoretically, we could also separate different profile_types into different row groups, each of which we could order by timestamp, so we might be able to skip rows fairly efficiently at the edges while still keeping the same profile types physically together (see the sketch below). I also think the full-scan idea sounds promising.
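
A sketch of that layout under the assumption that rows arrive pre-sorted by (profile_type, timestamp); the row type is hypothetical, and parquet-go's Flush is used to cut a row-group boundary:

import (
	"io"

	"github.com/parquet-go/parquet-go"
)

// profileRow is an illustrative row type, not the actual schema.
type profileRow struct {
	ProfileType string `parquet:"profile_type"`
	Timestamp   int64  `parquet:"timestamp"`
	SeriesID    uint32 `parquet:"series_id"`
}

// writeGrouped writes one row group per profile type: each Flush closes the
// current row group, so every profile type stays physically together and is
// ordered by timestamp within its group.
func writeGrouped(w io.Writer, rows []profileRow) error {
	pw := parquet.NewGenericWriter[profileRow](w)
	for i, row := range rows {
		if i > 0 && rows[i-1].ProfileType != row.ProfileType {
			if err := pw.Flush(); err != nil {
				return err
			}
		}
		if _, err := pw.Write([]profileRow{row}); err != nil {
			return err
		}
	}
	return pw.Close()
}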

Contributor

@simonswine simonswine left a comment


LGTM! Thanks for tweaking the buffer sizes and for the investigation and trialing around it.

I think we should make sure we don't miss the ideas raised here and create follow-up tasks.

@kolesnikovae kolesnikovae merged commit 77754da into main May 29, 2025
24 checks passed
@kolesnikovae kolesnikovae deleted the perf/static-parquet-page-size branch May 29, 2025 11:27