[9.1](backport #45574) [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 #47087
Type of change
Proposed commit message
NOTE
Parquet v18 depends on a newer Google Cloud Storage library version.
This upgrade changed responses in the GCS tests: SDK methods used in some scenarios now return more context on a 404 error. The affected tests have been updated to match this change.
Checklist
CHANGELOG.next.asciidoc
or CHANGELOG-developer.next.asciidoc
Disruptive User Impact
Potentially larger memory footprint when using parquet decoding at smaller scales.
Author's Checklist
How to test this PR locally
Related issues
Use cases
Much faster processing times when using parquet decoding at larger scales, with the trade-off that small-scale usage becomes more demanding in terms of memory.
Screenshots
Logs
Analysis:
NOTE: The following summary was generated by feeding the relevant benchmark data into an LLM and was then edited manually.
Summary
v18 is ~2.6× faster than v17 for large-scale data processing.
This performance gain comes at the cost of ~2.6× more memory
usage and ~1.5× more allocations.
For smaller files (e.g., vpc_flow.parquet), v18 is ~3× slower and uses ~2.5× more memory.
v18 scales well with more CPU cores, showing up to ~4.5×
performance improvement from 1 → 10 cores.
v18 is best for high-throughput scenarios with ample memory.
For memory-constrained or small-file workloads, its overhead
is significant if batch_size is not constrained.
Benchmark Environment
github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet
goos: darwin, goarch: arm64
1. Large File Processing – taxi_2023_1.parquet
Single large Parquet file (47.7 MB), batch size = 10,000.
Analysis:
v18 is ~2.56× faster, but uses ~2.63× more memory and ~1.51× more allocations.
2. Small File Processing – vpc_flow.parquet
Smaller file (33 KB), batch size = 1,000, at 4 CPU cores.
Analysis:
v18 is ~3.15× slower, uses ~2.46× more memory, and makes ~3.43× more allocations.
v18 Library – CPU Scaling & Parallelism
Scenario: Processing multiple files in parallel (batch size = 1,000).
Benchmark: BenchmarkReadParquet/Process_multiple_files_parallelly_in_batches_of_1000
Serial vs. Parallel Processing
Scenario: Processing a single file (batch size = 1,000).
Benchmark: Read_a_single_row_from_a_single_file...
Analysis:
Parallel implementation is ~2.28× faster, likely due to parallelizing
row-group reads.
Memory & Allocation Analysis
Memory remains stable across CPU counts but grows with batch size.
Scenario: Processing single files (Serial, 4 cores)
Analysis:
A 10× larger batch increases memory ~2.5× but barely changes allocation count,
indicating efficient buffer reuse.
Conclusion:
Performance vs. memory is a trade-off. With v18, for smaller workloads, batch_size plays a significant role in the memory footprint: small workloads with a large batch_size will consume significantly more memory than with v17.
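For memory-constrained deployments, keeping the decoder's batch size small limits the footprint. As a purely illustrative example (field names and values are assumptions, not taken from this PR; check the input's documentation for the exact keys), a GCS input configuration with parquet decoding might look like:

```yaml
filebeat.inputs:
  - type: gcs
    project_id: my-project            # illustrative value
    buckets:
      - name: my-bucket               # illustrative value
    decoding.codec.parquet:
      enabled: true
      # Keep batch_size small on memory-constrained hosts; larger
      # values trade memory for throughput with the v18 library.
      batch_size: 1000
```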
This is an automatic backport of pull request #45574 done by [Mergify](https://mergify.com).