[9.1](backport #45574) [libbeat][chore]: Updated "apache/arrow" library used in parquet reader to v18 #47087
Type of change
Proposed commit message
NOTE
Parquet v18 depends on a newer Google Cloud Storage library version.
This upgrade changed responses in the GCS tests: SDK methods used in some scenarios now return more context on a 404 error. The affected tests have been updated to match this change.
Checklist
CHANGELOG.next.asciidoc
or CHANGELOG-developer.next.asciidoc
Disruptive User Impact
Potentially larger memory footprint when using parquet decoding at smaller scales.
Author's Checklist
How to test this PR locally
Related issues
Use cases
Much faster processing times when using parquet decoding at larger scales, with the trade-off that small-scale usage becomes more demanding in terms of memory.
Screenshots
Logs
Analysis:
NOTE: The following summary was generated by feeding the relevant benchmark data into an LLM and was then edited manually.
Summary
v18 is ~2.6× faster than v17 for large-scale data processing.
This performance gain comes at the cost of ~2.6× more memory
usage and ~1.5× more allocations.
For smaller files (e.g., vpc_flow.parquet), v18 is ~3× slower and uses ~2.5× more memory.
v18 scales well with more CPU cores, showing up to ~4.5×
performance improvement from 1 → 10 cores.
v18 is best for high-throughput scenarios with ample memory.
For memory-constrained or small-file workloads, its overhead
is significant if batch_size is not constrained.
Benchmark Environment
github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet
goos: darwin, goarch: arm64
1. Large File Processing – taxi_2023_1.parquet
Single large Parquet file (47.7 MB), batch size = 10,000.
Analysis:
v18 is ~2.56× faster, but uses ~2.63× more memory and ~1.51× more allocations.
2. Small File Processing – vpc_flow.parquet
Smaller file (33 KB), batch size = 1,000, at 4 CPU cores.
Analysis:
v18 is ~3.15× slower, uses ~2.46× more memory, and makes ~3.43× more allocations.
v18 Library – CPU Scaling & Parallelism
Scenario: Processing multiple files in parallel (batch size = 1,000).
Benchmark: BenchmarkReadParquet/Process_multiple_files_parallelly_in_batches_of_1000
Serial vs. Parallel Processing
Scenario: Processing a single file (batch size = 1,000).
Benchmark: Read_a_single_row_from_a_single_file...
Analysis:
Parallel implementation is ~2.28× faster, likely due to parallelizing
row-group reads.
Memory & Allocation Analysis
Memory remains stable across CPU counts but grows with batch size.
Scenario: Processing single files (Serial, 4 cores)
Analysis:
A 10× larger batch increases memory ~2.5× but barely changes allocation count,
indicating efficient buffer reuse.
Conclusion:
Performance vs. memory is a trade-off. With v18, for smaller workloads, batch_size plays a significant role in the memory footprint: small workloads with a large batch_size will consume significantly more memory than with v17.
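For memory-constrained deployments, keeping the decoder's batch size small limits the footprint. As a purely illustrative example (field names and values are assumptions, not taken from this PR; check the input's documentation for the exact keys), a GCS input configuration with parquet decoding might look like:

```yaml
filebeat.inputs:
  - type: gcs
    project_id: my-project            # illustrative value
    buckets:
      - name: my-bucket               # illustrative value
    decoding.codec.parquet:
      enabled: true
      # Keep batch_size small on memory-constrained hosts; larger
      # values trade memory for throughput with the v18 library.
      batch_size: 1000
```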
This is an automatic backport of pull request #45574 done by [Mergify](https://mergify.com).