Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Oct 14, 2025

Type of change

  • Enhancement

Proposed commit message

 Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs 
 tests as a byproduct of errors introduced with newer storage library versions.

NOTE

Parquet v18 depends on a newer version of the Google Cloud Storage library:

cloud.google.com/go/storage v1.49.0 -> cloud.google.com/go/storage v1.52.0

This upgrade changed responses seen in the GCS tests: SDK methods used in some scenarios now return more context on a 404 error. The respective tests have been updated to match, as illustrated below.
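
For illustration only (this is not the actual test change from the PR diff; the object name and wrapping message are assumptions), a minimal sketch of how a check can tolerate extra context wrapped around the SDK's not-found sentinel:

```go
package main

import (
	"errors"
	"fmt"

	"cloud.google.com/go/storage"
)

func main() {
	// Newer storage versions wrap the not-found sentinel with extra
	// context, so exact-message comparisons against the old text break.
	// The object name here is purely illustrative.
	err := fmt.Errorf("reading object %q: %w", "testdata/missing.ndjson", storage.ErrObjectNotExist)

	// errors.Is (or a substring match on the error text) still passes
	// regardless of the added context around the sentinel.
	if errors.Is(err, storage.ErrObjectNotExist) {
		fmt.Println("object not found:", err)
	}
}
```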

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Potentially larger memory footprint when using parquet decoding at smaller scales.

Author's Checklist


How to test this PR locally
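
The PR does not prescribe steps, but one option (an assumption, not from the PR itself) is to run the parquet reader's benchmarks with memory stats across the same CPU counts used in the analysis below:

```sh
cd x-pack/libbeat/reader/parquet
go test -run='^$' -bench=BenchmarkReadParquet -benchmem -cpu=1,2,4,8,10
```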

Related issues

Use cases

Much faster processing times when using parquet decoding at larger scales, at the cost of smaller-scale usage becoming more demanding in terms of memory.

Screenshots

Logs

Analysis:

NOTE: The following summary was generated by feeding the relevant benchmark data into an LLM and was then edited manually.

+----------------------------------------------------+
| Parquet-go Library: v17 vs. v18 Benchmark Analysis |
+----------------------------------------------------+

Summary

  • Massive Speed Improvement for Large Files:
    v18 is ~2.6× faster than v17 for large-scale data processing.
  • Increased Memory Consumption:
    This performance gain comes at the cost of ~2.6× more memory
    usage and ~1.5× more allocations.
  • Performance Regression on Smaller Files:
    For smaller files (e.g., vpc_flow.parquet), v18 is ~3× slower
    and uses ~2.5× more memory.
  • High CPU Scaling:
    v18 scales well with more CPU cores, showing up to ~4.5×
    performance improvement from 1 → 10 cores.
  • Conclusion:
    v18 is best for high-throughput scenarios with ample memory.
    For memory-constrained or small-file workloads, its overhead
    is significant if batch_size is not constrained.

Benchmark Environment

  • Package: github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet
  • Platform: goos: darwin, goarch: arm64
  • CPU: Apple M1 Max
  • Concurrency Levels: 1, 2, 4, 8, 10

1. Large File Processing – taxi_2023_1.parquet

Single large Parquet file (47.7 MB), batch size = 10,000.

Version  Cores  Time per Op (ns/op)  Mem per Op (B/op)            Allocs per Op
v17      10     7,113,368,875        7,162,300,232 (~7.16 GB)     40,872,797
v18      10     2,779,433,542        18,869,783,112 (~18.87 GB)   61,709,457

Analysis:
v18 is ~2.56× faster, but uses ~2.63× more memory and ~1.51× more allocations.


2. Small File Processing – vpc_flow.parquet

Smaller file (33 KB), batch size = 1,000, at 4 CPU cores.

Version  Cores  Time per Op (ns/op)  Mem per Op (B/op)        Allocs per Op
v17      4      7,139,663            15,266,732 (~15.27 MB)   55,042
v18      4      22,460,141           37,518,437 (~37.52 MB)   188,593

Analysis:
v18 is ~3.15× slower, uses ~2.46× more memory, and makes ~3.43×
more allocations.


v18 Library – CPU Scaling & Parallelism

Scenario: Processing multiple files in parallel (batch size = 1,000).
Benchmark: BenchmarkReadParquet/Process_multiple_files_parallelly_in_batches_of_1000

Cores  Time per Op (ns/op)  Speedup vs. 1 Core
1      41,729,156           1.00×
2      20,918,817           1.99×
4      12,080,647           3.45×
8      9,609,640            4.34×
10     9,251,827            4.51×

Serial vs. Parallel Processing

Scenario: Processing a single file (batch size = 1,000).
Benchmark: Read_a_single_row_from_a_single_file...

Mode      Cores  Time per Op (ns/op)
Serial    10     2,007,353
Parallel  10     880,824

Analysis:
Parallel implementation is ~2.28× faster, likely due to parallelizing
row-group reads.


Memory & Allocation Analysis

Memory remains stable across CPU counts but grows with batch size.

Scenario: Processing single files (Serial, 4 cores)

Benchmark                   Batch Size  Mem per Op (B/op)  Allocs per Op
...in_batches_of_1000-4     1,000       5,537,257          22,416
...in_batches_of_10000-4    10,000      13,670,323         22,460

Analysis:
A 10× larger batch increases memory ~2.5× but barely changes allocation count,
indicating efficient buffer reuse.


Conclusion:
Performance vs. memory is a trade-off. With v18, batch_size plays a significant role in the memory footprint of smaller workloads: a small workload with a large batch_size will consume significantly more memory than under v17.
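
For context, a minimal sketch of where batch size and parallelism plug into the v18 API, assuming the arrow-go v18 module path and the pqarrow reader (illustrative only, not the beats reader's actual code; the file name is taken from the benchmark above):

```go
package main

import (
	"context"
	"fmt"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	// Open the parquet file (path is illustrative).
	rdr, err := file.OpenParquetFile("taxi_2023_1.parquet", false)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	// BatchSize bounds how many rows are materialized per record,
	// which is the main lever on the memory footprint discussed above;
	// Parallel enables concurrent column reads.
	props := pqarrow.ArrowReadProperties{
		Parallel:  true,
		BatchSize: 1000,
	}
	fr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		panic(err)
	}

	// nil column/row-group selections read the whole file in batches.
	rr, err := fr.GetRecordReader(context.Background(), nil, nil)
	if err != nil {
		panic(err)
	}
	defer rr.Release()

	for rr.Next() {
		rec := rr.Record()
		fmt.Println("rows in batch:", rec.NumRows())
	}
}
```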


This is an automatic backport of pull request #45574 done by [Mergify](https://mergify.com).

…er to v18 (#45574)

Updated "apache/arrow" library used in parquet reader to v18 and fixed gcs
 tests as a byproduct of errors introduced with newer storage library versions.

(cherry picked from commit b7c5a85)

# Conflicts:
#	NOTICE.txt
#	go.mod
#	go.sum
@mergify mergify bot requested a review from a team as a code owner October 14, 2025 13:03
@mergify mergify bot added the backport label Oct 14, 2025
@mergify mergify bot added the conflicts label (There is a conflict in the backported pull request) Oct 14, 2025
Contributor Author

mergify bot commented Oct 14, 2025

Cherry-pick of b7c5a85 has failed:

On branch mergify/bp/9.1/pr-45574
Your branch is up to date with 'origin/9.1'.

You are currently cherry-picking commit b7c5a8509.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   CHANGELOG-developer.next.asciidoc
	modified:   x-pack/filebeat/input/gcs/input_test.go
	modified:   x-pack/libbeat/reader/parquet/parquet.go
	modified:   x-pack/libbeat/reader/parquet/parquet_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   NOTICE.txt
	both modified:   go.mod
	both modified:   go.sum

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@botelastic botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Oct 14, 2025
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build: Re-trigger the docs validation. (Use unformatted text in the comment!)

@elasticmachine
Collaborator

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@botelastic botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Oct 14, 2025
@khushijain21 khushijain21 enabled auto-merge (squash) October 15, 2025 06:28
@khushijain21 khushijain21 merged commit 0fc0100 into 9.1 Oct 15, 2025
205 of 208 checks passed
@khushijain21 khushijain21 deleted the mergify/bp/9.1/pr-45574 branch October 15, 2025 08:13
@ShourieG
Copy link
Contributor

ShourieG commented Oct 15, 2025

@khushijain21, why did we merge this before fixing the conflicts and the extra CHANGELOG entries? The CHANGELOG needed to be cleaned up: in this case, the Mergify backport process, via cherry-pick, brought in 3 extra entries that are not valid for this backport.


Labels

  • backport
  • conflicts (There is a conflict in the backported pull request)
  • enhancement
  • libbeat:reader
  • Team:Security-Service Integrations (Security Service Integrations Team)
