
Conversation

PacificViking
Contributor

@PacificViking PacificViking commented Aug 4, 2025

Description

Upgrades the outdated Elasticsearch version and adopts the logsdb methodology, as mentioned in #7.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

New Elasticsearch benchmark runs.

Summary by CodeRabbit

New Features

  • Added support for extracting and mapping timestamp fields during Elasticsearch ingestion.
  • Enhanced error handling in benchmark orchestration with optional attach-on-error functionality.
  • Added command-line filtering to run benchmarks on specific datasets.

Bug Fixes

  • Ensured correct handling of shell commands with output redirection in Gzip benchmarking.

Refactor

  • Updated class names for consistency across benchmarking modules.
  • Updated Presto and ClickHouse references to specific versions and improved Docker image reproducibility.
  • Modified paths and configuration to match updated Presto and Elasticsearch versions.
  • Upgraded Elasticsearch to version 9.0.3 and switched to official base images for multiple components.
  • Improved Dockerfiles by standardizing base images and package versions.

Documentation

  • Updated methodology documents to reflect new software versions and features.

@PacificViking PacificViking requested a review from a team as a code owner August 4, 2025 06:52

coderabbitai bot commented Aug 4, 2025

Walkthrough

This update revises Dockerfiles to use official or version-pinned base images, upgrades ClickHouse and Elasticsearch versions, and updates corresponding documentation. Several benchmark class names are renamed for consistency. The Elasticsearch ingestion pipeline is enhanced to support timestamp extraction and index template application. The benchmark runner script is refactored for improved error handling and dataset selection.

Changes

Cohort / File(s) Change Summary
ClickHouse Version Pinning & Naming
assets/clickhouse/Dockerfile, assets/clickhouse/main.py, assets/clickhouse/methodology.md
Dockerfile now pins ClickHouse to a specific version (25.6) by digest. The methodology doc and code references are updated to match. Benchmark class renamed from clickhouse_native_json_bench to clickhouse_bench.
CLP Benchmark Naming
assets/clp/main.py
Benchmark class renamed from clp_s_bench to clp_bench.
Elasticsearch Upgrade & Ingestion Enhancements
assets/elasticsearch/Dockerfile, assets/elasticsearch/ingest.py, assets/elasticsearch/main.py, assets/elasticsearch/methodology.md
Elasticsearch upgraded from 8.6.2 to 9.0.3; Dockerfile uses official Ubuntu base. Ingestion script adds timestamp extraction via path, index template creation, and _op_type field. Methodology doc updated for version and logs data stream note. Ingestion method now passes timestamp path and logsdb flag.
Gzip Dockerfile & Shell Execution
assets/gzip/Dockerfile, assets/gzip/main.py
Dockerfile uses official Ubuntu base and allows downgrades. All shell commands in main.py now explicitly use shell=True for proper redirection.
Overhead Test & Template Dockerfiles
assets/overhead_test/Dockerfile, assets/template/Dockerfile
Both Dockerfiles now use the official Ubuntu base image instead of a custom one.
Presto CLP Benchmark Naming
assets/presto_clp/main.py
Benchmark class renamed from clp_presto_bench to presto_clp_bench.
Presto Parquet Versioning and Paths
assets/presto_parquet/Dockerfile, assets/presto_parquet/include/etc_worker/config.properties, assets/presto_parquet/main.py
Dockerfile uses a specific Presto version and commit hashes, updates build process and dependencies. Config and code paths updated to match new Presto version/build layout.
Benchmark Runner Refactor
scripts/benchall.py
Benchmark imports updated for new class names. Only the Elasticsearch benchmark is enabled. Adds dataset name filtering, improved error handling with attach_on_error, and updates the run function signature.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant benchall.py
    participant BenchmarkClass
    participant DockerContainer

    User->>benchall.py: Run script with optional dataset argument
    benchall.py->>BenchmarkClass: Initialize benchmark (e.g., elasticsearch_bench)
    loop For each dataset
        benchall.py->>BenchmarkClass: run(bench_target, attach, attach_on_error)
        BenchmarkClass->>DockerContainer: Execute ingestion/search commands
        alt Error occurs
            benchall.py->>DockerContainer: Attach if attach_on_error is True
        end
    end
sequenceDiagram
    participant ingest.py
    participant Elasticsearch
    participant Dataset

    ingest.py->>Dataset: Read each JSON line
    ingest.py->>ingest.py: pop_by_path(line, timestamp_path)
    alt Timestamp found
        ingest.py->>ingest.py: Set @timestamp to extracted value
    else Not found
        ingest.py->>ingest.py: Set @timestamp to 0
    end
    ingest.py->>Elasticsearch: Create index template (logsdb, mapping, priority)
    ingest.py->>Elasticsearch: Bulk ingest with _op_type="create"

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • kirkrodrigues



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

🔭 Outside diff range comments (1)
assets/clickhouse/methodology.md (1)

3-11: Documentation still references the latest tag—update for version 25.6

Lines 9-11 state the benchmark uses clickhouse/clickhouse-server:latest, which conflicts with the newly-pinned 25.6 image (and could mislead future maintainers).

Please replace the reference with the exact tag/digest used in assets/clickhouse/Dockerfile to keep the methodology accurate.

♻️ Duplicate comments (1)
assets/template/Dockerfile (1)

5-5: Same dependency check & image-size clean-up applies here

The observations made for assets/overhead_test/Dockerfile equally apply to this template Dockerfile: verify run-time deps did not vanish and purge the apt lists after installation to avoid a bloated benchmark image.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22783a0 and 03772f1.

📒 Files selected for processing (17)
  • assets/clickhouse/Dockerfile (1 hunks)
  • assets/clickhouse/main.py (2 hunks)
  • assets/clickhouse/methodology.md (2 hunks)
  • assets/clp/main.py (2 hunks)
  • assets/elasticsearch/Dockerfile (2 hunks)
  • assets/elasticsearch/ingest.py (2 hunks)
  • assets/elasticsearch/main.py (1 hunks)
  • assets/elasticsearch/methodology.md (2 hunks)
  • assets/gzip/Dockerfile (1 hunks)
  • assets/gzip/main.py (1 hunks)
  • assets/overhead_test/Dockerfile (1 hunks)
  • assets/presto_clp/main.py (2 hunks)
  • assets/presto_parquet/Dockerfile (2 hunks)
  • assets/presto_parquet/include/etc_worker/config.properties (1 hunks)
  • assets/presto_parquet/main.py (1 hunks)
  • assets/template/Dockerfile (1 hunks)
  • scripts/benchall.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: PacificViking
PR: y-scope/log-archival-bench#5
File: assets/gzip/Dockerfile:0-0
Timestamp: 2025-07-30T13:55:45.517Z
Learning: In benchmarking Docker files, the --allow-downgrades flag in apt-get install commands is intentionally used to ensure benchmarks can run on specified package versions even when Ubuntu has newer versions, maintaining consistency for reliable benchmark results.
📚 Learning: in benchmarking docker files, the --allow-downgrades flag in apt-get install commands is intentional...
Learnt from: PacificViking
PR: y-scope/log-archival-bench#5
File: assets/gzip/Dockerfile:0-0
Timestamp: 2025-07-30T13:55:45.517Z
Learning: In benchmarking Docker files, the --allow-downgrades flag in apt-get install commands is intentionally used to ensure benchmarks can run on specified package versions even when Ubuntu has newer versions, maintaining consistency for reliable benchmark results.

Applied to files:

  • assets/overhead_test/Dockerfile
  • assets/template/Dockerfile
  • assets/presto_parquet/Dockerfile
  • assets/gzip/Dockerfile
🧬 Code Graph Analysis (4)
assets/clp/main.py (1)
src/template.py (1)
  • Benchmark (55-420)
assets/presto_clp/main.py (1)
src/template.py (1)
  • Benchmark (55-420)
assets/clickhouse/main.py (1)
src/template.py (1)
  • Benchmark (55-420)
assets/gzip/main.py (1)
src/template.py (2)
  • docker_execute (240-272)
  • bench_start (293-307)
🪛 Checkov (3.2.334)
assets/clickhouse/Dockerfile

[LOW] 1-1: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-1: Ensure that a user for the container has been created

(CKV_DOCKER_3)

assets/presto_parquet/Dockerfile

[LOW] 1-62: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-62: Ensure that a user for the container has been created

(CKV_DOCKER_3)



assets/elasticsearch/Dockerfile

[LOW] 1-24: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-24: Ensure that a user for the container has been created

(CKV_DOCKER_3)



🪛 Hadolint (2.12.0)
assets/presto_parquet/Dockerfile

[info] 43-43: Multiple consecutive RUN instructions. Consider consolidation.

(DL3059)


[info] 44-44: Multiple consecutive RUN instructions. Consider consolidation.

(DL3059)


[info] 58-58: Delete the apt-get lists after installing something

(DL3009)


[warning] 60-60: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)


[info] 60-60: Avoid additional packages by specifying --no-install-recommends

(DL3015)


[info] 60-60: Multiple consecutive RUN instructions. Consider consolidation.

(DL3059)

assets/gzip/Dockerfile

[warning] 8-8: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)


[info] 8-8: Delete the apt-get lists after installing something

(DL3009)


[info] 8-8: Avoid additional packages by specifying --no-install-recommends

(DL3015)

assets/elasticsearch/Dockerfile

[info] 14-14: Avoid additional packages by specifying --no-install-recommends

(DL3015)


[info] 14-14: Delete the apt-get lists after installing something

(DL3009)


[warning] 14-14: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)


[warning] 24-24: Pin versions in pip. Instead of pip install <package> use pip install <package>==<version> or pip install --requirement <requirements file>

(DL3013)


[warning] 24-24: Avoid use of cache directory with pip. Use pip install --no-cache-dir <package>

(DL3042)

🪛 Ruff (0.12.2)
assets/presto_parquet/main.py

72-72: Missing return type annotation for public function hive_execute

(ANN201)


72-72: Missing type annotation for function argument query

(ANN001)


72-72: Boolean default positional argument in function definition

(FBT002)


72-72: Missing type annotation for function argument check

(ANN001)


75-75: Missing return type annotation for public function ingest

Add return type annotation: None

(ANN201)

assets/elasticsearch/main.py

39-39: Trailing comma missing

Add trailing comma

(COM812)

assets/gzip/main.py

40-40: Function call with shell=True parameter identified, security issue

(S604)


42-42: Missing return type annotation for public function search

Add return type annotation: int

(ANN201)


42-42: Missing type annotation for function argument query

(ANN001)


46-46: Use of assert detected

(S101)


48-48: Function call with shell=True parameter identified, security issue

(S604)


50-50: Missing return type annotation for public function bench_search

Add return type annotation: None

(ANN201)


50-50: Boolean default positional argument in function definition

(FBT002)


50-50: Missing type annotation for function argument cold

(ANN001)


56-56: Function call with shell=True parameter identified, security issue

(S604)

assets/elasticsearch/ingest.py

13-13: Missing return type annotation for public function pop_by_path

(ANN201)


13-13: Missing type annotation for function argument obj

(ANN001)


13-13: Missing type annotation for function argument path

(ANN001)


81-81: Missing return type annotation for public function ingest_dataset

Add return type annotation: None

(ANN201)


89-89: Trailing comma missing

Add trailing comma

(COM812)


96-96: Trailing comma missing

Add trailing comma

(COM812)


97-97: Trailing comma missing

Add trailing comma

(COM812)


98-98: Trailing comma missing

Add trailing comma

(COM812)


99-99: Trailing comma missing

Add trailing comma

(COM812)


101-101: Trailing comma missing

Add trailing comma

(COM812)

scripts/benchall.py

3-3: assets.clp.main.clp_bench imported but unused

Remove unused import: assets.clp.main.clp_bench

(F401)


4-4: assets.clickhouse.main.clickhouse_bench imported but unused

Remove unused import: assets.clickhouse.main.clickhouse_bench

(F401)


5-5: assets.sparksql.main.sparksql_bench imported but unused

Remove unused import: assets.sparksql.main.sparksql_bench

(F401)


6-6: assets.presto_parquet.main.parquet_bench imported but unused

Remove unused import: assets.presto_parquet.main.parquet_bench

(F401)


7-7: assets.zstandard.main.zstandard_bench imported but unused

Remove unused import: assets.zstandard.main.zstandard_bench

(F401)


9-9: assets.presto_clp.main.presto_clp_bench imported but unused

Remove unused import: assets.presto_clp.main.presto_clp_bench

(F401)


10-10: assets.overhead_test.main.overhead_test_bench imported but unused

Remove unused import: assets.overhead_test.main.overhead_test_bench

(F401)


11-11: assets.gzip.main.gzip_bench imported but unused

Remove unused import: assets.gzip.main.gzip_bench

(F401)


12-12: src.jsonsync.JsonItem imported but unused

Remove unused import: src.jsonsync.JsonItem

(F401)


56-56: Missing return type annotation for public function run

Add return type annotation: None

(ANN201)


56-56: Missing type annotation for function argument bencher

(ANN001)


56-56: Missing type annotation for function argument kwargs

(ANN001)


56-56: Missing type annotation for function argument bench_target

(ANN001)


56-56: Boolean default positional argument in function definition

(FBT002)


56-56: Missing type annotation for function argument attach

(ANN001)


56-56: Boolean default positional argument in function definition

(FBT002)


56-56: Missing type annotation for function argument attach_on_error

(ANN001)


70-70: Do not catch blind exception: Exception

(BLE001)


71-71: datetime.datetime.now() called without a tz argument

(DTZ005)


71-71: Use explicit conversion flag

Replace with conversion flag

(RUF010)


85-86: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

🔇 Additional comments (20)
assets/presto_clp/main.py (1)

19-19: LGTM: Class naming standardization

The class renaming from clp_presto_bench to presto_clp_bench improves consistency with the directory structure and aligns with similar standardization efforts across other benchmark modules.

Also applies to: 105-105

assets/clickhouse/main.py (1)

16-16: LGTM: Simplified class naming

The class renaming from clickhouse_native_json_bench to clickhouse_bench simplifies the naming while maintaining clarity, consistent with the broader naming standardization effort.

Also applies to: 157-157

assets/gzip/Dockerfile (2)

5-5: Good: Migration to official base image

Moving from the custom base image to the official ubuntu:jammy image improves maintainability and reduces dependencies on custom infrastructure.


9-9: Intentional use of --allow-downgrades flag

The --allow-downgrades flag is intentionally used for benchmarking consistency, as noted in previous feedback. This ensures benchmarks can run on specified package versions even when Ubuntu has newer versions available.

assets/elasticsearch/Dockerfile (2)

1-1: Good: Consistent base image migration

Moving to the official ubuntu:jammy base image aligns with similar changes across other Dockerfiles and improves maintainability.


11-12: LGTM: Elasticsearch version upgrade to 9.0.3

The coordinated upgrade of the Elasticsearch APT repository (9.x), package version (9.0.3), and Python client (9.0.3) properly implements the main objective of this PR. Version pinning ensures reproducible builds.

Also applies to: 16-16, 24-24

assets/elasticsearch/ingest.py (4)

13-19: Good: Utility function for nested value extraction

The pop_by_path function cleanly handles extraction of nested JSON values using dot notation, which supports the timestamp extraction feature.


67-72: LGTM: Robust timestamp extraction

The timestamp extraction logic properly handles cases where the specified path doesn't exist by falling back to a default value of 0. This supports the new logs data stream functionality.
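A minimal sketch of that fallback, reusing the pop_by_path helper from ingest.py (the function and variable names here are illustrative, not quoted from the diff):

def extract_timestamp(line: dict, timestamp_path: str):
    # Hedged sketch of the fallback described above; ingest.py may differ in detail.
    try:
        return pop_by_path(line, timestamp_path)  # e.g. timestamp_path = "attrs.timestamp"
    except KeyError:
        return 0  # default when the path is absent

doc = {"msg": "x"}
doc["@timestamp"] = extract_timestamp(doc, "attrs.timestamp")  # -> 0, path absent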


84-103: Excellent: Index template for logs data stream

The index template configuration with "mode": "logsdb" and proper @timestamp field mapping aligns perfectly with Elasticsearch 9.x best practices for log data ingestion. The flexible date format handling and priority setting ensure proper template application.


76-76: Good: Explicit operation type specification

Setting _op_type to "create" ensures documents are only created and not updated, which is appropriate for log ingestion scenarios.
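For context, a hedged sketch of how such bulk actions are commonly shaped with the Python Elasticsearch client (the index name and documents below are illustrative assumptions):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def create_actions(docs, index_name):
    for doc in docs:
        yield {
            "_op_type": "create",  # create-only: never overwrite an existing document
            "_index": index_name,
            "_source": doc,
        }

# Auto-generated IDs make "create" safe for append-only log ingestion.
bulk(es, create_actions([{"@timestamp": 0, "msg": "example"}], "logs-example"))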

assets/elasticsearch/methodology.md (2)

3-3: LGTM! Version update aligns with PR objectives.

The Elasticsearch version upgrade from 8.6.2 to 9.0.3 is consistent with the PR's goal to modernize the tooling.


19-20: Good addition! Clarifies the ingestion approach.

The note about using logs data stream provides valuable context for users understanding the optimized timestamped log handling approach.

assets/gzip/main.py (3)

40-40: LGTM! Shell redirection requires shell=True.

The addition of shell=True is necessary for the shell redirection (>) to work properly with the docker_execute method.


48-48: LGTM! Consistent with shell redirection requirements.

The shell=True parameter is correctly added for the shell command with redirection.


56-56: LGTM! Proper shell command execution.

The shell=True parameter ensures the gunzip command with redirection executes correctly within the Docker container.
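As a hedged illustration of the difference (the command and paths below are made up; docker_execute itself is defined in src/template.py):

# With shell=True the string runs through a shell inside the container,
# so '>' performs redirection:
self.docker_execute("gunzip -c /data/logs.json.gz > /data/logs.json", shell=True)
# Without shell=True, '>' and the target path would be passed to gunzip
# as literal arguments and no redirection would occur.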

assets/elasticsearch/main.py (1)

39-39: Action Required: Confirm timestamp in dataset metadata

The ingestion pipeline now passes dataset_meta['timestamp'] to ingest.py (which defaults missing values to 0). Before merging, please verify that:

  • Every dataset definition loaded by src/template sets a valid timestamp key.
  • Your metadata files (e.g. JSON under your metadata directory) contain the timestamp field at the correct JSON path.

You can run, for example:

fd -e json path/to/your/metadata | xargs -I {} jq 'if has("timestamp") then "✓ {}" else "✗ {}" end' {}
scripts/benchall.py (4)

37-54: LGTM! Focused testing approach for Elasticsearch upgrade.

Commenting out other benchmarks to focus on Elasticsearch testing is appropriate for this PR's scope.


56-56: LGTM! Improved error handling with attach_on_error parameter.

The addition of attach_on_error parameter provides better debugging capabilities when benchmarks fail.


75-75: LGTM! Enhanced exception handling logic.

The updated condition if attach or attach_on_error properly implements the new debugging feature.


85-89: LGTM! Useful dataset filtering capability.

The command-line dataset filtering and attach_on_error=True usage improve the script's flexibility and debugging capabilities.

@@ -1,2 +1 @@
-FROM clickhouse/clickhouse-server:latest
-
+FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f


🧹 Nitpick (assertive)

Good pinning; add HEALTHCHECK & non-root user

Pinning the ClickHouse image by digest is 👍. Consider adding:

  1. A non-root user to follow container hardening guidelines.
  2. A simple HEALTHCHECK so orchestration tools can detect crash-loops.

Example:

 FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f
+
+# Create non-root user
+RUN useradd --system --uid 1001 clickhouse
+USER clickhouse
+
+# Basic liveness probe
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
+  clickhouse-client --query "SELECT 1" || exit 1
📝 Committable suggestion


Suggested change
FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f

# Create non-root user
RUN useradd --system --uid 1001 clickhouse
USER clickhouse

# Basic liveness probe
HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
  clickhouse-client --query "SELECT 1" || exit 1
🧰 Tools
🪛 Checkov (3.2.334)

[LOW] 1-1: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-1: Ensure that a user for the container has been created

(CKV_DOCKER_3)

🤖 Prompt for AI Agents
In assets/clickhouse/Dockerfile at line 1, the base image is pinned by digest
which is good, but the Dockerfile lacks a HEALTHCHECK instruction and does not
switch to a non-root user. To fix this, add a HEALTHCHECK command that
periodically verifies the ClickHouse server is responsive, and create or switch
to a non-root user before running the server to improve container security and
comply with best practices.


**Clearing Caches:**
Before starting any queries we ask ClickHouse to drop its UNCOMPRESSED CACHE and MARK CACHE. We also flush the file system buffers and clear the filesystem caches.
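As a hedged sketch, the steps described above typically amount to something like the following (the repository's actual commands may differ):

self.docker_execute('clickhouse-client --query "SYSTEM DROP UNCOMPRESSED CACHE"')
self.docker_execute('clickhouse-client --query "SYSTEM DROP MARK CACHE"')
self.docker_execute("sync")  # flush file system buffers
self.docker_execute("bash -c 'echo 3 > /proc/sys/vm/drop_caches'")  # drop the page cache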


🧹 Nitpick (assertive)

Minor typo: possessive “ClickHouse’s” vs plural

“ClickHouse’s provided docker images” → “ClickHouse-provided Docker images” for clarity.
Very small nit—feel free to ignore.

🤖 Prompt for AI Agents
In assets/clickhouse/methodology.md at line 70, change the phrase "ClickHouse’s
provided docker images" to "ClickHouse-provided Docker images" to correct the
possessive typo and improve clarity. Also capitalize "Docker" as it is a proper
noun.

 CLP_OUT_PATH = f"{WORK_DIR}/archives"
 CLP_S_BINARY = "/clp/clp-s"
-class clp_s_bench(Benchmark):
+class clp_bench(Benchmark):


🧹 Nitpick (assertive)

Rename class to PascalCase to respect PEP-8

PEP-8 prescribes CapWords for class names. Renaming clp_bench → ClpBench will align this file with the rest of the codebase and with other renamed benchmark classes.

-class clp_bench(Benchmark):
+class ClpBench(Benchmark):

You will also need to adjust the instantiation on Line 64 and any external imports.

📝 Committable suggestion


Suggested change
-class clp_bench(Benchmark):
+class ClpBench(Benchmark):
🤖 Prompt for AI Agents
In assets/clp/main.py at line 8, rename the class from clp_bench to ClpBench to
follow PEP-8 PascalCase naming conventions for classes. Also update any
instantiations of this class around line 64 and adjust any external imports or
references to use the new class name ClpBench.


 def main():
-    bench = clp_s_bench(sys.argv[1])
+    bench = clp_bench(sys.argv[1])


🧹 Nitpick (assertive)

Instantiation must match updated class name

If you adopt the PascalCase rename, remember to update the constructor here:

-    bench = clp_bench(sys.argv[1])
+    bench = ClpBench(sys.argv[1])
📝 Committable suggestion


Suggested change
-bench = clp_bench(sys.argv[1])
+bench = ClpBench(sys.argv[1])
🤖 Prompt for AI Agents
In assets/clp/main.py at line 64, the instantiation of the class uses the old
class name clp_bench. Update this line to use the new PascalCase class name by
replacing clp_bench with ClpBench to match the renamed class constructor.

Comment on lines +13 to +19
def pop_by_path(obj, path):
    keys = path.split('.')
    for key in keys[:-1]:
        obj = obj[key]
    out = obj[keys[-1]]
    del obj[keys[-1]]
    return out


🧹 Nitpick (assertive)

Optional: Add type annotations for better code documentation

Consider adding type annotations to improve code readability and enable better IDE support.

-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> any:
📝 Committable suggestion


Suggested change
-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> any:
     keys = path.split('.')
     for key in keys[:-1]:
         obj = obj[key]
     out = obj[keys[-1]]
     del obj[keys[-1]]
     return out
🧰 Tools
🪛 Ruff (0.12.2)

13-13: Missing return type annotation for public function pop_by_path

(ANN201)


13-13: Missing type annotation for function argument obj

(ANN001)


13-13: Missing type annotation for function argument path

(ANN001)

🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 13 to 19, the function
pop_by_path lacks type annotations. Add appropriate type hints for the function
parameters and return type, such as specifying that obj is a dictionary with
string keys and any type of values, path is a string, and the return type is
any, to improve code readability and IDE support.

@@ -1,5 +1,5 @@
 discovery.uri=http://127.0.0.1:8080
-presto.version=0.293-SNAPSHOT-faae543
+presto.version=0.293-89ce0f3


🧹 Nitpick (assertive)

Version string now diverges from directory naming used elsewhere

presto.version embeds the commit hash (0.293-89ce0f3), while the paths hard-coded in main.py (Lines 66 & 72) refer to presto-server-0.293 and presto-cli-0.293-executable.jar. This mismatch is harmless at runtime but will cause confusion when the next bump happens.

Recommend either:

  1. Derive the runtime paths from this property, or
  2. Amend the property to exactly match the artefact directory name.

This keeps observability tools and grep searches consistent.

🤖 Prompt for AI Agents
In assets/presto_parquet/include/etc_worker/config.properties at line 2, the
presto.version property includes a commit hash making it differ from the
directory names used in main.py lines 66 and 72. To fix this, either update
presto.version to only include the version number matching the directory names
(e.g., "0.293") or modify main.py to derive the directory paths dynamically from
the full presto.version property to ensure consistency and avoid confusion in
observability and searches.

Comment on lines +66 to 70
 self.docker_execute("bash -c \"python3 /home/presto/presto-server/target/presto-server-0.293/bin/launcher.py run --etc-dir=/home/include/etc_coordinator\"", background=True)
 self.wait_for_port(8080)
-self.docker_execute("/home/presto/presto-native-execution/build/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
+self.docker_execute("/home/presto/presto-native-execution/_build/release/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
 self.wait_for_port(7777)
 time.sleep(60) # this needs to be more than 10


🛠️ Refactor suggestion

Hard-coded versioned paths make future bumps brittle

/home/presto/presto-server/target/presto-server-0.293/... is embedded directly in the command. The same string occurs in multiple places. Every future upgrade will require hunting for these literals.

-"/home/presto/presto-server/target/presto-server-0.293/bin/launcher.py"
+f"/home/presto/presto-server/target/presto-server-{self.properties['presto_version']}/bin/launcher.py"

Populate presto_version once (e.g., from etc_worker/config.properties) and reuse it.
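A minimal sketch of that approach (the helper name is hypothetical; the config path is taken from the etc_worker directory referenced above):

def read_presto_version(path="/home/include/etc_worker/config.properties"):
    with open(path) as f:
        for line in f:
            if line.startswith("presto.version="):
                return line.split("=", 1)[1].strip()
    raise ValueError(f"presto.version not found in {path}")

presto_version = read_presto_version()          # e.g. "0.293-89ce0f3"
base_version = presto_version.split("-", 1)[0]  # "0.293", matching the artefact directories
launcher = f"/home/presto/presto-server/target/presto-server-{base_version}/bin/launcher.py"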

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 66 to 70, the presto server
version "0.293" is hard-coded in multiple command strings, making future
upgrades error-prone. To fix this, read the presto_version once from
etc_worker/config.properties or a similar config file, store it in a variable,
and then use that variable to construct the versioned paths dynamically in the
docker_execute commands. This eliminates repeated literals and simplifies
version bumps.

Comment on lines +68 to 70
self.docker_execute("/home/presto/presto-native-execution/_build/release/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
self.wait_for_port(7777)
time.sleep(60) # this needs to be more than 10


🧹 Nitpick (assertive)

Repeated polling sleep hides readiness issues

A fixed time.sleep(60) after launching the native engine masks startup failures and slows fast boots. Prefer an explicit health probe loop (e.g., curl v1/info or check the TCP port) with a timeout.

This will surface crashes early and shave minutes off CI when the service starts quickly.
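A minimal sketch of such a probe, assuming the worker's HTTP endpoint mentioned above (URL and timeouts are illustrative):

import time
import urllib.request

def wait_until_ready(url="http://127.0.0.1:7777/v1/info", timeout=120, interval=2):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return  # service is up; no need to wait the full minute
        except OSError:
            pass  # not up yet: connection refused, reset, or timed out
        time.sleep(interval)
    raise TimeoutError(f"service at {url} not ready within {timeout}s")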

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 68 to 70, replace the fixed
time.sleep(60) after starting the native engine with a loop that explicitly
checks service readiness by polling a health endpoint like v1/info or verifying
the TCP port is open, with a timeout to avoid indefinite waiting. This change
will detect startup failures promptly and reduce unnecessary delays during fast
startups.

Comment on lines 72 to 80
 def hive_execute(self, query, check=True):
-    return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)
+    return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)

 def ingest(self):
     """
     Ingests the dataset at self.datasets_path
     """
-    self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')
+    self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')



🧹 Nitpick (assertive)

Add type annotations + avoid positional boolean default

Static analysis (Ruff ANN/F-BT) flags hive_execute/ingest. Low-cost improvement:

-    def hive_execute(self, query, check=True):
+    def hive_execute(self, query: str, check: bool = True) -> str:
...
-    def ingest(self):
+    def ingest(self) -> None:

Improves IDE autocomplete and enforces the public API.

📝 Committable suggestion


Suggested change
def hive_execute(self, query: str, check: bool = True) -> str:
    return self.docker_execute(
        f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
        f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"',
        check,
    )

def ingest(self) -> None:
    """
    Ingests the dataset at self.datasets_path
    """
    self.docker_execute(
        f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
        f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute '
        f'"CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"',
    )
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Missing return type annotation for public function hive_execute

(ANN201)


72-72: Missing type annotation for function argument query

(ANN001)


72-72: Boolean default positional argument in function definition

(FBT002)


72-72: Missing type annotation for function argument check

(ANN001)


75-75: Missing return type annotation for public function ingest

Add return type annotation: None

(ANN201)

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 72 to 80, the methods hive_execute
and ingest lack type annotations and use a positional boolean default parameter.
Add explicit type annotations for the method parameters and return types, and
change the boolean parameter in hive_execute to a keyword-only argument with a
default value to avoid positional boolean defaults. This will improve static
analysis, IDE autocomplete, and clarify the public API.

Comment on lines +3 to 11
 from assets.clp.main import clp_bench
 from assets.clickhouse.main import clickhouse_bench
 from assets.sparksql.main import sparksql_bench
-from assets.parquet.main import parquet_bench
+from assets.presto_parquet.main import parquet_bench
 from assets.zstandard.main import zstandard_bench
 from assets.elasticsearch.main import elasticsearch_bench
-from assets.clp_presto.main import clp_presto_bench
+from assets.presto_clp.main import presto_clp_bench
 from assets.overhead_test.main import overhead_test_bench
 from assets.gzip.main import gzip_bench


🧹 Nitpick (assertive)

Clean up unused imports.

Most benchmark imports are unused since the corresponding benchmarks are commented out. Consider removing unused imports to keep the code clean.

-from assets.clp.main import clp_bench
-from assets.clickhouse.main import clickhouse_bench
-from assets.sparksql.main import sparksql_bench
-from assets.presto_parquet.main import parquet_bench
-from assets.zstandard.main import zstandard_bench
 from assets.elasticsearch.main import elasticsearch_bench
-from assets.presto_clp.main import presto_clp_bench
-from assets.overhead_test.main import overhead_test_bench
-from assets.gzip.main import gzip_bench
-from src.jsonsync import JsonItem

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.12.2)

3-3: assets.clp.main.clp_bench imported but unused

Remove unused import: assets.clp.main.clp_bench

(F401)


4-4: assets.clickhouse.main.clickhouse_bench imported but unused

Remove unused import: assets.clickhouse.main.clickhouse_bench

(F401)


5-5: assets.sparksql.main.sparksql_bench imported but unused

Remove unused import: assets.sparksql.main.sparksql_bench

(F401)


6-6: assets.presto_parquet.main.parquet_bench imported but unused

Remove unused import: assets.presto_parquet.main.parquet_bench

(F401)


7-7: assets.zstandard.main.zstandard_bench imported but unused

Remove unused import: assets.zstandard.main.zstandard_bench

(F401)


9-9: assets.presto_clp.main.presto_clp_bench imported but unused

Remove unused import: assets.presto_clp.main.presto_clp_bench

(F401)


10-10: assets.overhead_test.main.overhead_test_bench imported but unused

Remove unused import: assets.overhead_test.main.overhead_test_bench

(F401)


11-11: assets.gzip.main.gzip_bench imported but unused

Remove unused import: assets.gzip.main.gzip_bench

(F401)

🤖 Prompt for AI Agents
In scripts/benchall.py between lines 3 and 11, several benchmark imports are
unused because their corresponding benchmark calls are commented out. Identify
which imported benchmark functions are not used anywhere in the file and remove
those import lines to clean up the code and avoid unnecessary dependencies.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (2)
assets/elasticsearch/ingest.py (1)

13-19: Fix error handling for missing intermediate keys.

The function will raise KeyError if any intermediate key doesn't exist, but the caller expects this only for the final key. Consider adding proper error handling for missing intermediate keys.

As noted in the previous review, this function still lacks type annotations.

-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> any:
     keys = path.split('.')
-    for key in keys[:-1]:
-        obj = obj[key]
+    for key in keys[:-1]:
+        if key not in obj:
+            raise KeyError(f"Key '{key}' not found in path '{path}'")
+        obj = obj[key]
+    if keys[-1] not in obj:
+        raise KeyError(f"Key '{keys[-1]}' not found in path '{path}'")
     out = obj[keys[-1]]
     del obj[keys[-1]]
     return out
scripts/benchall.py (1)

3-11: Clean up unused imports.

As noted in the previous review, most benchmark imports are unused since the corresponding benchmarks are commented out in the benchmarks list. The issue is now more pronounced with only elasticsearch benchmarks enabled.

Consider removing unused imports to keep the code clean and avoid unnecessary dependencies.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03772f1 and 5802501.

📒 Files selected for processing (3)
  • assets/elasticsearch/ingest.py (2 hunks)
  • assets/elasticsearch/main.py (2 hunks)
  • scripts/benchall.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
scripts/benchall.py (5)
assets/elasticsearch/main.py (2)
  • main (88-90)
  • elasticsearch_bench (14-85)
assets/clp/main.py (2)
  • main (63-65)
  • clp_bench (8-60)
assets/presto_clp/main.py (2)
  • main (104-106)
  • presto_clp_bench (19-101)
assets/presto_parquet/main.py (1)
  • parquet_bench (18-145)
src/jsonsync.py (1)
  • JsonItem (5-101)
assets/elasticsearch/main.py (1)
src/template.py (1)
  • docker_execute (240-272)
🪛 Ruff (0.12.2)
scripts/benchall.py

3-3: assets.clp.main.clp_bench imported but unused

Remove unused import: assets.clp.main.clp_bench

(F401)


4-4: assets.clickhouse.main.clickhouse_bench imported but unused

Remove unused import: assets.clickhouse.main.clickhouse_bench

(F401)


5-5: assets.sparksql.main.sparksql_bench imported but unused

Remove unused import: assets.sparksql.main.sparksql_bench

(F401)


6-6: assets.presto_parquet.main.parquet_bench imported but unused

Remove unused import: assets.presto_parquet.main.parquet_bench

(F401)


7-7: assets.zstandard.main.zstandard_bench imported but unused

Remove unused import: assets.zstandard.main.zstandard_bench

(F401)


9-9: assets.presto_clp.main.presto_clp_bench imported but unused

Remove unused import: assets.presto_clp.main.presto_clp_bench

(F401)


10-10: assets.overhead_test.main.overhead_test_bench imported but unused

Remove unused import: assets.overhead_test.main.overhead_test_bench

(F401)


11-11: assets.gzip.main.gzip_bench imported but unused

Remove unused import: assets.gzip.main.gzip_bench

(F401)


12-12: src.jsonsync.JsonItem imported but unused

Remove unused import: src.jsonsync.JsonItem

(F401)


57-57: Missing return type annotation for public function run

Add return type annotation: None

(ANN201)


57-57: Missing type annotation for function argument bencher

(ANN001)


57-57: Missing type annotation for function argument kwargs

(ANN001)


57-57: Missing type annotation for function argument bench_target

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach_on_error

(ANN001)


71-71: Do not catch blind exception: Exception

(BLE001)


72-72: datetime.datetime.now() called without a tz argument

(DTZ005)


72-72: Use explicit conversion flag

Replace with conversion flag

(RUF010)


86-87: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

assets/elasticsearch/ingest.py

13-13: Missing return type annotation for public function pop_by_path

(ANN201)


13-13: Missing type annotation for function argument obj

(ANN001)


13-13: Missing type annotation for function argument path

(ANN001)


81-81: Missing return type annotation for public function ingest_dataset

Add return type annotation: None

(ANN201)


90-90: Trailing comma missing

Add trailing comma

(COM812)


97-97: Trailing comma missing

Add trailing comma

(COM812)


98-98: Trailing comma missing

Add trailing comma

(COM812)


99-99: Trailing comma missing

Add trailing comma

(COM812)


100-100: Trailing comma missing

Add trailing comma

(COM812)


102-102: Trailing comma missing

Add trailing comma

(COM812)

assets/elasticsearch/main.py

16-16: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


16-16: Missing type annotation for function argument dataset

(ANN001)


16-16: Boolean default positional argument in function definition

(FBT002)


16-16: Missing type annotation for function argument logsdb

(ANN001)


43-46: Use ternary operator logsdb = "anything" if self.logsdb else "no_logsdb" instead of if-else-block

Replace if-else-block with logsdb = "anything" if self.logsdb else "no_logsdb"

(SIM108)


49-49: Trailing comma missing

Add trailing comma

(COM812)

🔇 Additional comments (3)
assets/elasticsearch/ingest.py (2)

67-72: LGTM!

The timestamp extraction logic is well-implemented with appropriate fallback handling for missing timestamp fields.


76-76: LGTM!

Adding _op_type: "create" is a good practice for data ingestion to prevent accidental overwrites.

scripts/benchall.py (1)

37-55: LGTM!

The benchmark selection appropriately focuses on testing the new elasticsearch logsdb functionality with both True and False values, which aligns with the PR objectives to improve Elasticsearch benchmarking.
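A hedged sketch of what that selection could look like (the exact shape of the benchmarks list in scripts/benchall.py is assumed; only the logsdb flag comes from the diff):

benchmarks = [
    {"bencher": elasticsearch_bench, "kwargs": {"logsdb": True}},
    {"bencher": elasticsearch_bench, "kwargs": {"logsdb": False}},  # "no logsdb" baseline
]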

Comment on lines +84 to +104
if sys.argv[3] != "no_logsdb":
    template_body = {
        "index_patterns": [collection_name],
        "template": {
            "settings": {
                "index": {
                    "mode": "logsdb"
                },
            },
            "mappings": {
                "properties": {
                    "@timestamp": {
                        "type": "date",
                        "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
                    }
                }
            }
        },
        "priority": 101
    }
    es.indices.put_index_template(name=collection_name, body=template_body)


🧹 Nitpick (assertive)

LGTM! Consider minor formatting improvements.

The index template logic is well-implemented with comprehensive date format support for the @timestamp field. The conditional creation based on the logsdb flag is appropriate.

Consider adding trailing commas for better diff readability:

                 "settings": {
                     "index": {
-                        "mode": "logsdb"
+                        "mode": "logsdb",
                     },
                 },
                 "mappings": {
                     "properties": {
                         "@timestamp": {
                             "type": "date",
-                            "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
-                            }
+                            "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz",
+                            },
                     }
-                }
+                },
             },
-            "priority": 101
+            "priority": 101,
📝 Committable suggestion


Suggested change
if sys.argv[3] != "no_logsdb":
    template_body = {
        "index_patterns": [collection_name],
        "template": {
            "settings": {
                "index": {
                    "mode": "logsdb",
                },
            },
            "mappings": {
                "properties": {
                    "@timestamp": {
                        "type": "date",
                        "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz",
                    },
                },
            },
        },
        "priority": 101,
    }
    es.indices.put_index_template(name=collection_name, body=template_body)
🧰 Tools
🪛 Ruff (0.12.2)

90-90: Trailing comma missing

Add trailing comma

(COM812)


97-97: Trailing comma missing

Add trailing comma

(COM812)


98-98: Trailing comma missing

Add trailing comma

(COM812)


99-99: Trailing comma missing

Add trailing comma

(COM812)


100-100: Trailing comma missing

Add trailing comma

(COM812)


102-102: Trailing comma missing

Add trailing comma

(COM812)

🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 84 to 104, add trailing commas
after the last elements in dictionaries and lists within the template_body
definition. This includes after the "mode": "logsdb" setting, the "@timestamp"
mapping dictionary, and the overall dictionary entries. Adding these trailing
commas improves diff readability and consistency in formatting.

Comment on lines +16 to +22
def __init__(self, dataset, logsdb=True):
    super().__init__(dataset)

    self.logsdb = logsdb

    if not logsdb:
        self.properties["notes"] = "no logsdb"


🧹 Nitpick (assertive)

LGTM! Consider minor improvements for code quality.

The logsdb parameter addition is well-implemented with appropriate default value and property tracking.

Consider addressing the static analysis suggestions:

-    def __init__(self, dataset, logsdb=True):
+    def __init__(self, dataset: str, *, logsdb: bool = True) -> None:

Using keyword-only argument (*) prevents the boolean parameter pitfall flagged by static analysis.

📝 Committable suggestion


Suggested change
-def __init__(self, dataset, logsdb=True):
+def __init__(self, dataset: str, *, logsdb: bool = True) -> None:
     super().__init__(dataset)

     self.logsdb = logsdb

     if not logsdb:
         self.properties["notes"] = "no logsdb"
🧰 Tools
🪛 Ruff (0.12.2)

16-16: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


16-16: Missing type annotation for function argument dataset

(ANN001)


16-16: Boolean default positional argument in function definition

(FBT002)


16-16: Missing type annotation for function argument logsdb

(ANN001)

🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 16 to 22, the logsdb parameter
should be made keyword-only to avoid boolean parameter pitfalls flagged by
static analysis. Modify the __init__ method signature to include a * before
logsdb, making logsdb a keyword-only argument. This improves code clarity and
prevents accidental positional argument misuse.

Comment on lines +43 to 50
 if self.logsdb:
     logsdb = "anything"
 else:
     logsdb = "no_logsdb"

 self.docker_execute([
-    f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path}"
+    f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
 ])


🧹 Nitpick (assertive)

LGTM! Consider simplifying the conditional logic.

The updated ingest method correctly passes the required arguments to the ingestion script, including the new logsdb flag and timestamp path.

Consider simplifying the conditional logic and adding trailing comma:

-        if self.logsdb:
-            logsdb = "anything"
-        else:
-            logsdb = "no_logsdb"
+        logsdb = "anything" if self.logsdb else "no_logsdb"

         self.docker_execute([
-            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
+            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
             ])
📝 Committable suggestion


Suggested change
logsdb = "anything" if self.logsdb else "no_logsdb"

self.docker_execute([
    f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
])
🧰 Tools
🪛 Ruff (0.12.2)

43-46: Use ternary operator logsdb = "anything" if self.logsdb else "no_logsdb" instead of if-else-block

Replace if-else-block with logsdb = "anything" if self.logsdb else "no_logsdb"

(SIM108)


49-49: Trailing comma missing

Add trailing comma

(COM812)

🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 43 to 50, simplify the conditional
assignment of logsdb by using a single line conditional expression instead of an
if-else block. Also, add a trailing comma to the list passed to
self.docker_execute for consistency and style.

 ]

-def run(bencher, kwargs, bench_target, attach=False):
+def run(bencher, kwargs, bench_target, attach=False, attach_on_error=False):


🧹 Nitpick (assertive)

LGTM! The attach_on_error parameter enhances debugging capabilities.

The new parameter provides useful flexibility for debugging benchmark failures without requiring full attachment mode.

Consider addressing the static analysis suggestions for boolean parameters and type annotations to improve code quality.
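A hedged sketch of the resulting control flow (the driver and attach method names are hypothetical; only the signature matches the diff):

def run(bencher, kwargs, bench_target, attach=False, attach_on_error=False):
    bench = bencher(bench_target, **kwargs)
    try:
        bench.run_everything()  # hypothetical driver call
    except Exception as e:
        print(f"benchmark failed: {e}")
        if attach or attach_on_error:
            bench.attach()  # hypothetical: drop into the container for debugging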

🧰 Tools
🪛 Ruff (0.12.2)

57-57: Missing return type annotation for public function run

Add return type annotation: None

(ANN201)


57-57: Missing type annotation for function argument bencher

(ANN001)


57-57: Missing type annotation for function argument kwargs

(ANN001)


57-57: Missing type annotation for function argument bench_target

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach_on_error

(ANN001)

🤖 Prompt for AI Agents
In scripts/benchall.py at line 57, the run function is missing type annotations
for its boolean parameters. Add type annotations for the attach and
attach_on_error parameters, specifying them as bool, and ensure all other
parameters are also properly annotated for clarity and static analysis
compliance.

Comment on lines +86 to +88
if len(sys.argv) > 1:
    if dataset_name != sys.argv[1].strip():
        continue


🧹 Nitpick (assertive)

LGTM! Consider simplifying the conditional logic.

The dataset filtering capability is a useful addition for targeted benchmark execution.

Consider combining the nested if statements as suggested by static analysis:

-        if len(sys.argv) > 1:
-            if dataset_name != sys.argv[1].strip():
-                continue
+        if len(sys.argv) > 1 and dataset_name != sys.argv[1].strip():
+            continue
📝 Committable suggestion


Suggested change
-if len(sys.argv) > 1:
-    if dataset_name != sys.argv[1].strip():
-        continue
+if len(sys.argv) > 1 and dataset_name != sys.argv[1].strip():
+    continue
🧰 Tools
🪛 Ruff (0.12.2)

86-87: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

🤖 Prompt for AI Agents
In scripts/benchall.py at lines 86 to 88, the nested if statements checking
sys.argv length and comparing dataset_name can be combined into a single
conditional statement; refactor these lines to use a single if statement that
performs both checks together for simplicity.

Contributor

@Bill-hbrhbr Bill-hbrhbr left a comment


Need to break up the PR

@PacificViking
Contributor Author

Need to break up the PR

Sorry, this PR was not cherry-picked correctly.

The "lock clickhouse version" commit and the older commits are already merged.

"improve benchall.py only running on one dataset" is an unrelated, small change.

The elasticsearch changes and the later commits are the actual content of this PR.

I don't know how to fix this issue without redoing the cherry pick and force pushing, but I'd like confirmation before I do this.

@Bill-hbrhbr
Contributor

Need to break up the PR

Sorry, this PR was not cherry-picked correctly.

The "lock clickhouse version" commit and the older commits are already merged.

"improve benchall.py only running on one dataset" is an unrelated, small change.

The elasticsearch changes and the later commits are the actual content of this PR.

I don't know how to fix this issue without redoing the cherry pick and force pushing, but I'd like confirmation before I do this.

Hey John @PacificViking, I'm marking this PR as changes requested because we cannot merge it immediately. Now I see where this PR is coming from, but some parts of it still require more refactoring.

I can take it from here. If there's anything I need to ask you, I will DM you. Thanks again for the explanations.
