
Conversation

PacificViking
Contributor

@PacificViking PacificViking commented Aug 4, 2025

Description

Upgrades the outdated Elasticsearch version and adopts the logsdb methodology, as mentioned in #7.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

New Elasticsearch benchmark runs.

Summary by CodeRabbit

New Features

  • Added support for extracting and mapping timestamp fields during Elasticsearch ingestion.
  • Enhanced error handling in benchmark orchestration with optional attach-on-error functionality.
  • Added command-line filtering to run benchmarks on specific datasets.

Bug Fixes

  • Ensured correct handling of shell commands with output redirection in Gzip benchmarking.

Refactor

  • Updated class names for consistency across benchmarking modules.
  • Updated Presto and ClickHouse references to specific versions and improved Docker image reproducibility.
  • Modified paths and configuration to match updated Presto and Elasticsearch versions.
  • Upgraded Elasticsearch to version 9.0.3 and switched to official base images for multiple components.
  • Improved Dockerfiles by standardizing base images and package versions.

Documentation

  • Updated methodology documents to reflect new software versions and features.

@PacificViking PacificViking requested a review from a team as a code owner August 4, 2025 06:52

coderabbitai bot commented Aug 4, 2025

Walkthrough

This update revises Dockerfiles to use official or version-pinned base images, upgrades ClickHouse and Elasticsearch versions, and updates corresponding documentation. Several benchmark class names are renamed for consistency. The Elasticsearch ingestion pipeline is enhanced to support timestamp extraction and index template application. The benchmark runner script is refactored for improved error handling and dataset selection.

Changes

Cohort / File(s) Change Summary
ClickHouse Version Pinning & Naming
assets/clickhouse/Dockerfile, assets/clickhouse/main.py, assets/clickhouse/methodology.md
Dockerfile now pins ClickHouse to a specific version (25.6) by digest. The methodology doc and code references are updated to match. Benchmark class renamed from clickhouse_native_json_bench to clickhouse_bench.
CLP Benchmark Naming
assets/clp/main.py
Benchmark class renamed from clp_s_bench to clp_bench.
Elasticsearch Upgrade & Ingestion Enhancements
assets/elasticsearch/Dockerfile, assets/elasticsearch/ingest.py, assets/elasticsearch/main.py, assets/elasticsearch/methodology.md
Elasticsearch upgraded from 8.6.2 to 9.0.3; Dockerfile uses official Ubuntu base. Ingestion script adds timestamp extraction via path, index template creation, and _op_type field. Methodology doc updated for version and logs data stream note. Ingestion method now passes timestamp path and logsdb flag.
Gzip Dockerfile & Shell Execution
assets/gzip/Dockerfile, assets/gzip/main.py
Dockerfile uses official Ubuntu base and allows downgrades. All shell commands in main.py now explicitly use shell=True for proper redirection.
Overhead Test & Template Dockerfiles
assets/overhead_test/Dockerfile, assets/template/Dockerfile
Both Dockerfiles now use the official Ubuntu base image instead of a custom one.
Presto CLP Benchmark Naming
assets/presto_clp/main.py
Benchmark class renamed from clp_presto_bench to presto_clp_bench.
Presto Parquet Versioning and Paths
assets/presto_parquet/Dockerfile, assets/presto_parquet/include/etc_worker/config.properties, assets/presto_parquet/main.py
Dockerfile uses a specific Presto version and commit hashes, updates build process and dependencies. Config and code paths updated to match new Presto version/build layout.
Benchmark Runner Refactor
scripts/benchall.py
Benchmark imports updated for new class names. Only the Elasticsearch benchmark is enabled. Adds dataset name filtering, improved error handling with attach_on_error, and updates the run function signature.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant benchall.py
    participant BenchmarkClass
    participant DockerContainer

    User->>benchall.py: Run script with optional dataset argument
    benchall.py->>BenchmarkClass: Initialize benchmark (e.g., elasticsearch_bench)
    loop For each dataset
        benchall.py->>BenchmarkClass: run(bench_target, attach, attach_on_error)
        BenchmarkClass->>DockerContainer: Execute ingestion/search commands
        alt Error occurs
            benchall.py->>DockerContainer: Attach if attach_on_error is True
        end
    end
sequenceDiagram
    participant ingest.py
    participant Elasticsearch
    participant Dataset

    ingest.py->>Dataset: Read each JSON line
    ingest.py->>ingest.py: pop_by_path(line, timestamp_path)
    alt Timestamp found
        ingest.py->>ingest.py: Set @timestamp to extracted value
    else Not found
        ingest.py->>ingest.py: Set @timestamp to 0
    end
    ingest.py->>Elasticsearch: Create index template (logsdb, mapping, priority)
    ingest.py->>Elasticsearch: Bulk ingest with _op_type="create"

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • kirkrodrigues



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

🔭 Outside diff range comments (1)
assets/clickhouse/methodology.md (1)

3-11: Documentation still references the latest tag—update for version 25.6

Lines 9-11 state the benchmark uses clickhouse/clickhouse-server:latest, which conflicts with the newly-pinned 25.6 image (and could mislead future maintainers).

Please replace the reference with the exact tag/digest used in assets/clickhouse/Dockerfile to keep the methodology accurate.

♻️ Duplicate comments (1)
assets/template/Dockerfile (1)

5-5: Same dependency check & image-size clean-up applies here

The observations made for assets/overhead_test/Dockerfile equally apply to this template Dockerfile: verify run-time deps did not vanish and purge the apt lists after installation to avoid a bloated benchmark image.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22783a0 and 03772f1.

📒 Files selected for processing (17)
  • assets/clickhouse/Dockerfile (1 hunks)
  • assets/clickhouse/main.py (2 hunks)
  • assets/clickhouse/methodology.md (2 hunks)
  • assets/clp/main.py (2 hunks)
  • assets/elasticsearch/Dockerfile (2 hunks)
  • assets/elasticsearch/ingest.py (2 hunks)
  • assets/elasticsearch/main.py (1 hunks)
  • assets/elasticsearch/methodology.md (2 hunks)
  • assets/gzip/Dockerfile (1 hunks)
  • assets/gzip/main.py (1 hunks)
  • assets/overhead_test/Dockerfile (1 hunks)
  • assets/presto_clp/main.py (2 hunks)
  • assets/presto_parquet/Dockerfile (2 hunks)
  • assets/presto_parquet/include/etc_worker/config.properties (1 hunks)
  • assets/presto_parquet/main.py (1 hunks)
  • assets/template/Dockerfile (1 hunks)
  • scripts/benchall.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: PacificViking
PR: y-scope/log-archival-bench#5
File: assets/gzip/Dockerfile:0-0
Timestamp: 2025-07-30T13:55:45.517Z
Learning: In benchmarking Docker files, the --allow-downgrades flag in apt-get install commands is intentionally used to ensure benchmarks can run on specified package versions even when Ubuntu has newer versions, maintaining consistency for reliable benchmark results.
📚 Learning: in benchmarking docker files, the --allow-downgrades flag in apt-get install commands is intentional...
Learnt from: PacificViking
PR: y-scope/log-archival-bench#5
File: assets/gzip/Dockerfile:0-0
Timestamp: 2025-07-30T13:55:45.517Z
Learning: In benchmarking Docker files, the --allow-downgrades flag in apt-get install commands is intentionally used to ensure benchmarks can run on specified package versions even when Ubuntu has newer versions, maintaining consistency for reliable benchmark results.

Applied to files:

  • assets/overhead_test/Dockerfile
  • assets/template/Dockerfile
  • assets/presto_parquet/Dockerfile
  • assets/gzip/Dockerfile
🧬 Code Graph Analysis (4)
assets/clp/main.py (1)
src/template.py (1)
  • Benchmark (55-420)
assets/presto_clp/main.py (1)
src/template.py (1)
  • Benchmark (55-420)
assets/clickhouse/main.py (1)
src/template.py (1)
  • Benchmark (55-420)
assets/gzip/main.py (1)
src/template.py (2)
  • docker_execute (240-272)
  • bench_start (293-307)
🪛 Checkov (3.2.334)
assets/clickhouse/Dockerfile

[LOW] 1-1: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-1: Ensure that a user for the container has been created

(CKV_DOCKER_3)

assets/presto_parquet/Dockerfile

[LOW] 1-62: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-62: Ensure that a user for the container has been created

(CKV_DOCKER_3)



assets/elasticsearch/Dockerfile

[LOW] 1-24: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-24: Ensure that a user for the container has been created

(CKV_DOCKER_3)



🪛 Hadolint (2.12.0)
assets/presto_parquet/Dockerfile

[info] 43-43: Multiple consecutive RUN instructions. Consider consolidation.

(DL3059)


[info] 44-44: Multiple consecutive RUN instructions. Consider consolidation.

(DL3059)


[info] 58-58: Delete the apt-get lists after installing something

(DL3009)


[warning] 60-60: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)


[info] 60-60: Avoid additional packages by specifying --no-install-recommends

(DL3015)


[info] 60-60: Multiple consecutive RUN instructions. Consider consolidation.

(DL3059)

assets/gzip/Dockerfile

[warning] 8-8: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)


[info] 8-8: Delete the apt-get lists after installing something

(DL3009)


[info] 8-8: Avoid additional packages by specifying --no-install-recommends

(DL3015)

assets/elasticsearch/Dockerfile

[info] 14-14: Avoid additional packages by specifying --no-install-recommends

(DL3015)


[info] 14-14: Delete the apt-get lists after installing something

(DL3009)


[warning] 14-14: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)


[warning] 24-24: Pin versions in pip. Instead of pip install <package> use pip install <package>==<version> or pip install --requirement <requirements file>

(DL3013)


[warning] 24-24: Avoid use of cache directory with pip. Use pip install --no-cache-dir <package>

(DL3042)

🪛 Ruff (0.12.2)
assets/presto_parquet/main.py

72-72: Missing return type annotation for public function hive_execute

(ANN201)


72-72: Missing type annotation for function argument query

(ANN001)


72-72: Boolean default positional argument in function definition

(FBT002)


72-72: Missing type annotation for function argument check

(ANN001)


75-75: Missing return type annotation for public function ingest

Add return type annotation: None

(ANN201)

assets/elasticsearch/main.py

39-39: Trailing comma missing

Add trailing comma

(COM812)

assets/gzip/main.py

40-40: Function call with shell=True parameter identified, security issue

(S604)


42-42: Missing return type annotation for public function search

Add return type annotation: int

(ANN201)


42-42: Missing type annotation for function argument query

(ANN001)


46-46: Use of assert detected

(S101)


48-48: Function call with shell=True parameter identified, security issue

(S604)


50-50: Missing return type annotation for public function bench_search

Add return type annotation: None

(ANN201)


50-50: Boolean default positional argument in function definition

(FBT002)


50-50: Missing type annotation for function argument cold

(ANN001)


56-56: Function call with shell=True parameter identified, security issue

(S604)

assets/elasticsearch/ingest.py

13-13: Missing return type annotation for public function pop_by_path

(ANN201)


13-13: Missing type annotation for function argument obj

(ANN001)


13-13: Missing type annotation for function argument path

(ANN001)


81-81: Missing return type annotation for public function ingest_dataset

Add return type annotation: None

(ANN201)


89-89: Trailing comma missing

Add trailing comma

(COM812)


96-96: Trailing comma missing

Add trailing comma

(COM812)


97-97: Trailing comma missing

Add trailing comma

(COM812)


98-98: Trailing comma missing

Add trailing comma

(COM812)


99-99: Trailing comma missing

Add trailing comma

(COM812)


101-101: Trailing comma missing

Add trailing comma

(COM812)

scripts/benchall.py

3-3: assets.clp.main.clp_bench imported but unused

Remove unused import: assets.clp.main.clp_bench

(F401)


4-4: assets.clickhouse.main.clickhouse_bench imported but unused

Remove unused import: assets.clickhouse.main.clickhouse_bench

(F401)


5-5: assets.sparksql.main.sparksql_bench imported but unused

Remove unused import: assets.sparksql.main.sparksql_bench

(F401)


6-6: assets.presto_parquet.main.parquet_bench imported but unused

Remove unused import: assets.presto_parquet.main.parquet_bench

(F401)


7-7: assets.zstandard.main.zstandard_bench imported but unused

Remove unused import: assets.zstandard.main.zstandard_bench

(F401)


9-9: assets.presto_clp.main.presto_clp_bench imported but unused

Remove unused import: assets.presto_clp.main.presto_clp_bench

(F401)


10-10: assets.overhead_test.main.overhead_test_bench imported but unused

Remove unused import: assets.overhead_test.main.overhead_test_bench

(F401)


11-11: assets.gzip.main.gzip_bench imported but unused

Remove unused import: assets.gzip.main.gzip_bench

(F401)


12-12: src.jsonsync.JsonItem imported but unused

Remove unused import: src.jsonsync.JsonItem

(F401)


56-56: Missing return type annotation for public function run

Add return type annotation: None

(ANN201)


56-56: Missing type annotation for function argument bencher

(ANN001)


56-56: Missing type annotation for function argument kwargs

(ANN001)


56-56: Missing type annotation for function argument bench_target

(ANN001)


56-56: Boolean default positional argument in function definition

(FBT002)


56-56: Missing type annotation for function argument attach

(ANN001)


56-56: Boolean default positional argument in function definition

(FBT002)


56-56: Missing type annotation for function argument attach_on_error

(ANN001)


70-70: Do not catch blind exception: Exception

(BLE001)


71-71: datetime.datetime.now() called without a tz argument

(DTZ005)


71-71: Use explicit conversion flag

Replace with conversion flag

(RUF010)


85-86: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

🔇 Additional comments (20)
assets/presto_clp/main.py (1)

19-19: LGTM: Class naming standardization

The class renaming from clp_presto_bench to presto_clp_bench improves consistency with the directory structure and aligns with similar standardization efforts across other benchmark modules.

Also applies to: 105-105

assets/clickhouse/main.py (1)

16-16: LGTM: Simplified class naming

The class renaming from clickhouse_native_json_bench to clickhouse_bench simplifies the naming while maintaining clarity, consistent with the broader naming standardization effort.

Also applies to: 157-157

assets/gzip/Dockerfile (2)

5-5: Good: Migration to official base image

Moving from the custom base image to the official ubuntu:jammy image improves maintainability and reduces dependencies on custom infrastructure.


9-9: Intentional use of --allow-downgrades flag

The --allow-downgrades flag is intentionally used for benchmarking consistency, as noted in previous feedback. This ensures benchmarks can run on specified package versions even when Ubuntu has newer versions available.

assets/elasticsearch/Dockerfile (2)

1-1: Good: Consistent base image migration

Moving to the official ubuntu:jammy base image aligns with similar changes across other Dockerfiles and improves maintainability.


11-12: LGTM: Elasticsearch version upgrade to 9.0.3

The coordinated upgrade of the Elasticsearch APT repository (9.x), package version (9.0.3), and Python client (9.0.3) properly implements the main objective of this PR. Version pinning ensures reproducible builds.

Also applies to: 16-16, 24-24

assets/elasticsearch/ingest.py (4)

13-19: Good: Utility function for nested value extraction

The pop_by_path function cleanly handles extraction of nested JSON values using dot notation, which supports the timestamp extraction feature.


67-72: LGTM: Robust timestamp extraction

The timestamp extraction logic properly handles cases where the specified path doesn't exist by falling back to a default value of 0. This supports the new logs data stream functionality.
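A minimal sketch of that fallback, reusing the pop_by_path helper from ingest.py (the function and variable names here are illustrative, not quoted from the diff):

def extract_timestamp(line: dict, timestamp_path: str):
    # Hedged sketch of the fallback described above; ingest.py may differ in detail.
    try:
        return pop_by_path(line, timestamp_path)  # e.g. timestamp_path = "attrs.timestamp"
    except KeyError:
        return 0  # default when the path is absent

doc = {"msg": "x"}
doc["@timestamp"] = extract_timestamp(doc, "attrs.timestamp")  # -> 0, path absent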


84-103: Excellent: Index template for logs data stream

The index template configuration with "mode": "logsdb" and proper @timestamp field mapping aligns perfectly with Elasticsearch 9.x best practices for log data ingestion. The flexible date format handling and priority setting ensure proper template application.


76-76: Good: Explicit operation type specification

Setting _op_type to "create" ensures documents are only created and not updated, which is appropriate for log ingestion scenarios.
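For context, a hedged sketch of how such bulk actions are commonly shaped with the Python Elasticsearch client (the index name and documents below are illustrative assumptions):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def create_actions(docs, index_name):
    for doc in docs:
        yield {
            "_op_type": "create",  # create-only: never overwrite an existing document
            "_index": index_name,
            "_source": doc,
        }

# Auto-generated IDs make "create" safe for append-only log ingestion.
bulk(es, create_actions([{"@timestamp": 0, "msg": "example"}], "logs-example"))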

assets/elasticsearch/methodology.md (2)

3-3: LGTM! Version update aligns with PR objectives.

The Elasticsearch version upgrade from 8.6.2 to 9.0.3 is consistent with the PR's goal to modernize the tooling.


19-20: Good addition! Clarifies the ingestion approach.

The note about using logs data stream provides valuable context for users understanding the optimized timestamped log handling approach.

assets/gzip/main.py (3)

40-40: LGTM! Shell redirection requires shell=True.

The addition of shell=True is necessary for the shell redirection (>) to work properly with the docker_execute method.


48-48: LGTM! Consistent with shell redirection requirements.

The shell=True parameter is correctly added for the shell command with redirection.


56-56: LGTM! Proper shell command execution.

The shell=True parameter ensures the gunzip command with redirection executes correctly within the Docker container.
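As a hedged illustration of the difference (the command and paths below are made up; docker_execute itself is defined in src/template.py):

# With shell=True the string runs through a shell inside the container,
# so '>' performs redirection:
self.docker_execute("gunzip -c /data/logs.json.gz > /data/logs.json", shell=True)
# Without shell=True, '>' and the target path would be passed to gunzip
# as literal arguments and no redirection would occur.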

assets/elasticsearch/main.py (1)

39-39: Action Required: Confirm timestamp in dataset metadata

The ingestion pipeline now passes dataset_meta['timestamp'] to ingest.py (which defaults missing values to 0). Before merging, please verify that:

  • Every dataset definition loaded by src/template sets a valid timestamp key.
  • Your metadata files (e.g. JSON under your metadata directory) contain the timestamp field at the correct JSON path.

You can run, for example:

fd -e json path/to/your/metadata | xargs -I {} jq 'if has("timestamp") then "✓ {}" else "✗ {}" end' {}
scripts/benchall.py (4)

37-54: LGTM! Focused testing approach for Elasticsearch upgrade.

Commenting out other benchmarks to focus on Elasticsearch testing is appropriate for this PR's scope.


56-56: LGTM! Improved error handling with attach_on_error parameter.

The addition of attach_on_error parameter provides better debugging capabilities when benchmarks fail.


75-75: LGTM! Enhanced exception handling logic.

The updated condition if attach or attach_on_error properly implements the new debugging feature.


85-89: LGTM! Useful dataset filtering capability.

The command-line dataset filtering and attach_on_error=True usage improve the script's flexibility and debugging capabilities.

@@ -1,2 +1 @@
-FROM clickhouse/clickhouse-server:latest
-
+FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f


🧹 Nitpick (assertive)

Good pinning; add HEALTHCHECK & non-root user

Pinning the ClickHouse image by digest is 👍. Consider adding:

  1. A non-root user to follow container hardening guidelines.
  2. A simple HEALTHCHECK so orchestration tools can detect crash-loops.

Example:

 FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f
+
+# Create non-root user
+RUN useradd --system --uid 1001 clickhouse
+USER clickhouse
+
+# Basic liveness probe
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
+  clickhouse-client --query "SELECT 1" || exit 1
📝 Committable suggestion


Suggested change
FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f

# Create non-root user
RUN useradd --system --uid 1001 clickhouse
USER clickhouse

# Basic liveness probe
HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
  clickhouse-client --query "SELECT 1" || exit 1
🧰 Tools
🪛 Checkov (3.2.334)

[LOW] 1-1: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-1: Ensure that a user for the container has been created

(CKV_DOCKER_3)

🤖 Prompt for AI Agents
In assets/clickhouse/Dockerfile at line 1, the base image is pinned by digest
which is good, but the Dockerfile lacks a HEALTHCHECK instruction and does not
switch to a non-root user. To fix this, add a HEALTHCHECK command that
periodically verifies the ClickHouse server is responsive, and create or switch
to a non-root user before running the server to improve container security and
comply with best practices.


**Clearing Caches:**
Before starting any queries we ask ClickHouse to drop its UNCOMPRESSED CACHE and MARK CACHE. We also flush the file system buffers and clear the filesystem caches.
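As a hedged sketch, the steps described above typically amount to something like the following (the repository's actual commands may differ):

self.docker_execute('clickhouse-client --query "SYSTEM DROP UNCOMPRESSED CACHE"')
self.docker_execute('clickhouse-client --query "SYSTEM DROP MARK CACHE"')
self.docker_execute("sync")  # flush file system buffers
self.docker_execute("bash -c 'echo 3 > /proc/sys/vm/drop_caches'")  # drop the page cache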


🧹 Nitpick (assertive)

Minor typo: possessive “ClickHouse’s” vs plural

“ClickHouse’s provided docker images” → “ClickHouse-provided Docker images” for clarity.
Very small nit—feel free to ignore.

🤖 Prompt for AI Agents
In assets/clickhouse/methodology.md at line 70, change the phrase "ClickHouse’s
provided docker images" to "ClickHouse-provided Docker images" to correct the
possessive typo and improve clarity. Also capitalize "Docker" as it is a proper
noun.

 CLP_OUT_PATH = f"{WORK_DIR}/archives"
 CLP_S_BINARY = "/clp/clp-s"
-class clp_s_bench(Benchmark):
+class clp_bench(Benchmark):


🧹 Nitpick (assertive)

Rename class to PascalCase to respect PEP-8

PEP-8 prescribes CapWords for class names. Renaming clp_bench → ClpBench will align this file with the rest of the codebase and with other renamed benchmark classes.

-class clp_bench(Benchmark):
+class ClpBench(Benchmark):

You will also need to adjust the instantiation on Line 64 and any external imports.

📝 Committable suggestion


Suggested change
-class clp_bench(Benchmark):
+class ClpBench(Benchmark):
🤖 Prompt for AI Agents
In assets/clp/main.py at line 8, rename the class from clp_bench to ClpBench to
follow PEP-8 PascalCase naming conventions for classes. Also update any
instantiations of this class around line 64 and adjust any external imports or
references to use the new class name ClpBench.


 def main():
-    bench = clp_s_bench(sys.argv[1])
+    bench = clp_bench(sys.argv[1])


🧹 Nitpick (assertive)

Instantiation must match updated class name

If you adopt the PascalCase rename, remember to update the constructor here:

-    bench = clp_bench(sys.argv[1])
+    bench = ClpBench(sys.argv[1])
📝 Committable suggestion


Suggested change
-bench = clp_bench(sys.argv[1])
+bench = ClpBench(sys.argv[1])
🤖 Prompt for AI Agents
In assets/clp/main.py at line 64, the instantiation of the class uses the old
class name clp_bench. Update this line to use the new PascalCase class name by
replacing clp_bench with ClpBench to match the renamed class constructor.

Comment on lines +13 to +19
def pop_by_path(obj, path):
    keys = path.split('.')
    for key in keys[:-1]:
        obj = obj[key]
    out = obj[keys[-1]]
    del obj[keys[-1]]
    return out


🧹 Nitpick (assertive)

Optional: Add type annotations for better code documentation

Consider adding type annotations to improve code readability and enable better IDE support.

-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> any:
📝 Committable suggestion


Suggested change
-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> any:
     keys = path.split('.')
     for key in keys[:-1]:
         obj = obj[key]
     out = obj[keys[-1]]
     del obj[keys[-1]]
     return out
🧰 Tools
🪛 Ruff (0.12.2)

13-13: Missing return type annotation for public function pop_by_path

(ANN201)


13-13: Missing type annotation for function argument obj

(ANN001)


13-13: Missing type annotation for function argument path

(ANN001)

🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 13 to 19, the function
pop_by_path lacks type annotations. Add appropriate type hints for the function
parameters and return type, such as specifying that obj is a dictionary with
string keys and any type of values, path is a string, and the return type is
any, to improve code readability and IDE support.

@@ -1,5 +1,5 @@
 discovery.uri=http://127.0.0.1:8080
-presto.version=0.293-SNAPSHOT-faae543
+presto.version=0.293-89ce0f3


🧹 Nitpick (assertive)

Version string now diverges from directory naming used elsewhere

presto.version embeds the commit hash (0.293-89ce0f3), while the paths hard-coded in main.py (Lines 66 & 72) refer to presto-server-0.293 and presto-cli-0.293-executable.jar. This mismatch is harmless at runtime but will cause confusion when the next bump happens.

Recommend either:

  1. Derive the runtime paths from this property, or
  2. Amend the property to exactly match the artefact directory name.

This keeps observability tools and grep searches consistent.

🤖 Prompt for AI Agents
In assets/presto_parquet/include/etc_worker/config.properties at line 2, the
presto.version property includes a commit hash making it differ from the
directory names used in main.py lines 66 and 72. To fix this, either update
presto.version to only include the version number matching the directory names
(e.g., "0.293") or modify main.py to derive the directory paths dynamically from
the full presto.version property to ensure consistency and avoid confusion in
observability and searches.

Comment on lines +66 to 70
 self.docker_execute("bash -c \"python3 /home/presto/presto-server/target/presto-server-0.293/bin/launcher.py run --etc-dir=/home/include/etc_coordinator\"", background=True)
 self.wait_for_port(8080)
-self.docker_execute("/home/presto/presto-native-execution/build/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
+self.docker_execute("/home/presto/presto-native-execution/_build/release/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
 self.wait_for_port(7777)
 time.sleep(60) # this needs to be more than 10


🛠️ Refactor suggestion

Hard-coded versioned paths make future bumps brittle

/home/presto/presto-server/target/presto-server-0.293/... is embedded directly in the command. The same string occurs in multiple places. Every future upgrade will require hunting for these literals.

-"/home/presto/presto-server/target/presto-server-0.293/bin/launcher.py"
+f"/home/presto/presto-server/target/presto-server-{self.properties['presto_version']}/bin/launcher.py"

Populate presto_version once (e.g., from etc_worker/config.properties) and reuse it.
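A minimal sketch of that approach (the helper name is hypothetical; the config path is taken from the etc_worker directory referenced above):

def read_presto_version(path="/home/include/etc_worker/config.properties"):
    with open(path) as f:
        for line in f:
            if line.startswith("presto.version="):
                return line.split("=", 1)[1].strip()
    raise ValueError(f"presto.version not found in {path}")

presto_version = read_presto_version()          # e.g. "0.293-89ce0f3"
base_version = presto_version.split("-", 1)[0]  # "0.293", matching the artefact directories
launcher = f"/home/presto/presto-server/target/presto-server-{base_version}/bin/launcher.py"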

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 66 to 70, the presto server
version "0.293" is hard-coded in multiple command strings, making future
upgrades error-prone. To fix this, read the presto_version once from
etc_worker/config.properties or a similar config file, store it in a variable,
and then use that variable to construct the versioned paths dynamically in the
docker_execute commands. This eliminates repeated literals and simplifies
version bumps.

Comment on lines +68 to 70
self.docker_execute("/home/presto/presto-native-execution/_build/release/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
self.wait_for_port(7777)
time.sleep(60) # this needs to be more than 10


🧹 Nitpick (assertive)

Repeated polling sleep hides readiness issues

A fixed time.sleep(60) after launching the native engine masks startup failures and slows fast boots. Prefer an explicit health probe loop (e.g., curl v1/info or check the TCP port) with a timeout.

This will surface crashes early and shave minutes off CI when the service starts quickly.
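A minimal sketch of such a probe, assuming the worker's HTTP endpoint mentioned above (URL and timeouts are illustrative):

import time
import urllib.request

def wait_until_ready(url="http://127.0.0.1:7777/v1/info", timeout=120, interval=2):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return  # service is up; no need to wait the full minute
        except OSError:
            pass  # not up yet: connection refused, reset, or timed out
        time.sleep(interval)
    raise TimeoutError(f"service at {url} not ready within {timeout}s")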

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 68 to 70, replace the fixed
time.sleep(60) after starting the native engine with a loop that explicitly
checks service readiness by polling a health endpoint like v1/info or verifying
the TCP port is open, with a timeout to avoid indefinite waiting. This change
will detect startup failures promptly and reduce unnecessary delays during fast
startups.

Comment on lines 72 to 80
 def hive_execute(self, query, check=True):
-    return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)
+    return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)

 def ingest(self):
     """
     Ingests the dataset at self.datasets_path
     """
-    self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')
+    self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')



🧹 Nitpick (assertive)

Add type annotations + avoid positional boolean default

Static analysis (Ruff ANN/F-BT) flags hive_execute/ingest. Low-cost improvement:

-    def hive_execute(self, query, check=True):
+    def hive_execute(self, query: str, check: bool = True) -> str:
...
-    def ingest(self):
+    def ingest(self) -> None:

Improves IDE autocomplete and enforces the public API.

📝 Committable suggestion


Suggested change
def hive_execute(self, query: str, check: bool = True) -> str:
    return self.docker_execute(
        f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
        f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"',
        check,
    )

def ingest(self) -> None:
    """
    Ingests the dataset at self.datasets_path
    """
    self.docker_execute(
        f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
        f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute '
        f'"CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"',
    )
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Missing return type annotation for public function hive_execute

(ANN201)


72-72: Missing type annotation for function argument query

(ANN001)


72-72: Boolean default positional argument in function definition

(FBT002)


72-72: Missing type annotation for function argument check

(ANN001)


75-75: Missing return type annotation for public function ingest

Add return type annotation: None

(ANN201)

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 72 to 80, the methods hive_execute
and ingest lack type annotations and use a positional boolean default parameter.
Add explicit type annotations for the method parameters and return types, and
change the boolean parameter in hive_execute to a keyword-only argument with a
default value to avoid positional boolean defaults. This will improve static
analysis, IDE autocomplete, and clarify the public API.

Comment on lines +3 to 11
 from assets.clp.main import clp_bench
 from assets.clickhouse.main import clickhouse_bench
 from assets.sparksql.main import sparksql_bench
-from assets.parquet.main import parquet_bench
+from assets.presto_parquet.main import parquet_bench
 from assets.zstandard.main import zstandard_bench
 from assets.elasticsearch.main import elasticsearch_bench
-from assets.clp_presto.main import clp_presto_bench
+from assets.presto_clp.main import presto_clp_bench
 from assets.overhead_test.main import overhead_test_bench
 from assets.gzip.main import gzip_bench


🧹 Nitpick (assertive)

Clean up unused imports.

Most benchmark imports are unused since the corresponding benchmarks are commented out. Consider removing unused imports to keep the code clean.

-from assets.clp.main import clp_bench
-from assets.clickhouse.main import clickhouse_bench
-from assets.sparksql.main import sparksql_bench
-from assets.presto_parquet.main import parquet_bench
-from assets.zstandard.main import zstandard_bench
 from assets.elasticsearch.main import elasticsearch_bench
-from assets.presto_clp.main import presto_clp_bench
-from assets.overhead_test.main import overhead_test_bench
-from assets.gzip.main import gzip_bench
-from src.jsonsync import JsonItem

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.12.2)

3-3: assets.clp.main.clp_bench imported but unused

Remove unused import: assets.clp.main.clp_bench

(F401)


4-4: assets.clickhouse.main.clickhouse_bench imported but unused

Remove unused import: assets.clickhouse.main.clickhouse_bench

(F401)


5-5: assets.sparksql.main.sparksql_bench imported but unused

Remove unused import: assets.sparksql.main.sparksql_bench

(F401)


6-6: assets.presto_parquet.main.parquet_bench imported but unused

Remove unused import: assets.presto_parquet.main.parquet_bench

(F401)


7-7: assets.zstandard.main.zstandard_bench imported but unused

Remove unused import: assets.zstandard.main.zstandard_bench

(F401)


9-9: assets.presto_clp.main.presto_clp_bench imported but unused

Remove unused import: assets.presto_clp.main.presto_clp_bench

(F401)


10-10: assets.overhead_test.main.overhead_test_bench imported but unused

Remove unused import: assets.overhead_test.main.overhead_test_bench

(F401)


11-11: assets.gzip.main.gzip_bench imported but unused

Remove unused import: assets.gzip.main.gzip_bench

(F401)

🤖 Prompt for AI Agents
In scripts/benchall.py between lines 3 and 11, several benchmark imports are
unused because their corresponding benchmark calls are commented out. Identify
which imported benchmark functions are not used anywhere in the file and remove
those import lines to clean up the code and avoid unnecessary dependencies.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (2)
assets/elasticsearch/ingest.py (1)

13-19: Fix error handling for missing intermediate keys.

The function will raise KeyError if any intermediate key doesn't exist, but the caller expects this only for the final key. Consider adding proper error handling for missing intermediate keys.

As noted in the previous review, this function still lacks type annotations.

-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> any:
     keys = path.split('.')
-    for key in keys[:-1]:
-        obj = obj[key]
+    for key in keys[:-1]:
+        if key not in obj:
+            raise KeyError(f"Key '{key}' not found in path '{path}'")
+        obj = obj[key]
+    if keys[-1] not in obj:
+        raise KeyError(f"Key '{keys[-1]}' not found in path '{path}'")
     out = obj[keys[-1]]
     del obj[keys[-1]]
     return out
scripts/benchall.py (1)

3-11: Clean up unused imports.

As noted in the previous review, most benchmark imports are unused since the corresponding benchmarks are commented out in the benchmarks list. The issue is now more pronounced with only elasticsearch benchmarks enabled.

Consider removing unused imports to keep the code clean and avoid unnecessary dependencies.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03772f1 and 5802501.

📒 Files selected for processing (3)
  • assets/elasticsearch/ingest.py (2 hunks)
  • assets/elasticsearch/main.py (2 hunks)
  • scripts/benchall.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
scripts/benchall.py (5)
assets/elasticsearch/main.py (2)
  • main (88-90)
  • elasticsearch_bench (14-85)
assets/clp/main.py (2)
  • main (63-65)
  • clp_bench (8-60)
assets/presto_clp/main.py (2)
  • main (104-106)
  • presto_clp_bench (19-101)
assets/presto_parquet/main.py (1)
  • parquet_bench (18-145)
src/jsonsync.py (1)
  • JsonItem (5-101)
assets/elasticsearch/main.py (1)
src/template.py (1)
  • docker_execute (240-272)
🪛 Ruff (0.12.2)
scripts/benchall.py

3-3: assets.clp.main.clp_bench imported but unused

Remove unused import: assets.clp.main.clp_bench

(F401)


4-4: assets.clickhouse.main.clickhouse_bench imported but unused

Remove unused import: assets.clickhouse.main.clickhouse_bench

(F401)


5-5: assets.sparksql.main.sparksql_bench imported but unused

Remove unused import: assets.sparksql.main.sparksql_bench

(F401)


6-6: assets.presto_parquet.main.parquet_bench imported but unused

Remove unused import: assets.presto_parquet.main.parquet_bench

(F401)


7-7: assets.zstandard.main.zstandard_bench imported but unused

Remove unused import: assets.zstandard.main.zstandard_bench

(F401)


9-9: assets.presto_clp.main.presto_clp_bench imported but unused

Remove unused import: assets.presto_clp.main.presto_clp_bench

(F401)


10-10: assets.overhead_test.main.overhead_test_bench imported but unused

Remove unused import: assets.overhead_test.main.overhead_test_bench

(F401)


11-11: assets.gzip.main.gzip_bench imported but unused

Remove unused import: assets.gzip.main.gzip_bench

(F401)


12-12: src.jsonsync.JsonItem imported but unused

Remove unused import: src.jsonsync.JsonItem

(F401)


57-57: Missing return type annotation for public function run

Add return type annotation: None

(ANN201)


57-57: Missing type annotation for function argument bencher

(ANN001)


57-57: Missing type annotation for function argument kwargs

(ANN001)


57-57: Missing type annotation for function argument bench_target

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach_on_error

(ANN001)


71-71: Do not catch blind exception: Exception

(BLE001)


72-72: datetime.datetime.now() called without a tz argument

(DTZ005)


72-72: Use explicit conversion flag

Replace with conversion flag

(RUF010)


86-87: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

assets/elasticsearch/ingest.py

13-13: Missing return type annotation for public function pop_by_path

(ANN201)


13-13: Missing type annotation for function argument obj

(ANN001)


13-13: Missing type annotation for function argument path

(ANN001)


81-81: Missing return type annotation for public function ingest_dataset

Add return type annotation: None

(ANN201)


90-90: Trailing comma missing

Add trailing comma

(COM812)


97-97: Trailing comma missing

Add trailing comma

(COM812)


98-98: Trailing comma missing

Add trailing comma

(COM812)


99-99: Trailing comma missing

Add trailing comma

(COM812)


100-100: Trailing comma missing

Add trailing comma

(COM812)


102-102: Trailing comma missing

Add trailing comma

(COM812)

assets/elasticsearch/main.py

16-16: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


16-16: Missing type annotation for function argument dataset

(ANN001)


16-16: Boolean default positional argument in function definition

(FBT002)


16-16: Missing type annotation for function argument logsdb

(ANN001)


43-46: Use ternary operator logsdb = "anything" if self.logsdb else "no_logsdb" instead of if-else-block

Replace if-else-block with logsdb = "anything" if self.logsdb else "no_logsdb"

(SIM108)


49-49: Trailing comma missing

Add trailing comma

(COM812)

🔇 Additional comments (3)
assets/elasticsearch/ingest.py (2)

67-72: LGTM!

The timestamp extraction logic is well-implemented with appropriate fallback handling for missing timestamp fields.


76-76: LGTM!

Adding _op_type: "create" is a good practice for data ingestion to prevent accidental overwrites.

scripts/benchall.py (1)

37-55: LGTM!

The benchmark selection appropriately focuses on testing the new elasticsearch logsdb functionality with both True and False values, which aligns with the PR objectives to improve Elasticsearch benchmarking.
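A hedged sketch of what that selection could look like (the exact shape of the benchmarks list in scripts/benchall.py is assumed; only the logsdb flag comes from the diff):

benchmarks = [
    {"bencher": elasticsearch_bench, "kwargs": {"logsdb": True}},
    {"bencher": elasticsearch_bench, "kwargs": {"logsdb": False}},  # "no logsdb" baseline
]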

Comment on lines +84 to +104
if sys.argv[3] != "no_logsdb":
    template_body = {
        "index_patterns": [collection_name],
        "template": {
            "settings": {
                "index": {
                    "mode": "logsdb"
                },
            },
            "mappings": {
                "properties": {
                    "@timestamp": {
                        "type": "date",
                        "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
                    }
                }
            }
        },
        "priority": 101
    }
    es.indices.put_index_template(name=collection_name, body=template_body)


🧹 Nitpick (assertive)

LGTM! Consider minor formatting improvements.

The index template logic is well-implemented with comprehensive date format support for the @timestamp field. The conditional creation based on the logsdb flag is appropriate.

Consider adding trailing commas for better diff readability:

                 "settings": {
                     "index": {
-                        "mode": "logsdb"
+                        "mode": "logsdb",
                     },
                 },
                 "mappings": {
                     "properties": {
                         "@timestamp": {
                             "type": "date",
-                            "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
-                            }
+                            "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz",
+                            },
                     }
-                }
+                },
             },
-            "priority": 101
+            "priority": 101,
📝 Committable suggestion


Suggested change
if sys.argv[3] != "no_logsdb":
    template_body = {
        "index_patterns": [collection_name],
        "template": {
            "settings": {
                "index": {
                    "mode": "logsdb",
                },
            },
            "mappings": {
                "properties": {
                    "@timestamp": {
                        "type": "date",
                        "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz",
                    },
                },
            },
        },
        "priority": 101,
    }
    es.indices.put_index_template(name=collection_name, body=template_body)
🧰 Tools
🪛 Ruff (0.12.2)

90-90: Trailing comma missing

Add trailing comma

(COM812)


97-97: Trailing comma missing

Add trailing comma

(COM812)


98-98: Trailing comma missing

Add trailing comma

(COM812)


99-99: Trailing comma missing

Add trailing comma

(COM812)


100-100: Trailing comma missing

Add trailing comma

(COM812)


102-102: Trailing comma missing

Add trailing comma

(COM812)

🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 84 to 104, add trailing commas
after the last elements in dictionaries and lists within the template_body
definition. This includes after the "mode": "logsdb" setting, the "@timestamp"
mapping dictionary, and the overall dictionary entries. Adding these trailing
commas improves diff readability and consistency in formatting.

Comment on lines +16 to +22
def __init__(self, dataset, logsdb=True):
    super().__init__(dataset)

    self.logsdb = logsdb

    if not logsdb:
        self.properties["notes"] = "no logsdb"


🧹 Nitpick (assertive)

LGTM! Consider minor improvements for code quality.

The logsdb parameter addition is well-implemented with appropriate default value and property tracking.

Consider addressing the static analysis suggestions:

-    def __init__(self, dataset, logsdb=True):
+    def __init__(self, dataset: str, *, logsdb: bool = True) -> None:

Using keyword-only argument (*) prevents the boolean parameter pitfall flagged by static analysis.

📝 Committable suggestion


Suggested change
-def __init__(self, dataset, logsdb=True):
+def __init__(self, dataset: str, *, logsdb: bool = True) -> None:
     super().__init__(dataset)

     self.logsdb = logsdb

     if not logsdb:
         self.properties["notes"] = "no logsdb"
🧰 Tools
🪛 Ruff (0.12.2)

16-16: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


16-16: Missing type annotation for function argument dataset

(ANN001)


16-16: Boolean default positional argument in function definition

(FBT002)


16-16: Missing type annotation for function argument logsdb

(ANN001)

🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 16 to 22, the logsdb parameter
should be made keyword-only to avoid boolean parameter pitfalls flagged by
static analysis. Modify the __init__ method signature to include a * before
logsdb, making logsdb a keyword-only argument. This improves code clarity and
prevents accidental positional argument misuse.

Comment on lines +43 to 50
 if self.logsdb:
     logsdb = "anything"
 else:
     logsdb = "no_logsdb"

 self.docker_execute([
-    f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path}"
+    f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
 ])


🧹 Nitpick (assertive)

LGTM! Consider simplifying the conditional logic.

The updated ingest method correctly passes the required arguments to the ingestion script, including the new logsdb flag and timestamp path.

Consider simplifying the conditional logic and adding trailing comma:

-        if self.logsdb:
-            logsdb = "anything"
-        else:
-            logsdb = "no_logsdb"
+        logsdb = "anything" if self.logsdb else "no_logsdb"

         self.docker_execute([
-            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
+            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
             ])
📝 Committable suggestion


Suggested change
logsdb = "anything" if self.logsdb else "no_logsdb"

self.docker_execute([
    f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
])
🧰 Tools
🪛 Ruff (0.12.2)

43-46: Use ternary operator logsdb = "anything" if self.logsdb else "no_logsdb" instead of if-else-block

Replace if-else-block with logsdb = "anything" if self.logsdb else "no_logsdb"

(SIM108)


49-49: Trailing comma missing

Add trailing comma

(COM812)

🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 43 to 50, simplify the conditional
assignment of logsdb by using a single line conditional expression instead of an
if-else block. Also, add a trailing comma to the list passed to
self.docker_execute for consistency and style.

 ]

-def run(bencher, kwargs, bench_target, attach=False):
+def run(bencher, kwargs, bench_target, attach=False, attach_on_error=False):


🧹 Nitpick (assertive)

LGTM! The attach_on_error parameter enhances debugging capabilities.

The new parameter provides useful flexibility for debugging benchmark failures without requiring full attachment mode.

Consider addressing the static analysis suggestions for boolean parameters and type annotations to improve code quality.
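A hedged sketch of the resulting control flow (the driver and attach method names are hypothetical; only the signature matches the diff):

def run(bencher, kwargs, bench_target, attach=False, attach_on_error=False):
    bench = bencher(bench_target, **kwargs)
    try:
        bench.run_everything()  # hypothetical driver call
    except Exception as e:
        print(f"benchmark failed: {e}")
        if attach or attach_on_error:
            bench.attach()  # hypothetical: drop into the container for debugging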

🧰 Tools
🪛 Ruff (0.12.2)

57-57: Missing return type annotation for public function run

Add return type annotation: None

(ANN201)


57-57: Missing type annotation for function argument bencher

(ANN001)


57-57: Missing type annotation for function argument kwargs

(ANN001)


57-57: Missing type annotation for function argument bench_target

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach

(ANN001)


57-57: Boolean default positional argument in function definition

(FBT002)


57-57: Missing type annotation for function argument attach_on_error

(ANN001)

🤖 Prompt for AI Agents
In scripts/benchall.py at line 57, the run function is missing type annotations
for its boolean parameters. Add type annotations for the attach and
attach_on_error parameters, specifying them as bool, and ensure all other
parameters are also properly annotated for clarity and static analysis
compliance.

Comment on lines +86 to +88
if len(sys.argv) > 1:
    if dataset_name != sys.argv[1].strip():
        continue


🧹 Nitpick (assertive)

LGTM! Consider simplifying the conditional logic.

The dataset filtering capability is a useful addition for targeted benchmark execution.

Consider combining the nested if statements as suggested by static analysis:

-        if len(sys.argv) > 1:
-            if dataset_name != sys.argv[1].strip():
-                continue
+        if len(sys.argv) > 1 and dataset_name != sys.argv[1].strip():
+            continue
📝 Committable suggestion


Suggested change
-if len(sys.argv) > 1:
-    if dataset_name != sys.argv[1].strip():
-        continue
+if len(sys.argv) > 1 and dataset_name != sys.argv[1].strip():
+    continue
🧰 Tools
🪛 Ruff (0.12.2)

86-87: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

🤖 Prompt for AI Agents
In scripts/benchall.py at lines 86 to 88, the nested if statements checking
sys.argv length and comparing dataset_name can be combined into a single
conditional statement; refactor these lines to use a single if statement that
performs both checks together for simplicity.

Contributor

@Bill-hbrhbr Bill-hbrhbr left a comment


Need to break up the PR

@PacificViking
Contributor Author

Need to break up the PR

Sorry, this PR was not cherry-picked correctly.

The "lock clickhouse version" commit and the older commits are already merged.

"improve benchall.py only running on one dataset" is an unrelated, small change.

The elasticsearch changes and the later commits are the actual content of this PR.

I don't know how to fix this issue without redoing the cherry pick and force pushing, but I'd like confirmation before I do this.

@Bill-hbrhbr
Contributor

Need to break up the PR

Sorry, this PR was not cherry-picked correctly.

The "lock clickhouse version" commit and the older commits are already merged.

"improve benchall.py only running on one dataset" is an unrelated, small change.

The elasticsearch changes and the later commits are the actual content of this PR.

I don't know how to fix this issue without redoing the cherry pick and force pushing, but I'd like confirmation before I do this.

Hey John @PacificViking, I'm marking this PR as changes requested because we cannot merge it immediately. Now I see where this PR is coming from, but some parts of it still require more refactoring.

I can take it from here. If there's anything I need to ask you, I will DM you. Thanks again for the explanations.
