feat: Bump elasticsearch version and improve methodology. #9
Conversation
Walkthrough

This update revises Dockerfiles to use official or version-pinned base images, upgrades ClickHouse and Elasticsearch versions, and updates corresponding documentation. Several benchmark class names are renamed for consistency. The Elasticsearch ingestion pipeline is enhanced to support timestamp extraction and index template application. The benchmark runner script is refactored for improved error handling and dataset selection.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant benchall.py
    participant BenchmarkClass
    participant DockerContainer
    User->>benchall.py: Run script with optional dataset argument
    benchall.py->>BenchmarkClass: Initialize benchmark (e.g., elasticsearch_bench)
    loop For each dataset
        benchall.py->>BenchmarkClass: run(bench_target, attach, attach_on_error)
        BenchmarkClass->>DockerContainer: Execute ingestion/search commands
        alt Error occurs
            benchall.py->>DockerContainer: Attach if attach_on_error is True
        end
    end
```

```mermaid
sequenceDiagram
    participant ingest.py
    participant Elasticsearch
    participant Dataset
    ingest.py->>Dataset: Read each JSON line
    ingest.py->>ingest.py: pop_by_path(line, timestamp_path)
    alt Timestamp found
        ingest.py->>ingest.py: Set @timestamp to extracted value
    else Not found
        ingest.py->>ingest.py: Set @timestamp to 0
    end
    ingest.py->>Elasticsearch: Create index template (logsdb, mapping, priority)
    ingest.py->>Elasticsearch: Bulk ingest with _op_type="create"
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers
Actionable comments posted: 13
🔭 Outside diff range comments (1)

assets/clickhouse/methodology.md (1)

Lines 3-11: Documentation still references the latest tag; update for version 25.6

Lines 9-11 state the benchmark uses clickhouse/clickhouse-server:latest, which conflicts with the newly-pinned 25.6 image (and could mislead future maintainers). Please replace the reference with the exact tag/digest used in assets/clickhouse/Dockerfile to keep the methodology accurate.
♻️ Duplicate comments (1)

assets/template/Dockerfile (1)

Line 5: Same dependency check & image-size clean-up applies here

The observations made for assets/overhead_test/Dockerfile equally apply to this template Dockerfile: verify run-time deps did not vanish and purge the apt lists after installation to avoid a bloated benchmark image.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (17)
- assets/clickhouse/Dockerfile (1 hunks)
- assets/clickhouse/main.py (2 hunks)
- assets/clickhouse/methodology.md (2 hunks)
- assets/clp/main.py (2 hunks)
- assets/elasticsearch/Dockerfile (2 hunks)
- assets/elasticsearch/ingest.py (2 hunks)
- assets/elasticsearch/main.py (1 hunks)
- assets/elasticsearch/methodology.md (2 hunks)
- assets/gzip/Dockerfile (1 hunks)
- assets/gzip/main.py (1 hunks)
- assets/overhead_test/Dockerfile (1 hunks)
- assets/presto_clp/main.py (2 hunks)
- assets/presto_parquet/Dockerfile (2 hunks)
- assets/presto_parquet/include/etc_worker/config.properties (1 hunks)
- assets/presto_parquet/main.py (1 hunks)
- assets/template/Dockerfile (1 hunks)
- scripts/benchall.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: PacificViking
PR: y-scope/log-archival-bench#5
File: assets/gzip/Dockerfile:0-0
Timestamp: 2025-07-30T13:55:45.517Z
Learning: In benchmarking Docker files, the --allow-downgrades flag in apt-get install commands is intentionally used to ensure benchmarks can run on specified package versions even when Ubuntu has newer versions, maintaining consistency for reliable benchmark results.
📚 Learning: in benchmarking docker files, the --allow-downgrades flag in apt-get install commands is intentional...
Learnt from: PacificViking
PR: y-scope/log-archival-bench#5
File: assets/gzip/Dockerfile:0-0
Timestamp: 2025-07-30T13:55:45.517Z
Learning: In benchmarking Docker files, the --allow-downgrades flag in apt-get install commands is intentionally used to ensure benchmarks can run on specified package versions even when Ubuntu has newer versions, maintaining consistency for reliable benchmark results.
Applied to files:
assets/overhead_test/Dockerfile
assets/template/Dockerfile
assets/presto_parquet/Dockerfile
assets/gzip/Dockerfile
🧬 Code Graph Analysis (4)

assets/clp/main.py (1)
  src/template.py (1): Benchmark (55-420)

assets/presto_clp/main.py (1)
  src/template.py (1): Benchmark (55-420)

assets/clickhouse/main.py (1)
  src/template.py (1): Benchmark (55-420)

assets/gzip/main.py (1)
  src/template.py (2): docker_execute (240-272), bench_start (293-307)
🪛 Checkov (3.2.334)
assets/clickhouse/Dockerfile
[LOW] 1-1: Ensure that HEALTHCHECK instructions have been added to container images
(CKV_DOCKER_2)
[LOW] 1-1: Ensure that a user for the container has been created
(CKV_DOCKER_3)
assets/presto_parquet/Dockerfile
[LOW] 1-62: Ensure that HEALTHCHECK instructions have been added to container images
(CKV_DOCKER_2)
[LOW] 1-62: Ensure that a user for the container has been created
(CKV_DOCKER_3)
assets/elasticsearch/Dockerfile
[LOW] 1-24: Ensure that HEALTHCHECK instructions have been added to container images
(CKV_DOCKER_2)
[LOW] 1-24: Ensure that a user for the container has been created
(CKV_DOCKER_3)
🪛 Hadolint (2.12.0)
assets/presto_parquet/Dockerfile
[info] 43-43: Multiple consecutive RUN instructions. Consider consolidation.
(DL3059)
[info] 44-44: Multiple consecutive RUN instructions. Consider consolidation.
(DL3059)
[info] 58-58: Delete the apt-get lists after installing something
(DL3009)
[warning] 60-60: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
(DL3008)
[info] 60-60: Avoid additional packages by specifying --no-install-recommends
(DL3015)
[info] 60-60: Multiple consecutive RUN instructions. Consider consolidation.
(DL3059)
assets/gzip/Dockerfile
[warning] 8-8: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
(DL3008)
[info] 8-8: Delete the apt-get lists after installing something
(DL3009)
[info] 8-8: Avoid additional packages by specifying --no-install-recommends
(DL3015)
assets/elasticsearch/Dockerfile
[info] 14-14: Avoid additional packages by specifying --no-install-recommends
(DL3015)
[info] 14-14: Delete the apt-get lists after installing something
(DL3009)
[warning] 14-14: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
(DL3008)
[warning] 24-24: Pin versions in pip. Instead of pip install <package> use pip install <package>==<version> or pip install --requirement <requirements file>
(DL3013)
[warning] 24-24: Avoid use of cache directory with pip. Use pip install --no-cache-dir <package>
(DL3042)
🪛 Ruff (0.12.2)
assets/presto_parquet/main.py
72-72: Missing return type annotation for public function hive_execute
(ANN201)
72-72: Missing type annotation for function argument query
(ANN001)
72-72: Boolean default positional argument in function definition
(FBT002)
72-72: Missing type annotation for function argument check
(ANN001)
75-75: Missing return type annotation for public function ingest
Add return type annotation: None
(ANN201)
assets/elasticsearch/main.py
39-39: Trailing comma missing
Add trailing comma
(COM812)
assets/gzip/main.py
40-40: Function call with shell=True parameter identified, security issue
(S604)
42-42: Missing return type annotation for public function search
Add return type annotation: int
(ANN201)
42-42: Missing type annotation for function argument query
(ANN001)
46-46: Use of assert detected
(S101)
48-48: Function call with shell=True parameter identified, security issue
(S604)
50-50: Missing return type annotation for public function bench_search
Add return type annotation: None
(ANN201)
50-50: Boolean default positional argument in function definition
(FBT002)
50-50: Missing type annotation for function argument cold
(ANN001)
56-56: Function call with shell=True parameter identified, security issue
(S604)
assets/elasticsearch/ingest.py
13-13: Missing return type annotation for public function pop_by_path
(ANN201)
13-13: Missing type annotation for function argument obj
(ANN001)
13-13: Missing type annotation for function argument path
(ANN001)
81-81: Missing return type annotation for public function ingest_dataset
Add return type annotation: None
(ANN201)
89-89: Trailing comma missing
Add trailing comma
(COM812)
96-96: Trailing comma missing
Add trailing comma
(COM812)
97-97: Trailing comma missing
Add trailing comma
(COM812)
98-98: Trailing comma missing
Add trailing comma
(COM812)
99-99: Trailing comma missing
Add trailing comma
(COM812)
101-101: Trailing comma missing
Add trailing comma
(COM812)
scripts/benchall.py
3-3: assets.clp.main.clp_bench imported but unused
Remove unused import: assets.clp.main.clp_bench
(F401)
4-4: assets.clickhouse.main.clickhouse_bench imported but unused
Remove unused import: assets.clickhouse.main.clickhouse_bench
(F401)
5-5: assets.sparksql.main.sparksql_bench imported but unused
Remove unused import: assets.sparksql.main.sparksql_bench
(F401)
6-6: assets.presto_parquet.main.parquet_bench imported but unused
Remove unused import: assets.presto_parquet.main.parquet_bench
(F401)
7-7: assets.zstandard.main.zstandard_bench imported but unused
Remove unused import: assets.zstandard.main.zstandard_bench
(F401)
9-9: assets.presto_clp.main.presto_clp_bench imported but unused
Remove unused import: assets.presto_clp.main.presto_clp_bench
(F401)
10-10: assets.overhead_test.main.overhead_test_bench imported but unused
Remove unused import: assets.overhead_test.main.overhead_test_bench
(F401)
11-11: assets.gzip.main.gzip_bench imported but unused
Remove unused import: assets.gzip.main.gzip_bench
(F401)
12-12: src.jsonsync.JsonItem imported but unused
Remove unused import: src.jsonsync.JsonItem
(F401)
56-56: Missing return type annotation for public function run
Add return type annotation: None
(ANN201)
56-56: Missing type annotation for function argument bencher
(ANN001)
56-56: Missing type annotation for function argument kwargs
(ANN001)
56-56: Missing type annotation for function argument bench_target
(ANN001)
56-56: Boolean default positional argument in function definition
(FBT002)
56-56: Missing type annotation for function argument attach
(ANN001)
56-56: Boolean default positional argument in function definition
(FBT002)
56-56: Missing type annotation for function argument attach_on_error
(ANN001)
70-70: Do not catch blind exception: Exception
(BLE001)
71-71: datetime.datetime.now() called without a tz argument
(DTZ005)
71-71: Use explicit conversion flag
Replace with conversion flag
(RUF010)
85-86: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
🔇 Additional comments (20)

assets/presto_clp/main.py (1)

Line 19: LGTM: Class naming standardization

The class renaming from clp_presto_bench to presto_clp_bench improves consistency with the directory structure and aligns with similar standardization efforts across other benchmark modules. Also applies to: 105-105.

assets/clickhouse/main.py (1)

Line 16: LGTM: Simplified class naming

The class renaming from clickhouse_native_json_bench to clickhouse_bench simplifies the naming while maintaining clarity, consistent with the broader naming standardization effort. Also applies to: 157-157.
assets/gzip/Dockerfile (2)

Line 5: Good: Migration to official base image

Moving from the custom base image to the official ubuntu:jammy image improves maintainability and reduces dependencies on custom infrastructure.

Line 9: Intentional use of --allow-downgrades flag

The --allow-downgrades flag is intentionally used for benchmarking consistency, as noted in previous feedback. This ensures benchmarks can run on specified package versions even when Ubuntu has newer versions available.

assets/elasticsearch/Dockerfile (2)

Line 1: Good: Consistent base image migration

Moving to the official ubuntu:jammy base image aligns with similar changes across other Dockerfiles and improves maintainability.

Lines 11-12: LGTM: Elasticsearch version upgrade to 9.0.3

The coordinated upgrade of the Elasticsearch APT repository (9.x), package version (9.0.3), and Python client (9.0.3) properly implements the main objective of this PR. Version pinning ensures reproducible builds. Also applies to: 16-16, 24-24.
assets/elasticsearch/ingest.py (4)

Lines 13-19: Good: Utility function for nested value extraction

The pop_by_path function cleanly handles extraction of nested JSON values using dot notation, which supports the timestamp extraction feature.

Lines 67-72: LGTM: Robust timestamp extraction

The timestamp extraction logic properly handles cases where the specified path doesn't exist by falling back to a default value of 0. This supports the new logs data stream functionality.

Lines 84-103: Excellent: Index template for logs data stream

The index template configuration with "mode": "logsdb" and a proper @timestamp field mapping aligns with Elasticsearch 9.x best practices for log data ingestion. The flexible date format handling and priority setting ensure proper template application.

Line 76: Good: Explicit operation type specification

Setting _op_type to "create" ensures documents are only created and not updated, which is appropriate for log ingestion scenarios.

assets/elasticsearch/methodology.md (2)

Line 3: LGTM! Version update aligns with PR objectives.

The Elasticsearch version upgrade from 8.6.2 to 9.0.3 is consistent with the PR's goal to modernize the tooling.

Lines 19-20: Good addition! Clarifies the ingestion approach.

The note about using a logs data stream provides valuable context for users understanding the optimized timestamped log handling approach.
assets/gzip/main.py (3)

Line 40: LGTM! Shell redirection requires shell=True.

The addition of shell=True is necessary for the shell redirection (>) to work properly with the docker_execute method.

Line 48: LGTM! Consistent with shell redirection requirements.

The shell=True parameter is correctly added for the shell command with redirection.

Line 56: LGTM! Proper shell command execution.

The shell=True parameter ensures the gunzip command with redirection executes correctly within the Docker container.

assets/elasticsearch/main.py (1)

Line 39: Action Required: Confirm timestamp in dataset metadata

The ingestion pipeline now passes dataset_meta['timestamp'] to ingest.py (which defaults missing values to 0). Before merging, please verify that:

- Every dataset definition loaded by src/template sets a valid timestamp key.
- Your metadata files (e.g. JSON under your metadata directory) contain the timestamp field at the correct JSON path.

You can run, for example:

```
fd -e json path/to/your/metadata | xargs -I {} jq 'if has("timestamp") then "✓ {}" else "✗ {}" end' {}
```

scripts/benchall.py (4)
Lines 37-54: LGTM! Focused testing approach for Elasticsearch upgrade.

Commenting out other benchmarks to focus on Elasticsearch testing is appropriate for this PR's scope.

Line 56: LGTM! Improved error handling with attach_on_error parameter.

The addition of the attach_on_error parameter provides better debugging capabilities when benchmarks fail.

Line 75: LGTM! Enhanced exception handling logic.

The updated condition if attach or attach_on_error properly implements the new debugging feature.

Lines 85-89: LGTM! Useful dataset filtering capability.

The command-line dataset filtering and attach_on_error=True usage improve the script's flexibility and debugging capabilities.
```diff
@@ -1,2 +1 @@
-FROM clickhouse/clickhouse-server:latest
-
+FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f
```
🧹 Nitpick (assertive)

Good pinning; add HEALTHCHECK & non-root user

Pinning the ClickHouse image by digest is 👍. Consider adding:

- A non-root user to follow container hardening guidelines.
- A simple HEALTHCHECK so orchestration tools can detect crash-loops.

Example:

```diff
 FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f
+
+# Create non-root user
+RUN useradd --system --uid 1001 clickhouse
+USER clickhouse
+
+# Basic liveness probe
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
+    clickhouse-client --query "SELECT 1" || exit 1
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```dockerfile
FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f

# Create non-root user
RUN useradd --system --uid 1001 clickhouse
USER clickhouse

# Basic liveness probe
HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
    clickhouse-client --query "SELECT 1" || exit 1
```
🧰 Tools
🪛 Checkov (3.2.334)
[LOW] 1-1: Ensure that HEALTHCHECK instructions have been added to container images
(CKV_DOCKER_2)
[LOW] 1-1: Ensure that a user for the container has been created
(CKV_DOCKER_3)
🤖 Prompt for AI Agents
In assets/clickhouse/Dockerfile at line 1, the base image is pinned by digest
which is good, but the Dockerfile lacks a HEALTHCHECK instruction and does not
switch to a non-root user. To fix this, add a HEALTHCHECK command that
periodically verifies the ClickHouse server is responsive, and create or switch
to a non-root user before running the server to improve container security and
comply with best practices.
**Clearing Caches:**
Before starting any queries we ask ClickHouse to drop its UNCOMPRESSED CACHE and MARK CACHE. We also flush the file system buffers and clear the filesystem caches.
🧹 Nitpick (assertive)
Minor typo: possessive “ClickHouse’s” vs plural
“ClickHouse’s provided docker images” → “ClickHouse-provided Docker images” for clarity.
Very small nit—feel free to ignore.
🤖 Prompt for AI Agents
In assets/clickhouse/methodology.md at line 70, change the phrase "ClickHouse’s
provided docker images" to "ClickHouse-provided Docker images" to correct the
possessive typo and improve clarity. Also capitalize "Docker" as it is a proper
noun.
```diff
 CLP_OUT_PATH = f"{WORK_DIR}/archives"
 CLP_S_BINARY = "/clp/clp-s"
-class clp_s_bench(Benchmark):
+class clp_bench(Benchmark):
```
🧹 Nitpick (assertive)

Rename class to PascalCase to respect PEP-8

PEP-8 prescribes CapWords for class names. Renaming clp_bench → ClpBench will align this file with the rest of the codebase and with other renamed benchmark classes.

```diff
-class clp_bench(Benchmark):
+class ClpBench(Benchmark):
```

You will also need to adjust the instantiation on Line 64 and any external imports.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
class ClpBench(Benchmark):
```
🤖 Prompt for AI Agents
In assets/clp/main.py at line 8, rename the class from clp_bench to ClpBench to
follow PEP-8 PascalCase naming conventions for classes. Also update any
instantiations of this class around line 64 and adjust any external imports or
references to use the new class name ClpBench.
```diff
 def main():
-    bench = clp_s_bench(sys.argv[1])
+    bench = clp_bench(sys.argv[1])
```
🧹 Nitpick (assertive)

Instantiation must match updated class name

If you adopt the PascalCase rename, remember to update the constructor here:

```diff
-    bench = clp_bench(sys.argv[1])
+    bench = ClpBench(sys.argv[1])
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    bench = ClpBench(sys.argv[1])
```
🤖 Prompt for AI Agents
In assets/clp/main.py at line 64, the instantiation of the class uses the old
class name clp_bench. Update this line to use the new PascalCase class name by
replacing clp_bench with ClpBench to match the renamed class constructor.
```python
def pop_by_path(obj, path):
    keys = path.split('.')
    for key in keys[:-1]:
        obj = obj[key]
    out = obj[keys[-1]]
    del obj[keys[-1]]
    return out
```
🧹 Nitpick (assertive)

Optional: Add type annotations for better code documentation

Consider adding type annotations to improve code readability and enable better IDE support.

```diff
-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> Any:
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def pop_by_path(obj: dict, path: str) -> Any:  # requires `from typing import Any`
    keys = path.split('.')
    for key in keys[:-1]:
        obj = obj[key]
    out = obj[keys[-1]]
    del obj[keys[-1]]
    return out
```
🧰 Tools
🪛 Ruff (0.12.2)
13-13: Missing return type annotation for public function pop_by_path
(ANN201)
13-13: Missing type annotation for function argument obj
(ANN001)
13-13: Missing type annotation for function argument path
(ANN001)
🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 13 to 19, the function
pop_by_path lacks type annotations. Add appropriate type hints for the function
parameters and return type, such as specifying that obj is a dictionary with
string keys and any type of values, path is a string, and the return type is
any, to improve code readability and IDE support.
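For orientation, here is a quick usage sketch of the helper under review. The record and the "time.iso" path are illustrative only, not taken from the repo:

```python
# pop_by_path as quoted in the hunk above, repeated here so the sketch runs standalone.
def pop_by_path(obj, path):
    keys = path.split('.')
    for key in keys[:-1]:
        obj = obj[key]
    out = obj[keys[-1]]
    del obj[keys[-1]]
    return out

record = {"time": {"iso": "2024-01-01T00:00:00Z"}, "msg": "started"}
ts = pop_by_path(record, "time.iso")
print(ts)      # 2024-01-01T00:00:00Z
print(record)  # {'time': {}, 'msg': 'started'} - the nested key was removed
```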
```diff
@@ -1,5 +1,5 @@
 discovery.uri=http://127.0.0.1:8080
-presto.version=0.293-SNAPSHOT-faae543
+presto.version=0.293-89ce0f3
```
🧹 Nitpick (assertive)

Version string now diverges from directory naming used elsewhere

presto.version embeds the commit hash (0.293-89ce0f3), while the paths hard-coded in main.py (Lines 66 & 72) refer to presto-server-0.293 and presto-cli-0.293-executable.jar. This mismatch is harmless at runtime but will cause confusion when the next bump happens.

Recommend either:

- Derive the runtime paths from this property, or
- Amend the property to exactly match the artefact directory name.

This keeps observability tools and grep searches consistent.
🤖 Prompt for AI Agents
In assets/presto_parquet/include/etc_worker/config.properties at line 2, the
presto.version property includes a commit hash making it differ from the
directory names used in main.py lines 66 and 72. To fix this, either update
presto.version to only include the version number matching the directory names
(e.g., "0.293") or modify main.py to derive the directory paths dynamically from
the full presto.version property to ensure consistency and avoid confusion in
observability and searches.
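If the first option is taken, the derivation could look roughly like this. This is a sketch under assumed file locations; read_presto_version is a hypothetical helper, not existing repo code:

```python
from pathlib import Path

def read_presto_version(props_path: str = "include/etc_worker/config.properties") -> str:
    """Parse presto.version from a Java-style properties file."""
    for line in Path(props_path).read_text().splitlines():
        if line.startswith("presto.version="):
            full = line.split("=", 1)[1].strip()  # e.g. "0.293-89ce0f3"
            return full.split("-", 1)[0]          # keep "0.293" for artefact dirs
    raise ValueError(f"presto.version not found in {props_path}")

# Build the versioned paths once instead of hard-coding them in every command.
version = read_presto_version()
server_launcher = f"/home/presto/presto-server/target/presto-server-{version}/bin/launcher.py"
cli_jar = f"/home/presto/presto-cli/target/presto-cli-{version}-executable.jar"
```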
self.docker_execute("bash -c \"python3 /home/presto/presto-server/target/presto-server-0.293/bin/launcher.py run --etc-dir=/home/include/etc_coordinator\"", background=True) | ||
self.wait_for_port(8080) | ||
self.docker_execute("/home/presto/presto-native-execution/build/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True) | ||
self.docker_execute("/home/presto/presto-native-execution/_build/release/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True) | ||
self.wait_for_port(7777) | ||
time.sleep(60) # this needs to be more than 10 |
🛠️ Refactor suggestion

Hard-coded versioned paths make future bumps brittle

/home/presto/presto-server/target/presto-server-0.293/... is embedded directly in the command, and the same string occurs in multiple places. Every future upgrade will require hunting for these literals.

```diff
-"/home/presto/presto-server/target/presto-server-0.293/bin/launcher.py"
+f"/home/presto/presto-server/target/presto-server-{self.properties['presto_version']}/bin/launcher.py"
```

Populate presto_version once (e.g., from etc_worker/config.properties) and reuse it.

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 66 to 70, the presto server
version "0.293" is hard-coded in multiple command strings, making future
upgrades error-prone. To fix this, read the presto_version once from
etc_worker/config.properties or a similar config file, store it in a variable,
and then use that variable to construct the versioned paths dynamically in the
docker_execute commands. This eliminates repeated literals and simplifies
version bumps.
self.docker_execute("/home/presto/presto-native-execution/_build/release/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True) | ||
self.wait_for_port(7777) | ||
time.sleep(60) # this needs to be more than 10 |
🧹 Nitpick (assertive)

Repeated polling sleep hides readiness issues

A fixed time.sleep(60) after launching the native engine masks startup failures and slows fast boots. Prefer an explicit health probe loop (e.g., curl v1/info or check the TCP port) with a timeout. This will surface crashes early and shave minutes off CI when the service starts quickly.
🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 68 to 70, replace the fixed
time.sleep(60) after starting the native engine with a loop that explicitly
checks service readiness by polling a health endpoint like v1/info or verifying
the TCP port is open, with a timeout to avoid indefinite waiting. This change
will detect startup failures promptly and reduce unnecessary delays during fast
startups.
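As one possible shape for that probe, a minimal sketch; the /v1/info path, port, and timeout values here are assumptions, not taken from the repo:

```python
import time
import urllib.request

def wait_until_ready(url: str = "http://127.0.0.1:7777/v1/info", timeout: float = 120.0) -> None:
    """Poll an HTTP endpoint until it answers, instead of sleeping a fixed 60s."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2):
                return  # server answered; it is up
        except OSError:
            time.sleep(1)  # not ready yet; retry shortly
    raise TimeoutError(f"{url} not ready after {timeout}s")
```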
```diff
     def hive_execute(self, query, check=True):
-        return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)
+        return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)

     def ingest(self):
         """
         Ingests the dataset at self.datasets_path
         """
-        self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')
+        self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')
```
🧹 Nitpick (assertive)

Add type annotations + avoid positional boolean default

Static analysis (Ruff ANN/FBT) flags hive_execute/ingest. Low-cost improvement:

```diff
-    def hive_execute(self, query, check=True):
+    def hive_execute(self, query: str, check: bool = True) -> str:
     ...
-    def ingest(self):
+    def ingest(self) -> None:
```

Improves IDE autocomplete and enforces the public API.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    def hive_execute(self, query: str, check: bool = True) -> str:
        return self.docker_execute(
            f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
            f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"',
            check,
        )

    def ingest(self) -> None:
        """
        Ingests the dataset at self.datasets_path
        """
        self.docker_execute(
            f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
            f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute '
            f'"CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"',
        )
```
🧰 Tools
🪛 Ruff (0.12.2)
72-72: Missing return type annotation for public function hive_execute
(ANN201)
72-72: Missing type annotation for function argument query
(ANN001)
72-72: Boolean default positional argument in function definition
(FBT002)
72-72: Missing type annotation for function argument check
(ANN001)
75-75: Missing return type annotation for public function ingest
Add return type annotation: None
(ANN201)
🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 72 to 80, the methods hive_execute
and ingest lack type annotations and use a positional boolean default parameter.
Add explicit type annotations for the method parameters and return types, and
change the boolean parameter in hive_execute to a keyword-only argument with a
default value to avoid positional boolean defaults. This will improve static
analysis, IDE autocomplete, and clarify the public API.
```diff
 from assets.clp.main import clp_bench
 from assets.clickhouse.main import clickhouse_bench
 from assets.sparksql.main import sparksql_bench
-from assets.parquet.main import parquet_bench
+from assets.presto_parquet.main import parquet_bench
 from assets.zstandard.main import zstandard_bench
 from assets.elasticsearch.main import elasticsearch_bench
-from assets.clp_presto.main import clp_presto_bench
+from assets.presto_clp.main import presto_clp_bench
 from assets.overhead_test.main import overhead_test_bench
 from assets.gzip.main import gzip_bench
```
🧹 Nitpick (assertive)

Clean up unused imports.

Most benchmark imports are unused since the corresponding benchmarks are commented out. Consider removing unused imports to keep the code clean.

```diff
-from assets.clp.main import clp_bench
-from assets.clickhouse.main import clickhouse_bench
-from assets.sparksql.main import sparksql_bench
-from assets.presto_parquet.main import parquet_bench
-from assets.zstandard.main import zstandard_bench
 from assets.elasticsearch.main import elasticsearch_bench
-from assets.presto_clp.main import presto_clp_bench
-from assets.overhead_test.main import overhead_test_bench
-from assets.gzip.main import gzip_bench
-from src.jsonsync import JsonItem
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.12.2)
3-3: assets.clp.main.clp_bench
imported but unused
Remove unused import: assets.clp.main.clp_bench
(F401)
4-4: assets.clickhouse.main.clickhouse_bench
imported but unused
Remove unused import: assets.clickhouse.main.clickhouse_bench
(F401)
5-5: assets.sparksql.main.sparksql_bench
imported but unused
Remove unused import: assets.sparksql.main.sparksql_bench
(F401)
6-6: assets.presto_parquet.main.parquet_bench
imported but unused
Remove unused import: assets.presto_parquet.main.parquet_bench
(F401)
7-7: assets.zstandard.main.zstandard_bench
imported but unused
Remove unused import: assets.zstandard.main.zstandard_bench
(F401)
9-9: assets.presto_clp.main.presto_clp_bench
imported but unused
Remove unused import: assets.presto_clp.main.presto_clp_bench
(F401)
10-10: assets.overhead_test.main.overhead_test_bench
imported but unused
Remove unused import: assets.overhead_test.main.overhead_test_bench
(F401)
11-11: assets.gzip.main.gzip_bench
imported but unused
Remove unused import: assets.gzip.main.gzip_bench
(F401)
🤖 Prompt for AI Agents
In scripts/benchall.py between lines 3 and 11, several benchmark imports are
unused because their corresponding benchmark calls are commented out. Identify
which imported benchmark functions are not used anywhere in the file and remove
those import lines to clean up the code and avoid unnecessary dependencies.
Actionable comments posted: 5
♻️ Duplicate comments (2)

assets/elasticsearch/ingest.py (1)

Lines 13-19: Fix error handling for missing intermediate keys.

The function will raise KeyError if any intermediate key doesn't exist, but the caller expects this only for the final key. Consider adding proper error handling for missing intermediate keys. As noted in the previous review, this function still lacks type annotations.

```diff
-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> Any:
     keys = path.split('.')
-    for key in keys[:-1]:
-        obj = obj[key]
+    for key in keys[:-1]:
+        if key not in obj:
+            raise KeyError(f"Key '{key}' not found in path '{path}'")
+        obj = obj[key]
+    if keys[-1] not in obj:
+        raise KeyError(f"Key '{keys[-1]}' not found in path '{path}'")
     out = obj[keys[-1]]
     del obj[keys[-1]]
     return out
```

scripts/benchall.py (1)

Lines 3-11: Clean up unused imports.

As noted in the previous review, most benchmark imports are unused since the corresponding benchmarks are commented out in the benchmarks list. The issue is now more pronounced with only elasticsearch benchmarks enabled. Consider removing unused imports to keep the code clean and avoid unnecessary dependencies.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- assets/elasticsearch/ingest.py (2 hunks)
- assets/elasticsearch/main.py (2 hunks)
- scripts/benchall.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)

scripts/benchall.py (5)
  assets/elasticsearch/main.py (2): main (88-90), elasticsearch_bench (14-85)
  assets/clp/main.py (2): main (63-65), clp_bench (8-60)
  assets/presto_clp/main.py (2): main (104-106), presto_clp_bench (19-101)
  assets/presto_parquet/main.py (1): parquet_bench (18-145)
  src/jsonsync.py (1): JsonItem (5-101)

assets/elasticsearch/main.py (1)
  src/template.py (1): docker_execute (240-272)
🪛 Ruff (0.12.2)
scripts/benchall.py
3-3: assets.clp.main.clp_bench imported but unused
Remove unused import: assets.clp.main.clp_bench
(F401)
4-4: assets.clickhouse.main.clickhouse_bench imported but unused
Remove unused import: assets.clickhouse.main.clickhouse_bench
(F401)
5-5: assets.sparksql.main.sparksql_bench imported but unused
Remove unused import: assets.sparksql.main.sparksql_bench
(F401)
6-6: assets.presto_parquet.main.parquet_bench imported but unused
Remove unused import: assets.presto_parquet.main.parquet_bench
(F401)
7-7: assets.zstandard.main.zstandard_bench imported but unused
Remove unused import: assets.zstandard.main.zstandard_bench
(F401)
9-9: assets.presto_clp.main.presto_clp_bench imported but unused
Remove unused import: assets.presto_clp.main.presto_clp_bench
(F401)
10-10: assets.overhead_test.main.overhead_test_bench imported but unused
Remove unused import: assets.overhead_test.main.overhead_test_bench
(F401)
11-11: assets.gzip.main.gzip_bench imported but unused
Remove unused import: assets.gzip.main.gzip_bench
(F401)
12-12: src.jsonsync.JsonItem imported but unused
Remove unused import: src.jsonsync.JsonItem
(F401)
57-57: Missing return type annotation for public function run
Add return type annotation: None
(ANN201)
57-57: Missing type annotation for function argument bencher
(ANN001)
57-57: Missing type annotation for function argument kwargs
(ANN001)
57-57: Missing type annotation for function argument bench_target
(ANN001)
57-57: Boolean default positional argument in function definition
(FBT002)
57-57: Missing type annotation for function argument attach
(ANN001)
57-57: Boolean default positional argument in function definition
(FBT002)
57-57: Missing type annotation for function argument attach_on_error
(ANN001)
71-71: Do not catch blind exception: Exception
(BLE001)
72-72: datetime.datetime.now() called without a tz argument
(DTZ005)
72-72: Use explicit conversion flag
Replace with conversion flag
(RUF010)
86-87: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
assets/elasticsearch/ingest.py
13-13: Missing return type annotation for public function pop_by_path
(ANN201)
13-13: Missing type annotation for function argument obj
(ANN001)
13-13: Missing type annotation for function argument path
(ANN001)
81-81: Missing return type annotation for public function ingest_dataset
Add return type annotation: None
(ANN201)
90-90: Trailing comma missing
Add trailing comma
(COM812)
97-97: Trailing comma missing
Add trailing comma
(COM812)
98-98: Trailing comma missing
Add trailing comma
(COM812)
99-99: Trailing comma missing
Add trailing comma
(COM812)
100-100: Trailing comma missing
Add trailing comma
(COM812)
102-102: Trailing comma missing
Add trailing comma
(COM812)
assets/elasticsearch/main.py
16-16: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
16-16: Missing type annotation for function argument dataset
(ANN001)
16-16: Boolean default positional argument in function definition
(FBT002)
16-16: Missing type annotation for function argument logsdb
(ANN001)
43-46: Use ternary operator logsdb = "anything" if self.logsdb else "no_logsdb" instead of if-else block
Replace if-else block with logsdb = "anything" if self.logsdb else "no_logsdb"
(SIM108)
49-49: Trailing comma missing
Add trailing comma
(COM812)
🔇 Additional comments (3)

assets/elasticsearch/ingest.py (2)

Lines 67-72: LGTM!

The timestamp extraction logic is well-implemented with appropriate fallback handling for missing timestamp fields.

Line 76: LGTM!

Adding _op_type: "create" is a good practice for data ingestion to prevent accidental overwrites.
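Pieced together, the two behaviors praised above would look roughly like this per-line action builder. This is a reconstruction for illustration, not the repo's exact code; build_action is a hypothetical name, and pop_by_path is the helper quoted earlier in this review:

```python
import json

def build_action(raw_line: str, timestamp_path: str, index: str) -> dict:
    """One bulk action per log line, with the timestamp popped out of the document."""
    doc = json.loads(raw_line)
    try:
        doc["@timestamp"] = pop_by_path(doc, timestamp_path)  # extract nested timestamp
    except KeyError:
        doc["@timestamp"] = 0  # fallback used when the path is missing
    return {"_op_type": "create", "_index": index, "_source": doc}
```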
scripts/benchall.py (1)

Lines 37-55: LGTM!

The benchmark selection appropriately focuses on testing the new elasticsearch logsdb functionality with both True and False values, which aligns with the PR objectives to improve Elasticsearch benchmarking.
if sys.argv[3] != "no_logsdb": | ||
template_body = { | ||
"index_patterns": [collection_name], | ||
"template": { | ||
"settings": { | ||
"index": { | ||
"mode": "logsdb" | ||
}, | ||
}, | ||
"mappings": { | ||
"properties": { | ||
"@timestamp": { | ||
"type": "date", | ||
"format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz" | ||
} | ||
} | ||
} | ||
}, | ||
"priority": 101 | ||
} | ||
es.indices.put_index_template(name=collection_name, body=template_body) |
🧹 Nitpick (assertive)

LGTM! Consider minor formatting improvements.

The index template logic is well-implemented with comprehensive date format support for the @timestamp field. The conditional creation based on the logsdb flag is appropriate.

Consider adding trailing commas for better diff readability:

```diff
         "settings": {
             "index": {
-                "mode": "logsdb"
+                "mode": "logsdb",
             },
         },
         "mappings": {
             "properties": {
                 "@timestamp": {
                     "type": "date",
-                    "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
-                }
+                    "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz",
+                },
             }
-        }
+        },
     },
-    "priority": 101
+    "priority": 101,
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if sys.argv[3] != "no_logsdb": | |
template_body = { | |
"index_patterns": [collection_name], | |
"template": { | |
"settings": { | |
"index": { | |
"mode": "logsdb" | |
}, | |
}, | |
"mappings": { | |
"properties": { | |
"@timestamp": { | |
"type": "date", | |
"format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz" | |
} | |
} | |
} | |
}, | |
"priority": 101 | |
} | |
es.indices.put_index_template(name=collection_name, body=template_body) | |
if sys.argv[3] != "no_logsdb": | |
template_body = { | |
"index_patterns": [collection_name], | |
"template": { | |
"settings": { | |
"index": { | |
"mode": "logsdb", | |
}, | |
}, | |
"mappings": { | |
"properties": { | |
"@timestamp": { | |
"type": "date", | |
"format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz", | |
}, | |
}, | |
}, | |
}, | |
"priority": 101, | |
} | |
es.indices.put_index_template(name=collection_name, body=template_body) |
🧰 Tools
🪛 Ruff (0.12.2)
90-90: Trailing comma missing
Add trailing comma
(COM812)
97-97: Trailing comma missing
Add trailing comma
(COM812)
98-98: Trailing comma missing
Add trailing comma
(COM812)
99-99: Trailing comma missing
Add trailing comma
(COM812)
100-100: Trailing comma missing
Add trailing comma
(COM812)
102-102: Trailing comma missing
Add trailing comma
(COM812)
🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 84 to 104, add trailing commas
after the last elements in dictionaries and lists within the template_body
definition. This includes after the "mode": "logsdb" setting, the "@timestamp"
mapping dictionary, and the overall dictionary entries. Adding these trailing
commas improves diff readability and consistency in formatting.
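Related, a quick way to sanity-check that the template above actually took effect once an index has been created. A sketch only: the client URL and index name are illustrative, and the template applies only to indices matching its index_patterns:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch the settings of a freshly created index and confirm logsdb mode was inherited.
settings = es.indices.get_settings(index="my_bench_index")
assert settings["my_bench_index"]["settings"]["index"]["mode"] == "logsdb"
```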
```python
    def __init__(self, dataset, logsdb=True):
        super().__init__(dataset)

        self.logsdb = logsdb

        if not logsdb:
            self.properties["notes"] = "no logsdb"
```
🧹 Nitpick (assertive)

LGTM! Consider minor improvements for code quality.

The logsdb parameter addition is well-implemented with appropriate default value and property tracking.

Consider addressing the static analysis suggestions:

```diff
-    def __init__(self, dataset, logsdb=True):
+    def __init__(self, dataset: str, *, logsdb: bool = True) -> None:
```

Using a keyword-only argument (*) prevents the boolean parameter pitfall flagged by static analysis.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    def __init__(self, dataset: str, *, logsdb: bool = True) -> None:
        super().__init__(dataset)

        self.logsdb = logsdb

        if not logsdb:
            self.properties["notes"] = "no logsdb"
```
🧰 Tools
🪛 Ruff (0.12.2)
16-16: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
16-16: Missing type annotation for function argument dataset
(ANN001)
16-16: Boolean default positional argument in function definition
(FBT002)
16-16: Missing type annotation for function argument logsdb
(ANN001)
🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 16 to 22, the logsdb parameter
should be made keyword-only to avoid boolean parameter pitfalls flagged by
static analysis. Modify the __init__ method signature to include a * before
logsdb, making logsdb a keyword-only argument. This improves code clarity and
prevents accidental positional argument misuse.
```diff
         if self.logsdb:
             logsdb = "anything"
         else:
             logsdb = "no_logsdb"

         self.docker_execute([
-            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path}"
+            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
         ])
```
🧹 Nitpick (assertive)

LGTM! Consider simplifying the conditional logic.

The updated ingest method correctly passes the required arguments to the ingestion script, including the new logsdb flag and timestamp path.

Consider simplifying the conditional logic and adding a trailing comma:

```diff
-        if self.logsdb:
-            logsdb = "anything"
-        else:
-            logsdb = "no_logsdb"
+        logsdb = "anything" if self.logsdb else "no_logsdb"

         self.docker_execute([
-            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
+            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
         ])
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
        logsdb = "anything" if self.logsdb else "no_logsdb"

        self.docker_execute([
            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
        ])
```
🧰 Tools
🪛 Ruff (0.12.2)
43-46: Use ternary operator logsdb = "anything" if self.logsdb else "no_logsdb" instead of if-else block
Replace if-else block with logsdb = "anything" if self.logsdb else "no_logsdb"
(SIM108)
49-49: Trailing comma missing
Add trailing comma
(COM812)
🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 43 to 50, simplify the conditional
assignment of logsdb by using a single line conditional expression instead of an
if-else block. Also, add a trailing comma to the list passed to
self.docker_execute for consistency and style.
```diff
 ]

-def run(bencher, kwargs, bench_target, attach=False):
+def run(bencher, kwargs, bench_target, attach=False, attach_on_error=False):
```
🧹 Nitpick (assertive)
LGTM! The attach_on_error parameter enhances debugging capabilities.
The new parameter provides useful flexibility for debugging benchmark failures without requiring full attachment mode.
Consider addressing the static analysis suggestions for boolean parameters and type annotations to improve code quality.
🧰 Tools
🪛 Ruff (0.12.2)
57-57: Missing return type annotation for public function run
Add return type annotation: None
(ANN201)
57-57: Missing type annotation for function argument bencher
(ANN001)
57-57: Missing type annotation for function argument kwargs
(ANN001)
57-57: Missing type annotation for function argument bench_target
(ANN001)
57-57: Boolean default positional argument in function definition
(FBT002)
57-57: Missing type annotation for function argument attach
(ANN001)
57-57: Boolean default positional argument in function definition
(FBT002)
57-57: Missing type annotation for function argument attach_on_error
(ANN001)
🤖 Prompt for AI Agents
In scripts/benchall.py at line 57, the run function is missing type annotations
for its boolean parameters. Add type annotations for the attach and
attach_on_error parameters, specifying them as bool, and ensure all other
parameters are also properly annotated for clarity and static analysis
compliance.
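For illustration only, the control flow implied by attach and attach_on_error might look like the sketch below. The method names here are hypothetical, not the script's actual API:

```python
import datetime

def run(bencher, kwargs, bench_target, attach=False, attach_on_error=False):
    bench = bencher(**kwargs)
    try:
        bench.run(bench_target, attach=attach)  # hypothetical call shape
    except Exception as err:  # broad by design: any benchmark failure is reported
        print(f"[{datetime.datetime.now()}] benchmark failed: {err}")
        if attach or attach_on_error:
            bench.attach()  # hypothetical helper: drop into the container to debug
```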
```python
            if len(sys.argv) > 1:
                if dataset_name != sys.argv[1].strip():
                    continue
```
🧹 Nitpick (assertive)

LGTM! Consider simplifying the conditional logic.

The dataset filtering capability is a useful addition for targeted benchmark execution.

Consider combining the nested if statements as suggested by static analysis:

```diff
-            if len(sys.argv) > 1:
-                if dataset_name != sys.argv[1].strip():
-                    continue
+            if len(sys.argv) > 1 and dataset_name != sys.argv[1].strip():
+                continue
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
            if len(sys.argv) > 1 and dataset_name != sys.argv[1].strip():
                continue
```
🧰 Tools
🪛 Ruff (0.12.2)
86-87: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
🤖 Prompt for AI Agents
In scripts/benchall.py at lines 86 to 88, the nested if statements checking
sys.argv length and comparing dataset_name can be combined into a single
conditional statement; refactor these lines to use a single if statement that
performs both checks together for simplicity.
Need to break up the PR
Sorry, this PR was not cherry-picked correctly. The "lock clickhouse version" and older commits are already merged, and the "improve benchall.py" change (only running on one dataset) is a short, unrelated change. The elasticsearch changes and later commits are the substance of this PR. I don't know how to fix this issue without redoing the cherry-pick and force pushing, but I'd like confirmation before I do this.
Hey John @PacificViking, I'm marking this PR as I can take it from here. If there's anything I need to ask you, I will DM you. Thanks again for the explanations.
Description
Improves the outdated Elasticsearch version and uses the logsdb methodology, as mentioned in #7.
Checklist
breaking change.
Validation performed
New elasticsearch benchmark runs.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Documentation