3 changes: 1 addition & 2 deletions assets/clickhouse/Dockerfile
@@ -1,2 +1 @@
FROM clickhouse/clickhouse-server:latest

FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f

🧹 Nitpick (assertive)

Good pinning; add HEALTHCHECK & non-root user

Pinning the ClickHouse image by digest is 👍. Consider adding:

  1. A non-root user to follow container hardening guidelines.
  2. A simple HEALTHCHECK so orchestration tools can detect crash-loops.

Example:

 FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f
+
+# Create non-root user
+RUN useradd --system --uid 1001 clickhouse
+USER clickhouse
+
+# Basic liveness probe
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
+  clickhouse-client --query "SELECT 1" || exit 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f
FROM clickhouse/clickhouse-server:25.6@sha256:7566eea413755c8fc43a7b39078b05f16ed350bb33c98f2f087c5f66b0e8163f
# Create non-root user
RUN useradd --system --uid 1001 clickhouse
USER clickhouse
# Basic liveness probe
HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD \
clickhouse-client --query "SELECT 1" || exit 1
🧰 Tools
🪛 Checkov (3.2.334)

[LOW] 1-1: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-1: Ensure that a user for the container has been created

(CKV_DOCKER_3)

🤖 Prompt for AI Agents
In assets/clickhouse/Dockerfile at line 1, the base image is pinned by digest
which is good, but the Dockerfile lacks a HEALTHCHECK instruction and does not
switch to a non-root user. To fix this, add a HEALTHCHECK command that
periodically verifies the ClickHouse server is responsive, and create or switch
to a non-root user before running the server to improve container security and
comply with best practices.

4 changes: 2 additions & 2 deletions assets/clickhouse/main.py
@@ -13,7 +13,7 @@
"""

CLICKHOUSE_COLLECTION_NAME = "clickhouse_bench"
class clickhouse_native_json_bench(Benchmark):
class clickhouse_bench(Benchmark):
# add any parameters to the tool here
def __init__(self, dataset, manual_column_names=True, keys=[], additional_order_by=[], timestamp_key=False):
super().__init__(dataset)
@@ -154,7 +154,7 @@ def run_applicable(self, dataset_name):


def main():
bench = clickhouse_native_json_bench(sys.argv[1])
bench = clickhouse_bench(sys.argv[1])
bench.run_everything()

if __name__ == "__main__":
4 changes: 2 additions & 2 deletions assets/clickhouse/methodology.md
@@ -1,6 +1,6 @@
# ClickHouse methodology

**Version: [23.3.1.2823](https://hub.docker.com/layers/clickhouse/clickhouse-server/23.3.1.2823/images/sha256-b88fd8c71b64d3158751337557ff089ff7b0d1ebf81d9c4c7aa1f0b37a31ee64?context=explore)**
**Version: [25.6](https://hub.docker.com/layers/clickhouse/clickhouse-server/25.6/images/sha256-77ff1f2054e27bec1b4c5eccbf701b63f8409831fea71f162ae2f25872dee1f4)**

**File with Formatted Queries:** [Config File](/assets/clickhouse/config.yaml)

@@ -67,4 +67,4 @@ On launch we execute ClickHouse’s entrypoint.sh to start ClickHouse in the bac
On shutdown, we simply send ClickHouse a stop command.

**Clearing Caches:**
Before starting any queries we ask ClickHouse to drop its UNCOMPRESSED CACHE and MARK CACHE. We also flush the file system buffers and clear the filesystem caches.
Before starting any queries we ask ClickHouse to drop its UNCOMPRESSED CACHE and MARK CACHE. We also flush the file system buffers and clear the filesystem caches.

🧹 Nitpick (assertive)

Minor wording nit: “ClickHouse’s provided” → “ClickHouse-provided”

“ClickHouse’s provided docker images” → “ClickHouse-provided Docker images” for clarity.
Very small nit—feel free to ignore.

🤖 Prompt for AI Agents
In assets/clickhouse/methodology.md at line 70, change the phrase "ClickHouse’s
provided docker images" to "ClickHouse-provided Docker images" to correct the
possessive typo and improve clarity. Also capitalize "Docker" as it is a proper
noun.

4 changes: 2 additions & 2 deletions assets/clp/main.py
@@ -5,7 +5,7 @@

CLP_OUT_PATH = f"{WORK_DIR}/archives"
CLP_S_BINARY = "/clp/clp-s"
class clp_s_bench(Benchmark):
class clp_bench(Benchmark):

🧹 Nitpick (assertive)

Rename class to PascalCase to respect PEP-8

PEP-8 prescribes CapWords for class names. Renaming clp_bench → ClpBench will align this file with the rest of the codebase and with other renamed benchmark classes.

-class clp_bench(Benchmark):
+class ClpBench(Benchmark):

You will also need to adjust the instantiation on Line 64 and any external imports.

📝 Committable suggestion


Suggested change
class clp_bench(Benchmark):
class ClpBench(Benchmark):
🤖 Prompt for AI Agents
In assets/clp/main.py at line 8, rename the class from clp_bench to ClpBench to
follow PEP-8 PascalCase naming conventions for classes. Also update any
instantiations of this class around line 64 and adjust any external imports or
references to use the new class name ClpBench.

def __init__(self, dataset, target_encoded_size=268435456):
super().__init__(dataset)

@@ -61,7 +61,7 @@ def terminate_procs(self):


def main():
bench = clp_s_bench(sys.argv[1])
bench = clp_bench(sys.argv[1])

🧹 Nitpick (assertive)

Instantiation must match updated class name

If you adopt the PascalCase rename, remember to update the constructor here:

-    bench = clp_bench(sys.argv[1])
+    bench = ClpBench(sys.argv[1])
📝 Committable suggestion


Suggested change
bench = clp_bench(sys.argv[1])
bench = ClpBench(sys.argv[1])
🤖 Prompt for AI Agents
In assets/clp/main.py at line 64, the instantiation of the class uses the old
class name clp_bench. Update this line to use the new PascalCase class name by
replacing clp_bench with ClpBench to match the renamed class constructor.

bench.run_everything()

if __name__ == "__main__":
10 changes: 5 additions & 5 deletions assets/elasticsearch/Dockerfile
@@ -1,4 +1,4 @@
FROM ghcr.io/y-scope/clp/clp-core-dependencies-x86-ubuntu-jammy:main
from ubuntu:jammy

RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y \
@@ -8,17 +8,17 @@ RUN apt-get update \
RUN curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch \
| gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg \
&& echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg]" \
"https://artifacts.elastic.co/packages/8.x/apt stable main" \
| tee /etc/apt/sources.list.d/elastic-8.x.list;
"https://artifacts.elastic.co/packages/9.x/apt stable main" \
| tee /etc/apt/sources.list.d/elastic-9.x.list;

RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y \
elasticsearch=8.6.2 \
elasticsearch=9.0.3 \
libcurl4 \
libcurl4-openssl-dev \
python3-pip \
python3-venv \
tmux \
vim;

RUN pip3 install elasticsearch==8.6.2 requests;
RUN pip3 install elasticsearch==9.0.3 requests;
39 changes: 39 additions & 0 deletions assets/elasticsearch/ingest.py
@@ -10,6 +10,14 @@

log_path = sys.argv[1]

def pop_by_path(obj, path):
keys = path.split('.')
for key in keys[:-1]:
obj = obj[key]
out = obj[keys[-1]]
del obj[keys[-1]]
return out
Comment on lines +13 to +19

🧹 Nitpick (assertive)

Optional: Add type annotations for better code documentation

Consider adding type annotations to improve code readability and enable better IDE support.

-def pop_by_path(obj, path):
+def pop_by_path(obj: dict, path: str) -> any:
📝 Committable suggestion


Suggested change
def pop_by_path(obj, path):
keys = path.split('.')
for key in keys[:-1]:
obj = obj[key]
out = obj[keys[-1]]
del obj[keys[-1]]
return out
def pop_by_path(obj: dict, path: str) -> any:
keys = path.split('.')
for key in keys[:-1]:
obj = obj[key]
out = obj[keys[-1]]
del obj[keys[-1]]
return out
🧰 Tools
🪛 Ruff (0.12.2)

13-13: Missing return type annotation for public function pop_by_path

(ANN201)


13-13: Missing type annotation for function argument obj

(ANN001)


13-13: Missing type annotation for function argument path

(ANN001)

🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 13 to 19, the function
pop_by_path lacks type annotations. Add appropriate type hints for the function
parameters and return type, such as specifying that obj is a dictionary with
string keys and any type of values, path is a string, and the return type is
any, to improve code readability and IDE support.
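A fully annotated sketch of the helper (note that `typing.Any`, not the builtin `any` shown in the suggestion, is the conventional return annotation):

```python
from typing import Any


def pop_by_path(obj: dict[str, Any], path: str) -> Any:
    """Remove and return the value at a dot-separated path, e.g. "attr.ts"."""
    keys = path.split(".")
    for key in keys[:-1]:
        obj = obj[key]
    # dict.pop combines the lookup and deletion done separately in the original
    return obj.pop(keys[-1])


record = {"attr": {"ts": "2024-01-01", "msg": "ok"}}
value = pop_by_path(record, "attr.ts")
# value is "2024-01-01" and record["attr"] now only holds "msg"
```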


def traverse_data(collection_name):
with open(log_path, encoding="utf-8") as f:
for line in f:
@@ -55,15 +63,46 @@ def traverse_data(collection_name):
id_value = str(attr["query"]["_id"])
json_line["attr"]["query"]["_id"] = {}
json_line["attr"]["query"]["_id"]["_ooid"] = id_value

try:
timestamp_val = pop_by_path(json_line, sys.argv[2])
json_line["@timestamp"] = timestamp_val
except KeyError:
# no such timestamp, ignore
json_line["@timestamp"] = 0

yield {
"_index": collection_name,
"_op_type": "create",
"_source": json_line,
}


def ingest_dataset():
es = Elasticsearch("http://localhost:9202", request_timeout=1200, retry_on_timeout=True)

if sys.argv[3] != "no_logsdb":
template_body = {
"index_patterns": [collection_name],
"template": {
"settings": {
"index": {
"mode": "logsdb"
},
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date",
"format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
}
}
}
},
"priority": 101
}
es.indices.put_index_template(name=collection_name, body=template_body)
Comment on lines +84 to +104

🧹 Nitpick (assertive)

LGTM! Consider minor formatting improvements.

The index template logic is well-implemented with comprehensive date format support for the @timestamp field. The conditional creation based on the logsdb flag is appropriate.

Consider adding trailing commas for better diff readability:

                 "settings": {
                     "index": {
-                        "mode": "logsdb"
+                        "mode": "logsdb",
                     },
                 },
                 "mappings": {
                     "properties": {
                         "@timestamp": {
                             "type": "date",
-                            "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
-                            }
+                            "format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz",
+                            },
                     }
-                }
+                },
             },
-            "priority": 101
+            "priority": 101,
📝 Committable suggestion


Suggested change
if sys.argv[3] != "no_logsdb":
template_body = {
"index_patterns": [collection_name],
"template": {
"settings": {
"index": {
"mode": "logsdb"
},
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date",
"format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz"
}
}
}
},
"priority": 101
}
es.indices.put_index_template(name=collection_name, body=template_body)
if sys.argv[3] != "no_logsdb":
template_body = {
"index_patterns": [collection_name],
"template": {
"settings": {
"index": {
"mode": "logsdb",
},
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date",
"format": "date_optional_time||epoch_second||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS zzz",
},
},
},
},
"priority": 101,
}
es.indices.put_index_template(name=collection_name, body=template_body)
🧰 Tools
🪛 Ruff (0.12.2)

90-90: Trailing comma missing

Add trailing comma

(COM812)


97-97: Trailing comma missing

Add trailing comma

(COM812)


98-98: Trailing comma missing

Add trailing comma

(COM812)


99-99: Trailing comma missing

Add trailing comma

(COM812)


100-100: Trailing comma missing

Add trailing comma

(COM812)


102-102: Trailing comma missing

Add trailing comma

(COM812)

🤖 Prompt for AI Agents
In assets/elasticsearch/ingest.py around lines 84 to 104, add trailing commas
after the last elements in dictionaries and lists within the template_body
definition. This includes after the "mode": "logsdb" setting, the "@timestamp"
mapping dictionary, and the overall dictionary entries. Adding these trailing
commas improves diff readability and consistency in formatting.


count = 0
for success, info in streaming_bulk(
es,
14 changes: 12 additions & 2 deletions assets/elasticsearch/main.py
@@ -13,9 +13,14 @@

class elasticsearch_bench(Benchmark):
# add any parameters to the tool here
def __init__(self, dataset):
def __init__(self, dataset, logsdb=True):
super().__init__(dataset)

self.logsdb = logsdb

if not logsdb:
self.properties["notes"] = "no logsdb"
Comment on lines +16 to +22

🧹 Nitpick (assertive)

LGTM! Consider minor improvements for code quality.

The logsdb parameter addition is well-implemented with appropriate default value and property tracking.

Consider addressing the static analysis suggestions:

-    def __init__(self, dataset, logsdb=True):
+    def __init__(self, dataset: str, *, logsdb: bool = True) -> None:

Using keyword-only argument (*) prevents the boolean parameter pitfall flagged by static analysis.

📝 Committable suggestion


Suggested change
def __init__(self, dataset, logsdb=True):
super().__init__(dataset)
self.logsdb = logsdb
if not logsdb:
self.properties["notes"] = "no logsdb"
def __init__(self, dataset: str, *, logsdb: bool = True) -> None:
super().__init__(dataset)
self.logsdb = logsdb
if not logsdb:
self.properties["notes"] = "no logsdb"
🧰 Tools
🪛 Ruff (0.12.2)

16-16: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


16-16: Missing type annotation for function argument dataset

(ANN001)


16-16: Boolean default positional argument in function definition

(FBT002)


16-16: Missing type annotation for function argument logsdb

(ANN001)

🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 16 to 22, the logsdb parameter
should be made keyword-only to avoid boolean parameter pitfalls flagged by
static analysis. Modify the __init__ method signature to include a * before
logsdb, making logsdb a keyword-only argument. This improves code clarity and
prevents accidental positional argument misuse.
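A self-contained sketch of the keyword-only pattern (class body trimmed to the relevant lines; names are illustrative):

```python
class ElasticsearchBench:
    # The bare "*" makes logsdb keyword-only, so callers cannot pass a bare
    # positional boolean like ElasticsearchBench("ds", False).
    def __init__(self, dataset: str, *, logsdb: bool = True) -> None:
        self.dataset = dataset
        self.logsdb = logsdb
        self.properties: dict[str, str] = {}
        if not logsdb:
            self.properties["notes"] = "no logsdb"


bench = ElasticsearchBench("web_logs", logsdb=False)
```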


@property
def compressed_size(self):
"""
@@ -35,8 +40,13 @@ def ingest(self):
"""
Ingests the dataset at self.datasets_path
"""
if self.logsdb:
logsdb = "anything"
else:
logsdb = "no_logsdb"

self.docker_execute([
f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path}"
f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
])
Comment on lines +43 to 50

🧹 Nitpick (assertive)

LGTM! Consider simplifying the conditional logic.

The updated ingest method correctly passes the required arguments to the ingestion script, including the new logsdb flag and timestamp path.

Consider simplifying the conditional logic and adding trailing comma:

-        if self.logsdb:
-            logsdb = "anything"
-        else:
-            logsdb = "no_logsdb"
+        logsdb = "anything" if self.logsdb else "no_logsdb"

         self.docker_execute([
-            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
+            f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
             ])
📝 Committable suggestion


Suggested change
if self.logsdb:
logsdb = "anything"
else:
logsdb = "no_logsdb"
self.docker_execute([
f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path}"
f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}"
])
logsdb = "anything" if self.logsdb else "no_logsdb"
self.docker_execute([
f"python3 {ASSETS_DIR}/ingest.py {self.datasets_path} {self.dataset_meta['timestamp']} {logsdb}",
])
🧰 Tools
🪛 Ruff (0.12.2)

43-46: Use ternary operator logsdb = "anything" if self.logsdb else "no_logsdb" instead of if-else-block

Replace if-else-block with logsdb = "anything" if self.logsdb else "no_logsdb"

(SIM108)


49-49: Trailing comma missing

Add trailing comma

(COM812)

🤖 Prompt for AI Agents
In assets/elasticsearch/main.py around lines 43 to 50, simplify the conditional
assignment of logsdb by using a single line conditional expression instead of an
if-else block. Also, add a trailing comma to the list passed to
self.docker_execute for consistency and style.
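The suggested ternary collapses the four-line if/else into one expression; a minimal sketch (helper name is hypothetical):

```python
def logsdb_flag(use_logsdb: bool) -> str:
    # Equivalent to the if/else block flagged by Ruff SIM108
    return "anything" if use_logsdb else "no_logsdb"
```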


def search(self, query):
4 changes: 3 additions & 1 deletion assets/elasticsearch/methodology.md
@@ -1,6 +1,6 @@
# Elasticsearch methodology

**Version:** [8.6.2](https://www.elastic.co/downloads/past-releases/elasticsearch-8-6-2)
**Version:** [9.0.3](https://www.elastic.co/downloads/past-releases/elasticsearch-9-0-3)

**File with Formatted Queries:** [Config File](/assets/elasticsearch/config.yaml)

@@ -16,6 +16,8 @@ We disable the [xpack](https://www.elastic.co/guide/en/elasticsearch/reference/c

Some preprocessing is necessary to make the dataset searchable in Elasticsearch. For more details, refer to the `traverse_data` function in [ingest.py](/assets/elasticsearch/ingest.py). This process generally involves reorganizing specific fields, moving them into outer or inner objects to ensure proper query functionality.

We use the [logs data stream](https://www.elastic.co/docs/manage-data/data-store/data-streams/logs-data-stream), which is optimized for timestamped logs.

### Launch & Shutdown
On launch the benchmark framework calls the [launch.sh](/assets/elasticsearch/launch.sh) script. This script automates the configuration of an Elasticsearch instance by modifying its settings to change the HTTP port, disable security features, and ensure it runs in single-node mode. It also updates the `elasticsearch` user settings to allow login and starts the Elasticsearch service in the background.

4 changes: 2 additions & 2 deletions assets/gzip/Dockerfile
@@ -2,11 +2,11 @@
# dependencies

# If there is any dedicated image available, you should build the benchmarking image on top of that
FROM ghcr.io/y-scope/clp/clp-core-dependencies-x86-ubuntu-jammy:main
FROM ubuntu:jammy

# Install necessary packages
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-downgrades \
tmux \
vim \
gzip
6 changes: 3 additions & 3 deletions assets/gzip/main.py
@@ -37,23 +37,23 @@ def ingest(self):
"""
Ingests the dataset at self.datasets_path
"""
self.docker_execute(f"gzip -c {self.datasets_path} > {GZIP_FILE_PATH} ")
self.docker_execute(f"gzip -c {self.datasets_path} > {GZIP_FILE_PATH} ", shell=True)

def search(self, query):
"""
Searches an already-ingested dataset with query, which is populated within config.yaml
"""
assert query == "notasearch"

self.docker_execute(f"gunzip -c {GZIP_FILE_PATH} > {DECOMPRESSED_FILE_PATH}")
self.docker_execute(f"gunzip -c {GZIP_FILE_PATH} > {DECOMPRESSED_FILE_PATH}", shell=True)
return 0
def bench_search(self, cold=True):
if not cold:
logger.info("hot run and cold run are the same (decompression), skipping")
return

self.bench_start(ingest=False)
self.docker_execute(f"gunzip -c {GZIP_FILE_PATH} > {DECOMPRESSED_FILE_PATH}")
self.docker_execute(f"gunzip -c {GZIP_FILE_PATH} > {DECOMPRESSED_FILE_PATH}", shell=True)
self.bench_stop()

logger.info("Decompression done")
Expand Down
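The `shell=True` additions above matter because `>` redirection is a shell feature, not an argument gzip understands. A minimal sketch of the underlying behavior (assuming `docker_execute` ultimately wraps `subprocess`, and that a `gzip` binary is on PATH):

```python
import gzip
import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.txt")
    dst = os.path.join(tmp, "data.txt.gz")
    with open(src, "w") as f:
        f.write("hello logs\n")

    # Without shell=True, ">" would be passed to gzip as a literal argument.
    subprocess.run(f"gzip -c {src} > {dst}", shell=True, check=True)

    with gzip.open(dst, "rt") as f:
        restored = f.read()
```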
2 changes: 1 addition & 1 deletion assets/overhead_test/Dockerfile
@@ -2,7 +2,7 @@
# dependencies

# If there is any dedicated image available, you should build the benchmarking image on top of that
FROM ghcr.io/y-scope/clp/clp-core-dependencies-x86-ubuntu-jammy:main
FROM ubuntu:jammy

💡 Verification agent

🧩 Analysis chain

Confirm missing tool-chain & clean up apt cache after the base-image switch

Switching from the bespoke clp-core-dependencies image to plain ubuntu:jammy drops a lot of build/run deps that the earlier image pre-installed (e.g., build-essential, python3, curl, libraries for CLP, etc.). Please double-check that all binaries used by the overhead-test still exist at runtime, and add the required packages explicitly.

While touching the layer, it is also best practice to clean up the apt cache to keep the image small:

 RUN apt-get update \
-    && DEBIAN_FRONTEND=noninteractive apt-get install -y \
-    tmux \
-    vim
+    && DEBIAN_FRONTEND=noninteractive apt-get install -y \
+       tmux \
+       vim \
+    && rm -rf /var/lib/apt/lists/*

🏁 Script executed:

#!/bin/bash
set -eux

# Locate the overhead-test Dockerfile
dockerfile_path=$(fd "overhead_test/Dockerfile" -t f)
echo "Found Dockerfile at: $dockerfile_path"

# Show the first 50 lines to inspect the base image and install steps
sed -n '1,50p' "$dockerfile_path"

Length of output: 180


🏁 Script executed:

#!/bin/bash
set -eux

# Locate the Dockerfile under assets/overhead_test
dockerfile_path=$(find . -type f -path "*/assets/overhead_test/Dockerfile" | head -n1)
echo "Found Dockerfile at: $dockerfile_path"

# Print the first 50 lines to inspect base image and install steps
sed -n '1,50p' "$dockerfile_path"

Length of output: 661


🏁 Script executed:

#!/bin/bash
set -eux

# List all files under assets/overhead_test
echo "=== Files in assets/overhead_test ==="
find assets/overhead_test -type f

# Search for common toolchain and runtime dependencies
echo "=== Searching for Python scripts ==="
rg --no-heading --color never '#!.*python' assets/overhead_test || true
rg --no-heading --color never '\.py' assets/overhead_test || true

echo "=== Searching for curl usage ==="
rg --no-heading --color never 'curl' assets/overhead_test || true

echo "=== Searching for gcc/g++ usage ==="
rg --no-heading --color never '\bgcc\b|\bg\+\+\b' assets/overhead_test || true

echo "=== Searching for make usage ==="
rg --no-heading --color never '\bmake\b' assets/overhead_test || true

Length of output: 1009


Ensure Python3 is installed and clean up apt cache

The main.py script uses #!/usr/bin/env python3, so python3 must be explicitly installed on the plain ubuntu:jammy base. While you’re touching this layer, it’s also best practice to remove the apt cache to keep the image small.

• File: assets/overhead_test/Dockerfile
• Around line 5

Suggested diff:

 RUN apt-get update \
-    && DEBIAN_FRONTEND=noninteractive apt-get install -y \
-       tmux \
-       vim
+    && DEBIAN_FRONTEND=noninteractive apt-get install -y \
+       tmux \
+       vim \
+       python3 \
+    && rm -rf /var/lib/apt/lists/*

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In assets/overhead_test/Dockerfile at line 5, the base image ubuntu:jammy does
not include python3, which is required by main.py. Modify the Dockerfile to
install python3 explicitly using apt-get, and after installation, clean up the
apt cache by removing /var/lib/apt/lists to reduce image size.


# Install necessary packages
RUN apt-get update \
4 changes: 2 additions & 2 deletions assets/presto_clp/main.py
@@ -16,7 +16,7 @@
CLP_PRESTO_HOST_STORAGE = os.path.abspath(os.path.expanduser("~/clp-json-x86_64-v0.4.0-dev"))
SQL_PASSWORD = "wqEGPyBdx_w"
HOST_IP = "127.0.0.1"
class clp_presto_bench(Benchmark):
class presto_clp_bench(Benchmark):
# add any parameters to the tool here
def __init__(self, dataset, dataset_variation='cleaned_log'):
super().__init__(dataset, dataset_variation=dataset_variation)
@@ -102,7 +102,7 @@ def terminate(self):


def main():
bench = clp_presto_bench(sys.argv[1])
bench = presto_clp_bench(sys.argv[1])
bench.run_everything()

if __name__ == "__main__":
20 changes: 13 additions & 7 deletions assets/presto_parquet/Dockerfile
@@ -1,4 +1,4 @@
FROM presto/prestissimo-dependency:ubuntu-22.04
FROM presto/prestissimo-dependency:ubuntu-22.04-presto-0.293

# Install necessary packages
RUN apt-get update;
@@ -35,22 +35,28 @@ RUN cd /usr/local && sh -c "$(curl --location https://taskfile.dev/install.sh)"
RUN pip3 install pyarrow;

WORKDIR /home
RUN git clone https://github.com/anlowee/presto.git;
RUN git clone https://github.com/y-scope/presto.git;
WORKDIR /home/presto
RUN git checkout xwei/benchmark-test;
RUN git checkout 89ce0f3b4ec713d658f3544e75aeb92fbd3a397d;
WORKDIR /home/presto/presto-native-execution
RUN mkdir build;
RUN rm -rf velox;
RUN git clone https://github.com/anlowee/velox.git;
RUN git clone https://github.com/y-scope/velox.git;
WORKDIR /home/presto/presto-native-execution/velox
RUN git checkout xwei/benchmark-test;
RUN git checkout 27ee4bcaad449fd1c8b90c48787f4eaf8e92395f;

WORKDIR /home/presto
RUN ./mvnw clean install -DskipTests -pl -presto-docs -T1C;

WORKDIR /home/presto/presto-native-execution/build
RUN cmake .. && make -j$(nproc) presto_server;
WORKDIR /home/presto/presto-native-execution
# RUN cmake .. && make -j$(nproc) presto_server;
RUN NUM_THREADS=8 make release;

Comment on lines +51 to 54

🧹 Nitpick (assertive)

Dead commented build step: please delete or document

The # RUN cmake .. && make … presto_server; line is now obsolete after switching to make release. Leaving it commented invites drift and confuses readers.

Either remove it or add a comment explaining why the alternative remains for archaeology.

🤖 Prompt for AI Agents
In assets/presto_parquet/Dockerfile around lines 51 to 54, the commented out
build step using cmake and make is obsolete and may confuse readers. Remove the
commented line entirely or add a clear comment explaining why it is kept for
historical reference, ensuring the Dockerfile remains clean and understandable.

ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH=$JAVA_HOME/bin:$PATH

RUN apt-get update;

RUN apt-get install -y \
mariadb-client \
net-tools;
Comment on lines +58 to +62

🧹 Nitpick (assertive)

Second apt-get layer leaks cache and misses version pinning

apt-get update/install is repeated, but the cleanup flags used earlier are missing:

  1. /var/lib/apt/lists isn’t purged → larger image.
  2. Package versions are unpinned (mariadb-client, net-tools) → non-deterministic rebuilds.
-RUN apt-get update;
-RUN apt-get install -y \
-    mariadb-client \
-    net-tools;
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        mariadb-client=1:10.6.* \
+        net-tools=1.60* && \
+    rm -rf /var/lib/apt/lists/*

Keeps layers reproducible and lean.

📝 Committable suggestion


Suggested change
RUN apt-get update;
RUN apt-get install -y \
mariadb-client \
net-tools;
RUN apt-get update && \
apt-get install -y --no-install-recommends \
mariadb-client=1:10.6.* \
net-tools=1.60* && \
rm -rf /var/lib/apt/lists/*
🧰 Tools
🪛 Checkov (3.2.334)

[LOW] 1-62: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)


[LOW] 1-62: Ensure that a user for the container has been created

(CKV_DOCKER_3)

🪛 Hadolint (2.12.0)

[info] 58-58: Delete the apt-get lists after installing something

(DL3009)


[warning] 60-60: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)


[info] 60-60: Avoid additional packages by specifying --no-install-recommends

(DL3015)


[info] 60-60: Multiple consecutive RUN instructions. Consider consolidation.

(DL3059)

🤖 Prompt for AI Agents
In assets/presto_parquet/Dockerfile around lines 58 to 62, the apt-get update
and install commands create separate layers without cleaning up apt cache,
causing larger image size and potential cache leaks. Combine the update and
install commands into a single RUN instruction, add cleanup of
/var/lib/apt/lists after installation, and pin the package versions explicitly
to ensure deterministic and lean builds.

@@ -1,5 +1,5 @@
 discovery.uri=http://127.0.0.1:8080
-presto.version=0.293-SNAPSHOT-faae543
+presto.version=0.293-89ce0f3

🧹 Nitpick (assertive)

Version string now diverges from directory naming used elsewhere

presto.version embeds the commit hash (0.293-89ce0f3), while the paths hard-coded in main.py (Lines 66 & 72) refer to presto-server-0.293 and presto-cli-0.293-executable.jar. This mismatch is harmless at runtime but will cause confusion when the next bump happens.

Recommend either:

  1. Derive the runtime paths from this property, or
  2. Amend the property to exactly match the artefact directory name.

This keeps observability tools and grep searches consistent.

🤖 Prompt for AI Agents
In assets/presto_parquet/include/etc_worker/config.properties at line 2, the
presto.version property includes a commit hash making it differ from the
directory names used in main.py lines 66 and 72. To fix this, either update
presto.version to only include the version number matching the directory names
(e.g., "0.293") or modify main.py to derive the directory paths dynamically from
the full presto.version property to ensure consistency and avoid confusion in
observability and searches.

http-server.http.port=7777
shutdown-onset-sec=1
register-test-functions=false
8 changes: 4 additions & 4 deletions assets/presto_parquet/main.py
@@ -63,20 +63,20 @@ def launch(self):
"""
Runs the benchmarked tool
"""
-self.docker_execute("bash -c \"python3 /home/presto/presto-server/target/presto-server-0.293-SNAPSHOT/bin/launcher.py run --etc-dir=/home/include/etc_coordinator\"", background=True)
+self.docker_execute("bash -c \"python3 /home/presto/presto-server/target/presto-server-0.293/bin/launcher.py run --etc-dir=/home/include/etc_coordinator\"", background=True)
 self.wait_for_port(8080)
-self.docker_execute("/home/presto/presto-native-execution/build/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
+self.docker_execute("/home/presto/presto-native-execution/_build/release/presto_cpp/main/presto_server --logtostderr=1 --etc_dir=/home/include/etc_worker", background=True)
 self.wait_for_port(7777)
 time.sleep(60) # this needs to be more than 10
Comment on lines +66 to 70

🛠️ Refactor suggestion

Hard-coded versioned paths make future bumps brittle

/home/presto/presto-server/target/presto-server-0.293/... is embedded directly in the command. The same string occurs in multiple places. Every future upgrade will require hunting for these literals.

-"/home/presto/presto-server/target/presto-server-0.293/bin/launcher.py"
+f"/home/presto/presto-server/target/presto-server-{self.properties['presto_version']}/bin/launcher.py"

Populate presto_version once (e.g., from etc_worker/config.properties) and reuse it.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 66 to 70, the presto server
version "0.293" is hard-coded in multiple command strings, making future
upgrades error-prone. To fix this, read the presto_version once from
etc_worker/config.properties or a similar config file, store it in a variable,
and then use that variable to construct the versioned paths dynamically in the
docker_execute commands. This eliminates repeated literals and simplifies
version bumps.

Comment on lines +68 to 70

🧹 Nitpick (assertive)

Fixed startup sleep hides readiness issues

A fixed time.sleep(60) after launching the native engine masks startup failures and slows fast boots. Prefer an explicit health probe loop (e.g., curl v1/info or check the TCP port) with a timeout.

This will surface crashes early and shave minutes off CI when the service starts quickly.
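A minimal sketch of such a probe loop, polling the TCP port rather than a fixed sleep (the timeout and interval values here are illustrative, not tuned):

```python
import socket
import time


def wait_until_ready(host: str, port: int, timeout: float = 120.0, interval: float = 2.0) -> None:
    """Poll a TCP port until it accepts connections or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return  # port is open; the server is accepting connections
        except OSError:
            time.sleep(interval)  # not up yet; retry after a short pause
    raise TimeoutError(f"{host}:{port} not ready within {timeout:.0f}s")
```

Calling `wait_until_ready("127.0.0.1", 7777)` in place of `time.sleep(60)` returns as soon as the worker is up and raises promptly if it never comes up.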

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 68 to 70, replace the fixed
time.sleep(60) after starting the native engine with a loop that explicitly
checks service readiness by polling a health endpoint like v1/info or verifying
the TCP port is open, with a timeout to avoid indefinite waiting. This change
will detect startup failures promptly and reduce unnecessary delays during fast
startups.


 def hive_execute(self, query, check=True):
-    return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)
+    return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)

 def ingest(self):
     """
     Ingests the dataset at self.datasets_path
     """
-    self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')
+    self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')

Comment on lines 72 to 80

🧹 Nitpick (assertive)

Add type annotations + avoid positional boolean default

Static analysis (Ruff ANN/F-BT) flags hive_execute/ingest. Low-cost improvement:

-    def hive_execute(self, query, check=True):
+    def hive_execute(self, query: str, check: bool = True) -> str:
...
-    def ingest(self):
+    def ingest(self) -> None:

Improves IDE autocomplete and enforces the public API.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def hive_execute(self, query, check=True):
return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)
return self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"', check)
def ingest(self):
"""
Ingests the dataset at self.datasets_path
"""
self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-SNAPSHOT-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')
self.docker_execute(f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar --catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"')
def hive_execute(self, query: str, check: bool = True) -> str:
return self.docker_execute(
f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute "{query}"',
check,
)
def ingest(self) -> None:
"""
Ingests the dataset at self.datasets_path
"""
self.docker_execute(
f'/home/presto/presto-cli/target/presto-cli-0.293-executable.jar '
f'--catalog hive --schema {PARQUET_SCHEMA_NAME} --execute '
f'"CREATE SCHEMA IF NOT EXISTS hive.{PARQUET_SCHEMA_NAME};"',
)
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Missing return type annotation for public function hive_execute

(ANN201)


72-72: Missing type annotation for function argument query

(ANN001)


72-72: Boolean default positional argument in function definition

(FBT002)


72-72: Missing type annotation for function argument check

(ANN001)


75-75: Missing return type annotation for public function ingest

Add return type annotation: None

(ANN201)

🤖 Prompt for AI Agents
In assets/presto_parquet/main.py around lines 72 to 80, the methods hive_execute
and ingest lack type annotations and use a positional boolean default parameter.
Add explicit type annotations for the method parameters and return types, and
change the boolean parameter in hive_execute to a keyword-only argument with a
default value to avoid positional boolean defaults. This will improve static
analysis, IDE autocomplete, and clarify the public API.

if self.pairwise_arrays:
self.hive_execute(f""" \
2 changes: 1 addition & 1 deletion assets/template/Dockerfile
@@ -2,7 +2,7 @@
# dependencies

# If there is any dedicated image available, you should build the benchmarking image on top of that
-FROM ghcr.io/y-scope/clp/clp-core-dependencies-x86-ubuntu-jammy:main
+FROM ubuntu:jammy

# Install necessary packages
RUN apt-get update \