Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
3a84840
Init
anlowee Jul 27, 2025
9574059
Add new files to yaml linting; Fix yamllint violations.
kirkrodrigues Jul 27, 2025
8dd8795
Remove unnecessarily blank lines.
kirkrodrigues Jul 27, 2025
a8af40c
Remove unnecesary script.
kirkrodrigues Jul 27, 2025
2015a4a
Replace demo-assets/init.sh and demo CLP config file with more robust…
kirkrodrigues Jul 27, 2025
b78682f
Add missing return on error.
kirkrodrigues Jul 27, 2025
1247bba
Apply shell linters to worker/scripts/generate-configs.sh
kirkrodrigues Jul 28, 2025
1400a90
Use Docker compose to wait for the coordinator to be ready.
kirkrodrigues Jul 28, 2025
91040d1
Use jq to parse Presto version info.
kirkrodrigues Jul 28, 2025
9ec648c
Clean-up wget command.
kirkrodrigues Jul 28, 2025
4097478
Use /usr/bin/env.
kirkrodrigues Jul 28, 2025
b1ce135
Use function to update kv-pairs in config file. Set kv-pairs if they …
kirkrodrigues Jul 28, 2025
70a56d7
Move getting Presto coordinator version into a function.
kirkrodrigues Jul 28, 2025
99ac4e1
Minor edits for consistency.
kirkrodrigues Jul 28, 2025
6dde297
fix: Set error policies.
kirkrodrigues Jul 28, 2025
b22746d
Mark constants readonly.
kirkrodrigues Jul 28, 2025
708756d
Clean-up comments.
kirkrodrigues Jul 28, 2025
2eeda5c
Quote paths.
kirkrodrigues Jul 28, 2025
8bb98f8
Clean-up presto-clp/coordinator/scripts/generate-configs.sh.
kirkrodrigues Jul 28, 2025
9fc487d
Remove deprecated version property.
kirkrodrigues Jul 28, 2025
64e8edf
Alphabetize mounts.
kirkrodrigues Jul 28, 2025
00bd3f1
Rename environment variables for clarity.
kirkrodrigues Jul 28, 2025
28f5b70
fix: Remove spurious equals sign.
kirkrodrigues Jul 28, 2025
cccf9fe
Lint set-up-config.sh.
kirkrodrigues Jul 28, 2025
04f3d2a
Add docs and remove README.
kirkrodrigues Jul 28, 2025
3381eb8
Set coordinator log level to INFO.
kirkrodrigues Jul 28, 2025
bea5bb6
Validate CLP metadata database type.
kirkrodrigues Jul 28, 2025
1f68965
Use logging function rather than echos.
kirkrodrigues Jul 28, 2025
94b0210
Reorder functions.
kirkrodrigues Jul 28, 2025
22c53e0
Add new docs to index.
kirkrodrigues Jul 28, 2025
2f0de28
Add S3 limitation.
kirkrodrigues Jul 28, 2025
d621bf3
Add clone step to docs.
kirkrodrigues Jul 28, 2025
7300d20
Add required CLP version to docs.
kirkrodrigues Jul 28, 2025
4bc2f58
Rename PRESTO_WORKER_HTTPPORT.
kirkrodrigues Jul 28, 2025
80c41d4
Remove unnecessary quotes from env var files.
kirkrodrigues Jul 28, 2025
9d2146b
Address some rabbit feedback.
kirkrodrigues Jul 28, 2025
ff06c62
Address the nested field limitation comment
anlowee Jul 28, 2025
cf11694
Add metadata filter config
anlowee Jul 28, 2025
2618f13
Merge branch 'main' into xwei/yscope-comose
anlowee Jul 28, 2025
782b87d
Docs edits.
kirkrodrigues Jul 28, 2025
8107518
Add error checking for config files not existing.
kirkrodrigues Jul 28, 2025
4d896e5
More docs edits.
kirkrodrigues Jul 28, 2025
713670a
Remove extra spaces.
kirkrodrigues Jul 28, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions tools/deployment/presto-clp/.env
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Coordinator common
PRESTO_COORDINATOR_HTTPPORT="8080"
PRESTO_COORDINATOR_SERVICENAME="presto-coordinator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Quoted values in .env are passed verbatim – will include the quotes

Docker Compose does not strip " in .env, so PRESTO_COORDINATOR_HTTPPORT will become "8080" (quotes included).
That breaks integer expectations in config.properties and JDBC URLs.
Strip the quotes or switch to YAML-style environment: blocks.

-PRESTO_COORDINATOR_HTTPPORT="8080"
-PRESTO_COORDINATOR_SERVICENAME="presto-coordinator"
+PRESTO_COORDINATOR_HTTPPORT=8080
+PRESTO_COORDINATOR_SERVICENAME=presto-coordinator
@@
-PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_PROVIDERTYPE="mysql"
-
+PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_PROVIDERTYPE=mysql
+

(Apply similarly to the remaining keys.)

Also applies to: 6-33

🧰 Tools
🪛 dotenv-linter (3.3.0)

[warning] 2-2: [QuoteCharacter] The value has quote characters (', ")


[warning] 3-3: [QuoteCharacter] The value has quote characters (', ")

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/.env lines 2 to 33, the environment variable
values are enclosed in quotes, causing the quotes to be included literally in
the values. Remove the surrounding double quotes from all variable assignments
in this range to ensure the values are interpreted correctly by Docker Compose
and downstream configurations.


# Coordinator clp.properties
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_PROVIDERTYPE="mysql"
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL="jdbc:mysql://localhost:6001"
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME="clp-db"
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER="clp-user"
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD="123456"
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_TABLEPREFIX="clp_"
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_SPLITPROVIDER="mysql"

# Coordinator config.properties
PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORY="1GB"
PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORYPERNODE="1GB"

# Coordinator jvm.config
PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE="4G"
PRESTO_COORDINATOR_CONFIG_JVMCONFIG_G1HEAPREGIONSIZE="32M"

# Coordinator log.properties
PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL="DEBUG"

# Coordinator node.properties
PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT="production"

# Worker common
PRESTO_WORKER_HTTPPORT="8080"

# Worker node.properties
PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION="worker-location"

# CLP package archives
CLP_PACKAGE_ARCHIVES=REPLACE_ME
76 changes: 76 additions & 0 deletions tools/deployment/presto-clp/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# Setup local docker stack for presto + clp

## Install docker

Follow the guide here: [docker]

# Launch clp-package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Convert secondary titles to H2 and fix punctuation for Markdown lint-cleanliness

The document currently contains five #-level headings (Setup…, Launch…, Create Docker Cluster, Use cli: and Delete docker Cluster).
Markdown-lint (MD025/MD026) flags this as multiple H1s and trailing punctuation. Keeping a single top-level H1 improves SEO and stops most linters from failing CI.

-# Launch clp-package
+# ## Launch clp-package-# Create Docker Cluster
+# ## Create Docker Cluster-# Use cli:
+# ## Use CLI-# Delete docker Cluster
+# ## Delete Docker Cluster

Also applies to: 35-41, 48-57, 65-69

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

7-7: Multiple top-level headings in the same document

(MD025, single-title, single-h1)

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/README.md around lines 1 to 7 and also lines
35-41, 48-57, and 65-69, change all secondary titles currently using a single #
(H1) to use ## (H2) to ensure only one top-level H1 heading in the document.
Additionally, remove any trailing punctuation from these headings to comply with
markdown lint rules MD025 and MD026. This will improve SEO and prevent linter
failures in CI.


1. Find the clp-package for test on our official website [clp-json-v0.4.0]. Here is a sample dataset for demo testing: [postgresql dataset].

2. Untar the clp-package and the postgresql dataset.

3. Replace the content of `/path/to/clp-json-package/etc/clp-config.yml` with the output of `demo-assets/init.sh generate_sample_clp_config`.

4. Launch:

```bash
# You probably want to run a python 3.9 or newer virtual environment
sbin/start-clp.sh
```

5. Compress:

```bash
# You can also use your own dataset
sbin/compress.sh --timestamp-key 'timestamp' /path/to/postgresql.log
```

6. Use the following command to update `.env`:

```bash
demo-assets/init.sh update_metadata_config /path/to/clp-json-package
```

# Create Docker Cluster

Create a local docker stack:

```bash
docker compose up
```

To create a docker stack with more than 1 worker (e.g., 3 workers):
```
docker compose up --scale presto-worker=3
```

# Use cli:

After all containers are in "Started" states (check by `docker ps`):

```bash
# On your host
docker exec -it compose-presto-coordinator-1 sh

# In presto-coordinator container
/opt/presto-cli --catalog clp --schema default --server localhost:8080
```

Example query:
```sql
SELECT * FROM default LIMIT 1;
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Minor: qualify SQL fenced block for syntax highlighting

Add sql after the back-ticks.

-```sql
+```sql
 SELECT * FROM default LIMIT 1;

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.17.2)</summary>

61-61: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

In tools/deployment/presto-clp/README.md around lines 61 to 63, the SQL code
block is missing the language specifier for syntax highlighting. Add "sql"
immediately after the opening triple backticks to enable proper SQL syntax
highlighting for the code block.


</details>

<!-- fingerprinting:phantom:triton:cougar -->

<!-- This is an auto-generated comment by CodeRabbit -->


# Delete docker Cluster

```bash
docker compose down
```



[clp-json-v0.4.0]: https://github.com/y-scope/clp/releases/tag/v0.4.0
[docker]: https://docs.docker.com/engine/install
[postgresql dataset]: https://zenodo.org/records/10516402

Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
connector.name=clp
clp.metadata-provider-type=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_PROVIDERTYPE}
clp.metadata-db-url=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL}
clp.metadata-db-name=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME}
clp.metadata-db-user=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER}
clp.metadata-db-password=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD}
clp.metadata-table-prefix=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_TABLEPREFIX}
clp.split-provider-type=${PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_SPLITPROVIDER}
clp.metadata-filter-config=/opt/presto-server/etc/metadata-filter.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Several placeholders have no matching definitions – coordinator fails to start.

coordinator.env currently defines only METADATA_PROVIDERTYPE and SPLITPROVIDER.
The following variables are missing and will survive as literal ${…} after envsubst, causing Presto to error at startup:

  • PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL
  • …_DATABASE_NAME
  • …_DATABASE_USER
  • …_DATABASE_PASSWORD
  • …_METADATA_TABLEPREFIX

Add them to coordinator.env (or remove the placeholders) before merging.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/clp.properties lines
2 to 9, several environment variable placeholders used for database
configuration do not have corresponding definitions in coordinator.env, causing
the coordinator to fail at startup. To fix this, add the missing environment
variables PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL,
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME,
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER,
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD, and
PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_TABLEPREFIX to the
coordinator.env file with appropriate values, or alternatively remove these
placeholders from clp.properties if they are not needed.


Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=${PRESTO_COORDINATOR_HTTPPORT}
query.max-memory=${PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORY}
query.max-memory-per-node=${PRESTO_COORDINATOR_CONFIG_CONFIGPROPERTIES_QUERY_MAXMEMORYPERNODE}
discovery-server.enabled=true
discovery.uri=http://${PRESTO_COORDINATOR_SERVICENAME}:${PRESTO_COORDINATOR_HTTPPORT}
optimizer.optimize-hash-generation=false
regex-library=RE2J
use-alternative-function-signatures=true
inline-sql-functions=false
nested-data-serialization-enabled=false
native-execution-enabled=true

Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
-server
-Xmx${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE}
-XX:+UseG1GC
-XX:G1HeapRegionSize=${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_G1HEAPREGIONSIZE}
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Add container-aware memory flags & optional G1 tuning

-Xmx hard-codes the ceiling and ignores the cgroup limit the container actually gets. Presto’s launcher (and Java ≥ 11) support:

-XX:+UseContainerSupport
-XX:MaxRAMPercentage=80

These keep the heap bounded to ~80 % of the container limit without manual tuning. Consider replacing -Xmx… or combining it with these flags.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/jvm.config lines 1 to
10, the current JVM options use a fixed -Xmx value which does not respect
container memory limits. To fix this, remove or complement the -Xmx flag with
container-aware flags by adding -XX:+UseContainerSupport and
-XX:MaxRAMPercentage=80. This will allow the JVM heap size to adapt dynamically
to the container's memory limit, improving resource management without manual
tuning.

Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid quoting the log level value in ${…}

Docker .env files often wrap string values in quotes ("DEBUG"). When the template engine substitutes the value here, the resulting line becomes:

com.facebook.presto="DEBUG"

The extra quotes are not a valid log level and Presto will fall back to the default (INFO). Either:

-com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL}
+com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL:-INFO}

and remove surrounding quotes in the .env file, or strip them inside the script that performs substitution.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL}
com.facebook.presto=${PRESTO_COORDINATOR_CONFIG_LOGPROPERTIES_LEVEL:-INFO}
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/log.properties at
line 1, the log level value substituted from the environment variable may
include quotes, causing invalid log level syntax. To fix this, ensure that the
substitution process strips any surrounding quotes from the environment variable
value before inserting it into the config file, or update the .env file to
provide the log level without quotes.


Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{
}
Comment on lines +1 to +2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Provide a minimal example or document expected schema

An empty JSON object is syntactically valid, but future maintainers may be unsure what keys are supported. A commented exemplar or pointer to docs beside this file would improve clarity without affecting runtime.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/metadata-filter.json
at lines 1 to 2, the JSON file is currently empty, which may confuse future
maintainers about the expected keys. Add a minimal example JSON object with
typical keys and values or include comments or a reference to documentation
explaining the expected schema to improve clarity without impacting runtime
behavior.

Comment on lines +1 to +2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anlowee I think we should explain to the user how to configure this for the timestamp field in their logs, right? And also that it may need to be different for each dataset they compress.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we direct them to the related presto-doc section?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not published and is more general than they need, right? We should write a simplified section for them here.


Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
node.id=${PRESTO_COORDINATOR_SERVICENAME}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

node.data-dir missing & static node.id risks collisions

Presto will refuse to start without node.data-dir, and using a constant value for node.id means every fresh container (or an additional coordinator) will register with the same ID, leading to “Node already exists” errors.

 node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
-node.id=${PRESTO_COORDINATOR_SERVICENAME}
+# A unique ID per container – fall back to HOSTNAME if not supplied
+node.id=${PRESTO_COORDINATOR_NODE_ID:-${HOSTNAME}}
+# Mandatory data directory
+node.data-dir=/opt/presto-server/data
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
node.id=${PRESTO_COORDINATOR_SERVICENAME}
node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
# A unique ID per container – fall back to HOSTNAME if not supplied
node.id=${PRESTO_COORDINATOR_NODE_ID:-${HOSTNAME}}
# Mandatory data directory
node.data-dir=/opt/presto-server/data
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/node.properties at
lines 1 to 3, add the required property node.data-dir with an appropriate
environment variable or path to ensure Presto starts correctly. Also, replace
the static node.id value with a dynamic or unique identifier, such as an
environment variable or generated UUID, to prevent node ID collisions when
multiple coordinators or containers run simultaneously.

19 changes: 19 additions & 0 deletions tools/deployment/presto-clp/coordinator/scripts/$
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
#!/bin/sh

# Exit on error
set -e

PRESTO_CONFIG_DIR="/opt/presto-server/etc"

# Substitute environemnt variables in config template
find /configs -type f | while read -r f; do
( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
done

# Setup the config directory hierarchy
rm -rf ${PRESTO_CONFIG_DIR}/catalog
mkdir -p ${PRESTO_CONFIG_DIR}/catalog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider consistency with the other generate-configs.sh script.

This script uses rm -rf and mkdir -p for catalog directory management, while the other version uses rm -f. Consider standardizing the approach across both scripts.

Also fix the quoting issue:

-rm -rf ${PRESTO_CONFIG_DIR}/catalog
-mkdir -p ${PRESTO_CONFIG_DIR}/catalog
+rm -rf "${PRESTO_CONFIG_DIR}/catalog"
+mkdir -p "${PRESTO_CONFIG_DIR}/catalog"
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/scripts/$ around lines 14 to 15, the
script uses `rm -rf` to remove the catalog directory, which is inconsistent with
the other generate-configs.sh script that uses `rm -f`. Change the removal
command to use `rm -f` if only files are being removed or adjust both scripts to
use the same method for directory cleanup. Additionally, fix any quoting issues
by properly quoting variables like "${PRESTO_CONFIG_DIR}/catalog" to prevent
word splitting or globbing errors.


# Copy over files
mv ${PRESTO_CONFIG_DIR}/clp.properties ${PRESTO_CONFIG_DIR}/catalog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Fix shell quoting issue.

Variables should be quoted to prevent potential issues.

-mv ${PRESTO_CONFIG_DIR}/clp.properties ${PRESTO_CONFIG_DIR}/catalog
+mv "${PRESTO_CONFIG_DIR}/clp.properties" "${PRESTO_CONFIG_DIR}/catalog"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mv ${PRESTO_CONFIG_DIR}/clp.properties ${PRESTO_CONFIG_DIR}/catalog
mv "${PRESTO_CONFIG_DIR}/clp.properties" "${PRESTO_CONFIG_DIR}/catalog"
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/scripts/$ at line 18, the variable
PRESTO_CONFIG_DIR is not quoted in the mv command, which can cause issues if the
variable contains spaces or special characters. Fix this by adding double quotes
around the variable references in the mv command to ensure proper handling of
the path.


Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#!/bin/sh

# Exit on error
set -e

PRESTO_CONFIG_DIR="/opt/presto-server/etc"

# Substitute environemnt variables in config template
find /configs -type f | while read -r f; do
( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Security risk: Avoid shell evaluation for template processing.

The current approach using shell evaluation (( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh) poses a security risk if template files contain malicious shell code. Additionally, the unquoted variable $f could cause issues with filenames containing spaces.

Consider using envsubst for safer environment variable substitution:

-# Substitute environemnt variables in config template
-find /configs -type f | while read -r f; do
-  ( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
-done
+# Substitute environment variables in config template
+find /configs -type f | while read -r f; do
+  envsubst < "$f" > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
+done
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Substitute environemnt variables in config template
find /configs -type f | while read -r f; do
( echo "cat <<EOF"; cat $f; echo "EOF" ) | sh > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
done
# Substitute environment variables in config template
find /configs -type f | while read -r f; do
envsubst < "$f" > "${PRESTO_CONFIG_DIR}/$(basename "$f")"
done
🧰 Tools
🪛 Shellcheck (0.10.0)

[info] 10-10: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/scripts/generate-configs.sh around
lines 8 to 11, the script uses shell evaluation to substitute environment
variables in config templates, which is a security risk and mishandles filenames
with spaces. Replace the current shell evaluation approach with the use of
`envsubst` to safely substitute environment variables without executing
arbitrary shell code, and ensure all file path variables are properly quoted to
handle spaces correctly.


# Setup the config directory hierarchy
rm -f ${PRESTO_CONFIG_DIR}/catalog/*

# Copy over files
mv ${PRESTO_CONFIG_DIR}/clp.properties ${PRESTO_CONFIG_DIR}/catalog

39 changes: 39 additions & 0 deletions tools/deployment/presto-clp/demo-assets/clp-config.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
package:
storage_engine: "clp-s"
database:
type: "mariadb"
host: "${REPLACE_IP}"
port: 6001
name: "clp-db"
query_scheduler:
host: "${REPLACE_IP}"
port: 6002
jobs_poll_delay: 0.1
num_archives_to_search_per_sub_job: 16
logging_level: "INFO"
queue:
host: "${REPLACE_IP}"
port: 6003
redis:
host: "${REPLACE_IP}"
port: 6004
query_backend_database: 0
compression_backend_database: 1
reducer:
host: "${REPLACE_IP}"
base_port: 6100
logging_level: "INFO"
upsert_interval: 100
results_cache:
host: "${REPLACE_IP}"
port: 6005
db_name: "clp-query-results"
stream_collection_name: "stream-files"
webui:
host: "localhost"
port: 6000
logging_level: "INFO"
log_viewer_webui:
host: "localhost"
port: 6006

50 changes: 50 additions & 0 deletions tools/deployment/presto-clp/demo-assets/init.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
#!/usr/bin/env bash

SCRIPT_PATH="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

function generate_sample_clp_config {
local ip=$(hostname -i)
local file="${SCRIPT_PATH}/clp-config.yml"
cp "$file" "${file}.bak"
sed -i "s|\${REPLACE_IP}|$ip|g" "$file"
echo "Replaced \${REPLACE_IP} with $ip in $file"
}

function update_metadata_config {
if [[ $# -ne 1 ]]; then
echo "Usage: update_metadata_config </path/to/clp-package>"
return 1
fi

local clp_pkg_home=$1
local clp_config_path="$(readlink -f ${clp_pkg_home})/etc/clp-config.yml"
local credential_path="$(readlink -f ${clp_pkg_home})/etc/credentials.yml"
host=$(python3 -c 'import sys, yaml; print(yaml.load(sys.stdin)["database"]["host"])' < "$clp_config_path")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Quote variables forwarded to readlink to preserve paths with spaces

-local clp_pkg_home=$1
-local clp_config_path="$(readlink -f ${clp_pkg_home})/etc/clp-config.yml"
+local clp_pkg_home="$1"
+local clp_config_path="$(readlink -f "$clp_pkg_home")/etc/clp-config.yml"

Same adjustment applies to credential_path and the later readlink -f call.

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 20-20: Declare and assign separately to avoid masking return values.

(SC2155)


[info] 20-20: Double quote to prevent globbing and word splitting.

(SC2086)


[warning] 21-21: Declare and assign separately to avoid masking return values.

(SC2155)


[info] 21-21: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/demo-assets/init.sh around lines 19 to 22, the
variables passed to readlink are not quoted, which can cause issues with paths
containing spaces. Fix this by adding double quotes around the variable
expansions inside the readlink commands for clp_config_path, credential_path,
and any other readlink -f calls to ensure paths with spaces are handled
correctly.

port=$(python3 -c 'import sys, yaml; print(yaml.load(sys.stdin)["database"]["port"])' < "$clp_config_path")
name=$(python3 -c 'import sys, yaml; print(yaml.load(sys.stdin)["database"]["name"])' < "$clp_config_path")
user=$(python3 -c 'import sys, yaml; print(yaml.load(sys.stdin)["database"]["user"])' < "$credential_path")
password=$(python3 -c 'import sys, yaml; print(yaml.load(sys.stdin)["database"]["password"])' < "$credential_path")
echo "Metadata database host: $host"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Replace unsafe yaml.load with yaml.safe_load

yaml.load will execute arbitrary Python objects if someone tampers with the YAML. Use the safe loader.

-host=$(python3 -c 'import sys, yaml; print(yaml.load(sys.stdin)["database"]["host"])' < "$clp_config_path")
+host=$(python3 -c 'import sys, yaml; import yaml, sys; print(yaml.safe_load(sys.stdin)["database"]["host"])' < "$clp_config_path")

Repeat for the other four calls (port, name, user, password).

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/demo-assets/init.sh around lines 22 to 27,
replace all instances of yaml.load with yaml.safe_load in the python3 commands
extracting database configuration values to prevent execution of arbitrary
Python objects. Update the commands for host, port, name, user, and password to
use yaml.safe_load instead of yaml.load.

echo "Metadata database port: $port"
echo "Metadata database name: $name"
echo "Metadata database user: $user"
echo "Metadata database password: $password"

local env_path="${SCRIPT_PATH}/../.env"
cp "$env_path" "${env_path}.bak"
sed -i "s|^PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL=.*|PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_URL=\"jdbc:mysql://${host}:${port}\"|" "$env_path"
sed -i "s/^PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME=.*/PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_NAME=\"${name}\"/" "$env_path"
sed -i "s/^PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER=.*/PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_USER=\"${user}\"/" "$env_path"
sed -i "s/^PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD=.*/PRESTO_COORDINATOR_CONFIG_CLPPROPERTIES_METADATA_DATABASE_PASSWORD=\"${password}\"/" "$env_path"
sed -i "s|^CLP_PACKAGE_ARCHIVES=.*|CLP_PACKAGE_ARCHIVES=\"$(readlink -f ${clp_pkg_home})/var/data/archives/default\"|" "$env_path"
}

if declare -f "$1" > /dev/null; then
"$@"
else
echo "Error: '$1' is not a valid function name."
echo "Available functions:"
declare -F | awk '{print $3}'
exit 1
fi

39 changes: 39 additions & 0 deletions tools/deployment/presto-clp/docker-compose.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
version: "3.9"

services:
presto-coordinator:
image: ghcr.io/y-scope/presto/coordinator:dev
entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
env_file:
- .env
volumes:
- ./coordinator/scripts:/scripts:ro
- coordinator-config:/opt/presto-server/etc
- ./coordinator/config-template:/configs:ro
networks:
- presto

presto-worker:
image: ghcr.io/y-scope/presto/prestissimo-worker:dev
depends_on:
presto-coordinator:
condition: service_started
entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
env_file:
- .env
volumes:
- ./worker/scripts:/scripts:ro
- worker-config:/opt/presto-server/etc
- ./worker/config-template:/configs:ro
- "${CLP_PACKAGE_ARCHIVES}:${CLP_PACKAGE_ARCHIVES}"
networks:
- presto

volumes:
coordinator-config:
worker-config:

networks:
presto:
driver: bridge

Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
connector.name=clp

Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
discovery.uri=http://${PRESTO_COORDINATOR_SERVICENAME}:${PRESTO_COORDINATOR_HTTPPORT}
presto.version=REPLACE_ME
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Prefer an env placeholder over hard-coded REPLACE_ME for Presto version.

Replacing REPLACE_ME via sed in the bootstrap script works, but using an env var keeps the template declarative and avoids sed-ordering bugs:

-presto.version=REPLACE_ME
+presto.version=${PRESTO_VERSION}

The script can then simply export PRESTO_VERSION=$(curl …) and rely on envsubst.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
presto.version=REPLACE_ME
presto.version=${PRESTO_VERSION}
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/worker/config-template/config.properties at line
2, replace the hard-coded "REPLACE_ME" for presto.version with an environment
variable placeholder like "${PRESTO_VERSION}". This change allows the bootstrap
script to export PRESTO_VERSION dynamically and use envsubst for substitution,
making the template declarative and avoiding sed ordering issues.

http-server.http.port=${PRESTO_WORKER_HTTPPORT}
shutdown-onset-sec=1
register-test-functions=false
runtime-metrics-collection-enabled=false

Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
node.internal-address=REPLACE_ME
node.location=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION}
node.id=REPLACE_ME
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Environment variable name may be misleading for the worker template.

node.environment expands ${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}, which is defined in the coordinator env file. That couples the worker to the coordinator implementation detail and will break if someone supplies only worker.env.
Rename to a worker-scoped variable or alias it in worker.env to avoid hidden dependency.

-node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
+node.environment=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_ENVIRONMENT}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
node.environment=${PRESTO_COORDINATOR_CONFIG_NODEPROPERTIES_ENVIRONMENT}
node.internal-address=REPLACE_ME
node.location=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION}
node.id=REPLACE_ME
node.environment=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_ENVIRONMENT}
node.internal-address=REPLACE_ME
node.location=${PRESTO_WORKER_CONFIG_NODEPROPERTIES_LOCATION}
node.id=REPLACE_ME
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/worker/config-template/node.properties lines 1 to
4, the environment variable used for node.environment is referencing a
coordinator-specific variable, creating an unintended dependency. Change the
variable to a worker-specific environment variable by renaming it to something
like PRESTO_WORKER_CONFIG_NODEPROPERTIES_ENVIRONMENT or create an alias in the
worker.env file that maps to the coordinator variable to avoid coupling and
ensure the worker config is self-contained.


Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
mutable-config=true

Loading
Loading