Merged
Changes from 38 commits
43 commits
3a84840
Init
anlowee Jul 27, 2025
9574059
Add new files to yaml linting; Fix yamllint violations.
kirkrodrigues Jul 27, 2025
8dd8795
Remove unnecessarily blank lines.
kirkrodrigues Jul 27, 2025
a8af40c
Remove unnecesary script.
kirkrodrigues Jul 27, 2025
2015a4a
Replace demo-assets/init.sh and demo CLP config file with more robust…
kirkrodrigues Jul 27, 2025
b78682f
Add missing return on error.
kirkrodrigues Jul 27, 2025
1247bba
Apply shell linters to worker/scripts/generate-configs.sh
kirkrodrigues Jul 28, 2025
1400a90
Use Docker compose to wait for the coordinator to be ready.
kirkrodrigues Jul 28, 2025
91040d1
Use jq to parse Presto version info.
kirkrodrigues Jul 28, 2025
9ec648c
Clean-up wget command.
kirkrodrigues Jul 28, 2025
4097478
Use /usr/bin/env.
kirkrodrigues Jul 28, 2025
b1ce135
Use function to update kv-pairs in config file. Set kv-pairs if they …
kirkrodrigues Jul 28, 2025
70a56d7
Move getting Presto coordinator version into a function.
kirkrodrigues Jul 28, 2025
99ac4e1
Minor edits for consistency.
kirkrodrigues Jul 28, 2025
6dde297
fix: Set error policies.
kirkrodrigues Jul 28, 2025
b22746d
Mark constants readonly.
kirkrodrigues Jul 28, 2025
708756d
Clean-up comments.
kirkrodrigues Jul 28, 2025
2eeda5c
Quote paths.
kirkrodrigues Jul 28, 2025
8bb98f8
Clean-up presto-clp/coordinator/scripts/generate-configs.sh.
kirkrodrigues Jul 28, 2025
9fc487d
Remove deprecated version property.
kirkrodrigues Jul 28, 2025
64e8edf
Alphabetize mounts.
kirkrodrigues Jul 28, 2025
00bd3f1
Rename environment variables for clarity.
kirkrodrigues Jul 28, 2025
28f5b70
fix: Remove spurious equals sign.
kirkrodrigues Jul 28, 2025
cccf9fe
Lint set-up-config.sh.
kirkrodrigues Jul 28, 2025
04f3d2a
Add docs and remove README.
kirkrodrigues Jul 28, 2025
3381eb8
Set coordinator log level to INFO.
kirkrodrigues Jul 28, 2025
bea5bb6
Validate CLP metadata database type.
kirkrodrigues Jul 28, 2025
1f68965
Use logging function rather than echos.
kirkrodrigues Jul 28, 2025
94b0210
Reorder functions.
kirkrodrigues Jul 28, 2025
22c53e0
Add new docs to index.
kirkrodrigues Jul 28, 2025
2f0de28
Add S3 limitation.
kirkrodrigues Jul 28, 2025
d621bf3
Add clone step to docs.
kirkrodrigues Jul 28, 2025
7300d20
Add required CLP version to docs.
kirkrodrigues Jul 28, 2025
4bc2f58
Rename PRESTO_WORKER_HTTPPORT.
kirkrodrigues Jul 28, 2025
80c41d4
Remove unnecessary quotes from env var files.
kirkrodrigues Jul 28, 2025
9d2146b
Address some rabbit feedback.
kirkrodrigues Jul 28, 2025
ff06c62
Address the nested field limitation comment
anlowee Jul 28, 2025
cf11694
Add metadata filter config
anlowee Jul 28, 2025
2618f13
Merge branch 'main' into xwei/yscope-comose
anlowee Jul 28, 2025
782b87d
Docs edits.
kirkrodrigues Jul 28, 2025
8107518
Add error checking for config files not existing.
kirkrodrigues Jul 28, 2025
4d896e5
More docs edits.
kirkrodrigues Jul 28, 2025
713670a
Remove extra spaces.
kirkrodrigues Jul 28, 2025
7 changes: 7 additions & 0 deletions docs/src/user-guide/guides-overview.md
Original file line number Diff line number Diff line change
@@ -12,6 +12,13 @@ Using object storage
Using CLP to ingest logs from object storage and store archives on object storage.
:::

:::{grid-item-card}
:link: guides-using-presto
Using Presto with CLP
^^^
How to use Presto to query compressed logs in CLP.
:::
Comment on lines +15 to +20
Contributor

🧹 Nitpick (assertive)

Maintain link-path conventions for guide cards.

Existing cards either use an /index suffix or point to a directory that contains an index file. Confirm that guides-using-presto follows the same convention (directory with index.md) to avoid broken links, or append /index here for consistency.

🤖 Prompt for AI Agents
In docs/src/user-guide/guides-overview.md around lines 15 to 20, the link path
for the guide card "guides-using-presto" does not follow the established
convention of either using an /index suffix or pointing to a directory
containing an index.md file. Verify if "guides-using-presto" is a directory with
an index.md file; if not, append "/index" to the link path to maintain
consistency and prevent broken links.


:::{grid-item-card}
:link: guides-multi-node
Multi-node deployment
178 changes: 178 additions & 0 deletions docs/src/user-guide/guides-using-presto.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,178 @@
# Using Presto with CLP

[Presto] is a distributed SQL query engine that can be used to query data stored in CLP.
This guide describes how to set up and use Presto with CLP.

:::{warning}
Currently, only the [clp-json](quick-start/clp-json.md) flavor of CLP supports queries through
Presto.
:::

:::{note}
Currently, this integration with Presto is under development and may change in the future. It is
also being maintained in a [fork][yscope-presto] of the Presto project. We are working on merging
these changes into the main Presto repository so that you can use official Presto releases with CLP.
:::

## Requirements

* [CLP][clp-releases] (clp-json) v0.4.0 or higher
* [Docker] v28 or higher
* [Docker Compose][docker-compose] v2.20.2 or higher
* Python
* python3-venv (for the version of Python installed)

## Set up

Using Presto with CLP requires:

* [Setting up CLP](#setting-up-clp) and compressing some logs.
* [Setting up Presto](#setting-up-presto) to query CLP's metadata database and archives.

### Setting up CLP

Follow the [quick-start](./quick-start/index.md) guide to set up CLP and compress your logs. A
sample dataset that works well with Presto is [postgresql].

### Setting up Presto

1. Clone the CLP repository:

```bash
git clone https://github.com/y-scope/clp.git
```

2. Navigate to the `tools/deployment/presto-clp` directory in your terminal.
3. Run the following script to generate the necessary config for Presto to work with CLP:

```bash
scripts/set-up-config.sh <clp-json-dir>
```

* `<clp-json-dir>` is the location of the clp-json package you set up in the previous section.

The metadata filter config (i.e.,
`tools/deployment/presto-clp/coordinator/config-template/metadata-filter.json`) specifies which
columns are used to filter the splits that Presto will process. Here is an example:

```json
{
  "clp.default.default": [
    {
      "columnName": "timestamp",
      "rangeMapping": {
        "lowerBound": "begin_timestamp",
        "upperBound": "end_timestamp"
      },
      "required": false
    }
  ]
}
```

* `"clp.default.default"` is the filter's scope. A scope can be one of the following:
  * A catalog name
  * A fully-qualified schema name
  * A fully-qualified table name

  Filter configs under a particular scope will apply to all child scopes. For example, filter
  configs at the schema level will apply to all tables within that schema. In this example,
  the filter will only apply to the `default` table under the `default` schema of the `clp`
  catalog.
* `"columnName"` is the data column's name. To filter splits by timestamp, use the column that
  you specified as `--timestamp-key` when compressing your logs.

* `"rangeMapping"` is an optional object with the following properties:

  * `"lowerBound"` is the metadata column that represents the lower bound of values in a split
    for the data column.
  * `"upperBound"` is the metadata column that represents the upper bound of values in a split
    for the data column.

  In this example, CLP's metadata database stores two fields for each split (i.e., archive),
  `begin_timestamp` and `end_timestamp`, which hold the earliest and latest timestamps of the
  log messages compressed in that split. The data column's name must therefore be remapped to
  these two fields so that Presto can query the metadata database to retrieve the filtered
  splits.

* `"required"` is an optional field (defaults to `false`) that indicates whether the filter must
  be present in the translated metadata filter SQL query. If a required filter is missing or
  cannot be pushed down, the query will be rejected.

4. Start a Presto cluster by running:

```bash
docker compose up
```

* To use more than one Presto worker, you can use the `--scale` option as follows:

```bash
Comment on lines +87 to +89
Contributor

🧹 Nitpick (assertive)

Grammar: insert missing word in scaling instruction

-    * To use more than Presto worker, you can use the `--scale` option as follows:
+    * To use more than one Presto worker, you can use the `--scale` option as follows:
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-using-presto.md around lines 111 to 113, the
scaling instruction is missing a word for correct grammar. Insert the missing
word "one" after "more than" in the sentence to read "To use more than one
Presto worker, you can use the `--scale` option as follows:" to fix the grammar.

docker compose up --scale presto-worker=<num-workers>
```

* `<num-workers>` is the number of Presto worker nodes you want to run.
Contributor

🧹 Nitpick (assertive)

Add clarification about worker scaling

The scaling instructions could benefit from explaining when multiple workers would be beneficial.

Consider adding context about when to use multiple workers:

-    * To use more than Presto worker, you can use the `--scale` option as follows:
+    * To use multiple Presto workers (beneficial for larger datasets or improved query parallelism), you can use the `--scale` option as follows:
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-using-presto.md around lines 53 to 59, add a brief
explanation after the scaling instructions to clarify when using multiple Presto
workers is beneficial, such as improving query performance or handling larger
workloads. This will provide users with context on why and when to scale the
number of worker nodes.
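
The `rangeMapping` remap described for the metadata filter config in step 3 can be pictured with a small sketch. It assumes (the guide does not confirm this) that a range filter on the data column is rewritten into an overlap test on the two metadata columns. The helper name `to_metadata_predicate` and the shape of the generated predicate are hypothetical and for illustration only; scope inheritance is not modeled.

```python
import json

# Example filter config copied from the guide.
FILTER_CONFIG = json.loads("""
{
  "clp.default.default": [
    {
      "columnName": "timestamp",
      "rangeMapping": {
        "lowerBound": "begin_timestamp",
        "upperBound": "end_timestamp"
      },
      "required": false
    }
  ]
}
""")


def to_metadata_predicate(scope, column, lower, upper):
    """Hypothetical helper: rewrite `lower <= column <= upper` into a
    predicate over split-level metadata columns.

    A split can contain matching events only if its
    [lowerBound, upperBound] range overlaps the queried range.
    """
    for entry in FILTER_CONFIG.get(scope, []):
        if entry["columnName"] != column:
            continue
        mapping = entry.get("rangeMapping")
        if mapping is None:
            # Assumption: without a mapping, the metadata column shares the data column's name.
            return f"{column} >= {lower} AND {column} <= {upper}"
        return f"{mapping['upperBound']} >= {lower} AND {mapping['lowerBound']} <= {upper}"
    return None  # No filter is configured for this column.


print(to_metadata_predicate("clp.default.default", "timestamp", 1000, 2000))
```

Under these assumptions, an archive is skipped when its latest timestamp precedes the queried range or its earliest timestamp follows it.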


### Stopping the Presto cluster

To stop the Presto cluster, use CTRL + C.

To clean up the Presto cluster entirely:

```bash
docker compose rm
```

## Querying your logs through Presto

To query your logs through Presto, you can use the Presto CLI:

```bash
docker compose exec presto-coordinator \
  presto-cli \
    --catalog clp \
    --schema default
```

Each dataset in CLP shows up as a table in Presto. To show all available datasets:

```sql
SHOW TABLES;
```

If you didn't specify a dataset when compressing your logs in CLP, your logs will have been stored
in the `default` dataset. If you also didn't specify any metadata filters, you can query the logs
in this dataset:

```sql
SELECT * FROM default LIMIT 1;
```

All key-value pairs (kv-pairs) in each log event can be queried directly using dot notation. For example, if your logs
contain the field `foo.bar`, you can query it using:

```sql
SELECT foo.bar FROM default LIMIT 1;
```
Comment on lines +129 to +134
Contributor

🧹 Nitpick (assertive)

Define the abbreviation “kv-pairs” for clarity

Readers who are new to CLP may not immediately recognise the shorthand.

-All kv-pairs in each log event can be queried directly using dot-notation.
+All key-value (KV) pairs in each log event can be queried directly using dot notation.
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-using-presto.md around lines 129 to 134, the term
"kv-pairs" is used without defining the abbreviation. Update the text to define
"kv-pairs" as "key-value pairs" the first time it appears to improve clarity for
readers unfamiliar with the shorthand.


## Limitations

The Presto CLP integration has the following limitations at present:

* Nested fields containing special characters (i.e., any non-alphanumeric characters except `_`;
see [y-scope/presto#8]) are not supported. To get around this limitation, you will need to
preprocess your logs to remove such special characters.
* Only logs stored on the filesystem, rather than S3, can be queried through Presto.

These limitations will be addressed in a future release of the Presto integration.
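
The preprocessing workaround for the first limitation could look roughly like the sketch below. It assumes each log event is a JSON object and replaces every key character other than ASCII letters, digits, and `_` with `_`. The function name `sanitize_keys` and the choice of `_` as the replacement are illustrative assumptions, not part of CLP or Presto.

```python
import json
import re

# Characters outside [A-Za-z0-9_] are treated as "special" (assumption: ASCII only).
_SPECIAL = re.compile(r"[^A-Za-z0-9_]")


def sanitize_keys(value):
    """Recursively replace special characters in every object key with `_`."""
    if isinstance(value, dict):
        return {_SPECIAL.sub("_", k): sanitize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(v) for v in value]
    return value


line = '{"http-request": {"user.agent": "curl", "status_code": 200}}'
print(json.dumps(sanitize_keys(json.loads(line))))
```

Note that two distinct keys (e.g., `a-b` and `a.b`) would collide after this rewrite, so a real preprocessing step may need a collision check before compressing the logs.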

[clp-releases]: https://github.com/y-scope/clp/releases
[docker-compose]: https://docs.docker.com/compose/install/
[Docker]: https://docs.docker.com/engine/install/
[postgresql]: https://zenodo.org/records/10516401
[Presto]: https://prestodb.io/
[y-scope/presto#8]: https://github.com/y-scope/presto/issues/8
[yscope-presto]: https://github.com/y-scope/presto
1 change: 1 addition & 0 deletions docs/src/user-guide/index.md
Original file line number Diff line number Diff line change
@@ -62,6 +62,7 @@ quick-start/clp-text
guides-overview
guides-using-object-storage/index
guides-multi-node
guides-using-presto
:::

:::{toctree}
4 changes: 3 additions & 1 deletion taskfiles/lint.yaml
Original file line number Diff line number Diff line change
@@ -103,7 +103,8 @@ tasks:
components/package-template/src/etc \
docs \
taskfile.yaml \
taskfiles
taskfiles \
tools/deployment

check-cpp-format:
sources: &cpp_source_files
@@ -772,6 +773,7 @@ tasks:
- "components/clp-py-utils/clp_py_utils"
- "components/core/tools/scripts/utils"
- "components/job-orchestration/job_orchestration"
- "tools/deployment"
- "tools/scripts"
Comment on lines +776 to 777
Contributor

🧹 Nitpick (assertive)

Black/Ruff may recurse through large non-code trees under tools/deployment.

Running black . from tools/deployment will walk every sub-directory (configs, templates, logs). That can add noticeable time and, in worst cases, choke on extremely large data files a user may drop under that tree. Consider narrowing the path to tools/deployment/**/*.py or adding an --extend-exclude for obvious non-code directories (e.g., config-template, scripts/generated).

🤖 Prompt for AI Agents
In taskfiles/lint.yaml around lines 776 to 777, the current lint paths include
"tools/deployment" which causes Black and Ruff to recursively process large
non-code directories, slowing down linting and potentially causing errors.
Modify the lint paths to only include Python files under "tools/deployment" by
changing the path to "tools/deployment/**/*.py" or add an --extend-exclude
option to exclude known non-code directories like "configs" or "templates" to
prevent unnecessary recursion.

- "docs/conf"
cmd: |-
5 changes: 5 additions & 0 deletions tools/deployment/presto-clp/coordinator-common.env
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
PRESTO_COORDINATOR_HTTPPORT=8080
PRESTO_COORDINATOR_SERVICENAME=presto-coordinator

# node.properties
PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT=production
14 changes: 14 additions & 0 deletions tools/deployment/presto-clp/coordinator.env
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# clp.properties
PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE=mysql
PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER=mysql

# config.properties
PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY=1GB
PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE=1GB

# jvm.config
PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE=4G
PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE=32M

# log.properties
PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL=INFO
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
connector.name=clp
clp.metadata-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE}
clp.metadata-db-url=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL}
clp.metadata-db-name=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME}
clp.metadata-db-user=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_USER}
clp.metadata-db-password=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_PASSWORD}
clp.metadata-table-prefix=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_TABLE_PREFIX}
clp.split-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER}
clp.metadata-filter-config=/opt/presto-server/etc/metadata-filter.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=${PRESTO_COORDINATOR_HTTPPORT}
query.max-memory=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY}
query.max-memory-per-node=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE}
discovery-server.enabled=true
discovery.uri=${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI}
Contributor

⚠️ Potential issue

discovery.uri placeholder lacks a definition
${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI} is not set anywhere in the provided env files, so the generated property will be empty. Presto’s coordinator refuses to start without a valid discovery URI. Add the variable or build the URI inside the template from existing pieces (e.g., service name + port).

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/config.properties at
line 7, the placeholder ${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI} is
undefined, causing the discovery.uri property to be empty and preventing Presto
coordinator startup. Define the environment variable
PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI in the relevant env files with
a valid URI or modify the template to construct the discovery URI dynamically
using existing variables such as service name and port.

optimizer.optimize-hash-generation=false
regex-library=RE2J
use-alternative-function-signatures=true
inline-sql-functions=false
nested-data-serialization-enabled=false
native-execution-enabled=true
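
The review comment above suggests building the discovery URI from existing pieces. One way that could be done is sketched below. It assumes the `http` scheme and reuses the two variables from `coordinator-common.env`; the helper name `build_discovery_uri` is hypothetical, and this is not necessarily how the PR's set-up script populates the variable.

```python
def build_discovery_uri(env):
    """Hypothetical helper: derive the discovery URI from the service name and port."""
    service = env["PRESTO_COORDINATOR_SERVICENAME"]
    port = int(env["PRESTO_COORDINATOR_HTTPPORT"])  # Fails early on a non-numeric port.
    return f"http://{service}:{port}"


# Values copied from coordinator-common.env in this change.
env = {
    "PRESTO_COORDINATOR_HTTPPORT": "8080",
    "PRESTO_COORDINATOR_SERVICENAME": "presto-coordinator",
}
print(build_discovery_uri(env))
```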
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
-server
-Xmx${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE}
-XX:+UseG1GC
-XX:G1HeapRegionSize=${PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE}
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
com.facebook.presto=${PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL}
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
{
}
Comment on lines +1 to +2
Contributor

🧹 Nitpick (assertive)

Provide a minimal example or document expected schema

An empty JSON object is syntactically valid, but future maintainers may be unsure what keys are supported. A commented exemplar or pointer to docs beside this file would improve clarity without affecting runtime.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/coordinator/config-template/metadata-filter.json
at lines 1 to 2, the JSON file is currently empty, which may confuse future
maintainers about the expected keys. Add a minimal example JSON object with
typical keys and values or include comments or a reference to documentation
explaining the expected schema to improve clarity without impacting runtime
behavior.

Comment on lines +1 to +2
Member

@anlowee I think we should explain to the user how to configure this for the timestamp field in their logs, right? And also that it may need to be different for each dataset they compress.

Contributor Author

Can we direct them to the related presto-doc section?

Member

That's not published and is more general than they need, right? We should write a simplified section for them here.

Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
node.environment=${PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT}
node.id=${PRESTO_COORDINATOR_SERVICENAME}
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
#!/usr/bin/env bash

set -eu
set -o pipefail

readonly PRESTO_CONFIG_DIR="/opt/presto-server/etc"

# Substitute environment variables in config template
find /configs -type f | while read -r f; do
    (
        echo "cat <<EOF"
        cat "$f"
        echo "EOF"
    ) | sh >"${PRESTO_CONFIG_DIR}/$(basename "$f")"
done

# Remove existing catalog files that exist in the image and add the CLP catalog
rm -f "${PRESTO_CONFIG_DIR}/catalog/"*
mv "${PRESTO_CONFIG_DIR}/clp.properties" "${PRESTO_CONFIG_DIR}/catalog"
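
The substitution loop in the script above expands `${VAR}` placeholders by feeding each template to `sh` as a heredoc. A rough Python equivalent of that one step is sketched below, using the standard library's `string.Template`. The function name `render_template` is hypothetical, and the behaviour differs in two ways: no shell command substitution is performed, and an undefined variable raises an error, whereas the heredoc approach would typically expand it to an empty string.

```python
from string import Template


def render_template(text, env):
    """Expand ${VAR} placeholders in `text` using values from `env`.

    Raises KeyError if a placeholder has no value in `env`.
    """
    return Template(text).substitute(env)


# Template line copied from the node.properties template in this change.
template = "node.environment=${PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT}\n"
env = {"PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT": "production"}
print(render_template(template, env), end="")
```

Failing on undefined variables surfaces a misconfigured env file at start-up rather than producing an empty property value.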
48 changes: 48 additions & 0 deletions tools/deployment/presto-clp/docker-compose.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
services:
  presto-coordinator:
    image: "ghcr.io/y-scope/presto/coordinator:dev"
Contributor

🛠️ Refactor suggestion

Consider using stable image tags instead of 'dev'

Using the 'dev' tag in production deployments can lead to unpredictable behavior as the image content may change without notice.

Consider using specific version tags or at minimum a more stable tag:

-    image: "ghcr.io/y-scope/presto/coordinator:dev"
+    image: "ghcr.io/y-scope/presto/coordinator:${PRESTO_VERSION:-latest}"

And similarly for the worker:

-    image: "ghcr.io/y-scope/presto/prestissimo-worker:dev"
+    image: "ghcr.io/y-scope/presto/prestissimo-worker:${PRESTO_VERSION:-latest}"
🤖 Prompt for AI Agents
In tools/deployment/presto-clp/docker-compose.yaml at line 3, the image tag
"dev" is used, which is unstable for production. Replace the "dev" tag with a
specific version number or a stable tag to ensure predictable and consistent
deployments. Also, apply the same change to the worker image tag to maintain
stability across all services.

    entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
    env_file:
      - ".env"
      - "coordinator-common.env"
      - "coordinator.env"
    volumes:
      - "./coordinator/config-template:/configs:ro"
      - "./coordinator/scripts:/scripts:ro"
      - "coordinator-config:/opt/presto-server/etc"
    networks:
      - "presto"
    healthcheck:
      test:
        - "CMD"
        - "curl"
        - "-f"
        - "${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI}/v1/info"
      interval: "10s"
      retries: 30

  presto-worker:
    image: "ghcr.io/y-scope/presto/prestissimo-worker:dev"
    depends_on:
      presto-coordinator:
        condition: "service_healthy"
    entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
    env_file:
      - ".env"
      - "coordinator-common.env"
      - "worker.env"
    volumes:
      - "./worker/config-template:/configs:ro"
      - "./worker/scripts:/scripts:ro"
      - "${CLP_ARCHIVES_DIR}:${CLP_ARCHIVES_DIR}"
Contributor

💡 Verification agent

🧩 Analysis chain

Host path mount requires documentation

The ${CLP_ARCHIVES_DIR}:${CLP_ARCHIVES_DIR} mount binds a host directory, which requires proper setup and may have security implications.

Verify that the CLP_ARCHIVES_DIR environment variable is properly documented and validated:


🏁 Script executed:

#!/bin/bash
# Search for documentation or validation of CLP_ARCHIVES_DIR
rg -A 5 -B 5 "CLP_ARCHIVES_DIR"

Length of output: 2006


🏁 Script executed:

#!/bin/bash
# Inspect environment templates for CLP_ARCHIVES_DIR entries and comments
echo "=== coordinator-common.env ==="
sed -n '1,200p' tools/deployment/presto-clp/coordinator-common.env
echo "---"
echo "=== worker.env ==="
sed -n '1,200p' tools/deployment/presto-clp/worker.env

Length of output: 524


Document and validate CLP_ARCHIVES_DIR

The CLP_ARCHIVES_DIR variable is currently used in docker-compose.yaml to bind‐mount a host folder, but it’s not documented in your .env templates or user guide. Please:

• Add CLP_ARCHIVES_DIR (with a clear description, default value and security/permissions notes) to:
  • tools/deployment/presto-clp/coordinator-common.env
  • tools/deployment/presto-clp/worker.env

  or to your project README under “Presto CLP Deployment.”
• Ensure you validate at startup (or in generate-user-env-vars-file.py) that the directory exists and is writable, failing fast if not.

This will make setup instructions clearer and surface any potential security issues before runtime.

🤖 Prompt for AI Agents
In tools/deployment/presto-clp/docker-compose.yaml at line 37, the environment
variable CLP_ARCHIVES_DIR is used for a volume mount but lacks documentation and
validation. Add CLP_ARCHIVES_DIR with a clear description, default value, and
security/permissions notes to tools/deployment/presto-clp/coordinator-common.env
and tools/deployment/presto-clp/worker.env or the project README under "Presto
CLP Deployment." Additionally, update generate-user-env-vars-file.py or the
startup scripts to check that the directory specified by CLP_ARCHIVES_DIR exists
and is writable, and fail fast with an error if these conditions are not met.

      - "worker-config:/opt/presto-server/etc"
    networks:
      - "presto"

volumes:
  coordinator-config:
  worker-config:

networks:
  presto:
    driver: "bridge"
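
The fail-fast check that the review comment on `CLP_ARCHIVES_DIR` asks for might look like the sketch below. It follows the reviewer's wording (the directory must exist and be writable); the function name `validate_archives_dir` is hypothetical, and whether the check belongs in `generate-user-env-vars-file.py` or elsewhere is left open.

```python
import os
import tempfile


def validate_archives_dir(path):
    """Hypothetical fail-fast check: the directory must exist and be writable."""
    if not path:
        raise ValueError("CLP_ARCHIVES_DIR is not set")
    if not os.path.isdir(path):
        raise ValueError(f"CLP_ARCHIVES_DIR is not a directory: {path}")
    if not os.access(path, os.W_OK):
        raise ValueError(f"CLP_ARCHIVES_DIR is not writable: {path}")
    return os.path.abspath(path)


# Usage: a temporary directory stands in for the real archives directory.
with tempfile.TemporaryDirectory() as d:
    print(validate_archives_dir(d) == os.path.abspath(d))
```

Running such a check before `docker compose up` would report a bad path with a clear message instead of a mount error inside the worker container.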
1 change: 1 addition & 0 deletions tools/deployment/presto-clp/scripts/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
/.venv/