-
Notifications
You must be signed in to change notification settings - Fork 83
feat(presto-clp): Add Docker compose setup for Presto cluster that can connect to clp-json. #1132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 36 commits
3a84840
9574059
8dd8795
a8af40c
2015a4a
b78682f
1247bba
1400a90
91040d1
9ec648c
4097478
b1ce135
70a56d7
99ac4e1
6dde297
b22746d
708756d
2eeda5c
8bb98f8
9fc487d
64e8edf
00bd3f1
28f5b70
cccf9fe
04f3d2a
3381eb8
bea5bb6
1f68965
94b0210
22c53e0
2f0de28
d621bf3
7300d20
4bc2f58
80c41d4
9d2146b
ff06c62
cf11694
2618f13
782b87d
8107518
4d896e5
713670a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,13 @@ Using object storage | |
Using CLP to ingest logs from object storage and store archives on object storage. | ||
::: | ||
|
||
:::{grid-item-card} | ||
:link: guides-using-presto | ||
Using Presto with CLP | ||
^^^ | ||
How to use Presto to query compressed logs in CLP. | ||
::: | ||
Comment on lines
+15
to
+20
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧹 Nitpick (assertive) Maintain link-path conventions for guide cards. Existing cards either use an 🤖 Prompt for AI Agents
|
||
|
||
:::{grid-item-card} | ||
:link: guides-multi-node | ||
Multi-node deployment | ||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,126 @@ | ||||||||||||||||||||
# Using Presto with CLP | ||||||||||||||||||||
|
||||||||||||||||||||
[Presto] is a distributed SQL query engine that can be used to query data stored in CLP (using SQL). | ||||||||||||||||||||
This guide describes how to set up and use Presto with CLP. | ||||||||||||||||||||
|
||||||||||||||||||||
:::{warning} | ||||||||||||||||||||
Currently, only the [clp-json](quick-start/clp-json.md) flavor of CLP supports queries through | ||||||||||||||||||||
Presto. | ||||||||||||||||||||
::: | ||||||||||||||||||||
|
||||||||||||||||||||
:::{note} | ||||||||||||||||||||
Currently, this integration with Presto is under development and may change in the future. It is | ||||||||||||||||||||
also being maintained in a [fork][yscope-presto] of the Presto project. We are working on merging | ||||||||||||||||||||
these changes into the main Presto repository so that you can use official Presto releases with CLP. | ||||||||||||||||||||
::: | ||||||||||||||||||||
|
||||||||||||||||||||
## Requirements | ||||||||||||||||||||
|
||||||||||||||||||||
* [CLP][clp-releases] (clp-json) v0.4.0 or higher | ||||||||||||||||||||
* [Docker] v28 or higher | ||||||||||||||||||||
* [Docker Compose][docker-compose] v2.20.2 or higher | ||||||||||||||||||||
* Python | ||||||||||||||||||||
* python3-venv (for the version of Python installed) | ||||||||||||||||||||
|
||||||||||||||||||||
## Set up | ||||||||||||||||||||
|
||||||||||||||||||||
Using Presto with CLP requires: | ||||||||||||||||||||
|
||||||||||||||||||||
* [Setting up CLP](#setting-up-clp) and compressing some logs. | ||||||||||||||||||||
* [Setting up Presto](#setting-up-presto) to query CLP's metadata database and archives. | ||||||||||||||||||||
|
||||||||||||||||||||
### Setting up CLP | ||||||||||||||||||||
|
||||||||||||||||||||
Follow the [quick-start](./quick-start/index.md) guide to set up CLP and compress your logs. A | ||||||||||||||||||||
sample dataset that works well with Presto is [postgresql]. | ||||||||||||||||||||
|
||||||||||||||||||||
### Setting up Presto | ||||||||||||||||||||
|
||||||||||||||||||||
1. Clone the CLP repository: | ||||||||||||||||||||
|
||||||||||||||||||||
```bash | ||||||||||||||||||||
git clone https://github.com/y-scope/clp.git | ||||||||||||||||||||
``` | ||||||||||||||||||||
|
||||||||||||||||||||
2. Navigate to the `tools/deployment/presto-clp` directory in your terminal. | ||||||||||||||||||||
3. Run the following script to generate the necessary config for Presto to work with CLP: | ||||||||||||||||||||
|
||||||||||||||||||||
```bash | ||||||||||||||||||||
scripts/set-up-config.sh <clp-json-dir> | ||||||||||||||||||||
``` | ||||||||||||||||||||
|
||||||||||||||||||||
* `<clp-json-dir>` is the location of the clp-json package you set up in the previous section. | ||||||||||||||||||||
|
||||||||||||||||||||
4. Start a Presto cluster by running: | ||||||||||||||||||||
|
||||||||||||||||||||
```bash | ||||||||||||||||||||
docker compose up | ||||||||||||||||||||
``` | ||||||||||||||||||||
|
||||||||||||||||||||
* To use more than Presto worker, you can use the `--scale` option as follows: | ||||||||||||||||||||
|
||||||||||||||||||||
```bash | ||||||||||||||||||||
Comment on lines
+87
to
+89
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧹 Nitpick (assertive) Grammar: insert missing word in scaling instruction - * To use more than Presto worker, you can use the `--scale` option as follows:
+ * To use more than one Presto worker, you can use the `--scale` option as follows: 📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents
|
||||||||||||||||||||
docker compose up --scale presto-worker=<num-workers> | ||||||||||||||||||||
``` | ||||||||||||||||||||
|
||||||||||||||||||||
* `<num-workers>` is the number of Presto worker nodes you want to run. | ||||||||||||||||||||
|
* To use more than Presto worker, you can use the `--scale` option as follows: | |
```bash | |
docker compose up --scale presto-worker=<num-workers> | |
``` | |
* `<num-workers>` is the number of Presto worker nodes you want to run. | |
* To use multiple Presto workers (beneficial for larger datasets or improved query parallelism), you can use the `--scale` option as follows: |
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-using-presto.md around lines 53 to 59, add a brief
explanation after the scaling instructions to clarify when using multiple Presto
workers is beneficial, such as improving query performance or handling larger
workloads. This will provide users with context on why and when to scale the
number of worker nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also mention that this only works when the user didn't setup any "required" metadata filters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. That should probably go in the section about configuring the metadata filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Define the abbreviation “kv-pairs” for clarity
Readers who are new to CLP may not immediately recognise the shorthand.
-All kv-pairs in each log event can be queried directly using dot-notation.
+All key-value (KV) pairs in each log event can be queried directly using dot notation.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
All kv-pairs in each log event can be queried directly using dot-notation. For example, if your logs | |
contain the field `foo.bar`, you can query it using: | |
```sql | |
SELECT foo.bar FROM default LIMIT 1; | |
``` | |
All key-value (KV) pairs in each log event can be queried directly using dot notation. For example, if your logs | |
contain the field `foo.bar`, you can query it using: | |
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-using-presto.md around lines 129 to 134, the term
"kv-pairs" is used without defining the abbreviation. Update the text to define
"kv-pairs" as "key-value pairs" the first time it appears to improve clarity for
readers unfamiliar with the shorthand.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anlowee Do you know exactly what characters are allowed and disallowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chars except _
when using Presstimo
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -103,7 +103,8 @@ tasks: | |
components/package-template/src/etc \ | ||
docs \ | ||
taskfile.yaml \ | ||
taskfiles | ||
taskfiles \ | ||
tools/deployment | ||
|
||
check-cpp-format: | ||
sources: &cpp_source_files | ||
|
@@ -772,6 +773,7 @@ tasks: | |
- "components/clp-py-utils/clp_py_utils" | ||
- "components/core/tools/scripts/utils" | ||
- "components/job-orchestration/job_orchestration" | ||
- "tools/deployment" | ||
- "tools/scripts" | ||
Comment on lines
+776
to
777
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧹 Nitpick (assertive) Black/Ruff may recurse through large non-code trees under tools/deployment. Running 🤖 Prompt for AI Agents
|
||
- "docs/conf" | ||
cmd: |- | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
PRESTO_COORDINATOR_HTTPPORT=8080 | ||
PRESTO_COORDINATOR_SERVICENAME=presto-coordinator | ||
|
||
# node.properties | ||
PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT=production | ||
anlowee marked this conversation as resolved.
Show resolved
Hide resolved
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# clp.properties | ||
PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE=mysql | ||
PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER=mysql | ||
|
||
# config.properties | ||
PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY=1GB | ||
PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE=1GB | ||
|
||
# jvm.config | ||
PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE=4G | ||
PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE=32M | ||
|
||
# log.properties | ||
PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL=INFO |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
connector.name=clp | ||
clp.metadata-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE} | ||
clp.metadata-db-url=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL} | ||
clp.metadata-db-name=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME} | ||
clp.metadata-db-user=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_USER} | ||
clp.metadata-db-password=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_PASSWORD} | ||
clp.metadata-table-prefix=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_TABLE_PREFIX} | ||
clp.split-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER} | ||
clp.metadata-filter-config=/opt/presto-server/etc/metadata-filter.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
coordinator=true | ||
node-scheduler.include-coordinator=false | ||
http-server.http.port=${PRESTO_COORDINATOR_HTTPPORT} | ||
query.max-memory=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY} | ||
query.max-memory-per-node=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE} | ||
discovery-server.enabled=true | ||
discovery.uri=${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
🤖 Prompt for AI Agents
|
||
optimizer.optimize-hash-generation=false | ||
regex-library=RE2J | ||
use-alternative-function-signatures=true | ||
inline-sql-functions=false | ||
nested-data-serialization-enabled=false | ||
native-execution-enabled=true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
-server | ||
-Xmx${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE} | ||
-XX:+UseG1GC | ||
-XX:G1HeapRegionSize=${PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE} | ||
-XX:+UseGCOverheadLimit | ||
-XX:+ExplicitGCInvokesConcurrent | ||
-XX:+HeapDumpOnOutOfMemoryError | ||
-XX:+ExitOnOutOfMemoryError | ||
-Djdk.attach.allowAttachSelf=true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
com.facebook.presto=${PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
{ | ||
} | ||
Comment on lines
+1
to
+2
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧹 Nitpick (assertive) Provide a minimal example or document expected schema An empty JSON object is syntactically valid, but future maintainers may be unsure what keys are supported. A commented exemplar or pointer to docs beside this file would improve clarity without affecting runtime. 🤖 Prompt for AI Agents
Comment on lines
+1
to
+2
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @anlowee I think we should explain to the user how to configure this for the timestamp field in their logs, right? And also that it may need to be different for each dataset they compress. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we direct them to the related presto-doc section? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's not published and is more general than they need, right? We should write a simplified section for them here. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
node.environment=${PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT} | ||
node.id=${PRESTO_COORDINATOR_SERVICENAME} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#!/usr/bin/env bash | ||
|
||
set -eu | ||
set -o pipefail | ||
|
||
readonly PRESTO_CONFIG_DIR="/opt/presto-server/etc" | ||
|
||
# Substitute environment variables in config template | ||
find /configs -type f | while read -r f; do | ||
( | ||
echo "cat <<EOF" | ||
cat "$f" | ||
echo "EOF" | ||
) | sh >"${PRESTO_CONFIG_DIR}/$(basename "$f")" | ||
done | ||
|
||
# Remove existing catalog files that exist in the image and add the CLP catalog | ||
rm -f "${PRESTO_CONFIG_DIR}/catalog/"* | ||
mv "${PRESTO_CONFIG_DIR}/clp.properties" "${PRESTO_CONFIG_DIR}/catalog" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
services: | ||
presto-coordinator: | ||
image: "ghcr.io/y-scope/presto/coordinator:dev" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion Consider using stable image tags instead of 'dev' Using the 'dev' tag in production deployments can lead to unpredictable behavior as the image content may change without notice. Consider using specific version tags or at minimum a more stable tag: - image: "ghcr.io/y-scope/presto/coordinator:dev"
+ image: "ghcr.io/y-scope/presto/coordinator:${PRESTO_VERSION:-latest}" And similarly for the worker: - image: "ghcr.io/y-scope/presto/prestissimo-worker:dev"
+ image: "ghcr.io/y-scope/presto/prestissimo-worker:${PRESTO_VERSION:-latest}" 🤖 Prompt for AI Agents
|
||
entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"] | ||
env_file: | ||
- ".env" | ||
- "coordinator-common.env" | ||
- "coordinator.env" | ||
volumes: | ||
- "./coordinator/config-template:/configs:ro" | ||
- "./coordinator/scripts:/scripts:ro" | ||
- "coordinator-config:/opt/presto-server/etc" | ||
networks: | ||
- "presto" | ||
healthcheck: | ||
test: | ||
- "CMD" | ||
- "curl" | ||
- "-f" | ||
- "${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI}/v1/info" | ||
interval: "10s" | ||
retries: 30 | ||
|
||
presto-worker: | ||
image: "ghcr.io/y-scope/presto/prestissimo-worker:dev" | ||
depends_on: | ||
presto-coordinator: | ||
condition: "service_healthy" | ||
entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"] | ||
env_file: | ||
- ".env" | ||
- "coordinator-common.env" | ||
- "worker.env" | ||
volumes: | ||
- "./worker/config-template:/configs:ro" | ||
- "./worker/scripts:/scripts:ro" | ||
- "${CLP_ARCHIVES_DIR}:${CLP_ARCHIVES_DIR}" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 💡 Verification agent 🧩 Analysis chainHost path mount requires documentation The Verify that the CLP_ARCHIVES_DIR environment variable is properly documented and validated: 🏁 Script executed: #!/bin/bash
# Search for documentation or validation of CLP_ARCHIVES_DIR
rg -A 5 -B 5 "CLP_ARCHIVES_DIR" Length of output: 2006 🏁 Script executed: #!/bin/bash
# Inspect environment templates for CLP_ARCHIVES_DIR entries and comments
echo "=== coordinator-common.env ==="
sed -n '1,200p' tools/deployment/presto-clp/coordinator-common.env
echo "---"
echo "=== worker.env ==="
sed -n '1,200p' tools/deployment/presto-clp/worker.env Length of output: 524 Document and validate CLP_ARCHIVES_DIR The • Add
This will make setup instructions clearer and surface any potential security issues before runtime. 🤖 Prompt for AI Agents
|
||
- "worker-config:/opt/presto-server/etc" | ||
networks: | ||
- "presto" | ||
|
||
volumes: | ||
coordinator-config: | ||
worker-config: | ||
|
||
networks: | ||
presto: | ||
driver: "bridge" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
/.venv/ |
Uh oh!
There was an error while loading. Please reload this page.