y-scope · anlowee · Jul 28, 2025 · Jul 27, 2025 · Jul 27, 2025 · Jul 27, 2025
@@ -12,6 +12,13 @@ Using object storage
 Using CLP to ingest logs from object storage and store archives on object storage.
 :::
 
+:::{grid-item-card}
+:link: guides-using-presto
+Using Presto with CLP
+^^^
+How to use Presto to query compressed logs in CLP.
+:::
+
 :::{grid-item-card}
 :link: guides-multi-node
 Multi-node deployment

@@ -0,0 +1,178 @@
+# Using Presto with CLP
+
+[Presto] is a distributed SQL query engine that can be used to query data stored in CLP (using SQL).
+This guide describes how to set up and use Presto with CLP.
+
+:::{warning}
+Currently, only the [clp-json](quick-start/clp-json.md) flavor of CLP supports queries through
+Presto.
+:::
+
+:::{note}
+Currently, this integration with Presto is under development and may change in the future. It is
+also being maintained in a [fork][yscope-presto] of the Presto project. We are working on merging
+these changes into the main Presto repository so that you can use official Presto releases with CLP.
+:::
+
+## Requirements
+
+* [CLP][clp-releases] (clp-json) v0.4.0 or higher
+* [Docker] v28 or higher
+* [Docker Compose][docker-compose] v2.20.2 or higher
+* Python
+* python3-venv (for the version of Python installed)
+
+## Set up
+
+Using Presto with CLP requires:
+
+* [Setting up CLP](#setting-up-clp) and compressing some logs.
+* [Setting up Presto](#setting-up-presto) to query CLP's metadata database and archives.
+
+### Setting up CLP
+
+Follow the [quick-start](./quick-start/index.md) guide to set up CLP and compress your logs. A
+sample dataset that works well with Presto is [postgresql].
+
+### Setting up Presto
+
+1. Clone the CLP repository:
+
+    ```bash
+    git clone https://github.com/y-scope/clp.git
+    ```
+
+2. Navigate to the `tools/deployment/presto-clp` directory in your terminal.
+3. Run the following script to generate the necessary config for Presto to work with CLP:
+
+    ```bash
+    scripts/set-up-config.sh <clp-json-dir>
+    ```
+
+    * `<clp-json-dir>` is the location of the clp-json package you set up in the previous section.
+
+    Note that for the metadata filter config (i.e., 
+    `tools/deployment/presto-clp/coordinator/config-template/metadata-filter.json`), it is a config
+    to indicate which columns are used for filtering splits that will be processed by Presto. Here
+    is an example:
+
+    ```json
+    {
+      "clp.default.default": [
+        {
+          "columnName": "timestamp",
+          "rangeMapping": {
+            "lowerBound": "begin_timestamp",
+            "upperBound": "end_timestamp"
+          },
+          "required": false
+        }
+      ]
+    }
+    ```
+
+    * `"clp.default.default"` is the filter's scope. A scope can be one of the following:
+
+      * A catalog name
+      * A fully-qualified schema name
+      * A fully-qualified table name
+
+      Filter configs under a particular scope will apply to all child scopes. For example, filter
+      configs at the schema level will apply to all tables within that schema. In this example,
+      the filter will only apply to the `default` table under the `default` schema of the `clp`
+      catalog.
+
+    * `"columnName"` is the data column's name. You can use the column used as `--timestamp-key`
+    when compressing if you want to filter splits by timestamp.
+
+    * `"rangeMapping"` is an optional object with the following properties:
+
+      * `"lowerBound"` is the metadata column that represents the lower bound of values in a split
+      for the data column.
+      * `"upperBound"` is the metadata column that represents the upper bound of values in a split
+      for the data column.
+
+      In this example, since in CLP's metadata database, for each split (i.e., archive) there are
+      two fields `begin_timestamp` and `end_timestamp` to store the earilest and latest timestamps
+      of the log messages compressed in that split, we have to remap the original data column's
+      name to these two fields so that it can query the metadata database to retrieve filtered
+      splits.
+
+    * `"required"` is an optional field (defaults to false) which indicates whether the filter must
+    be present in the translated metadata filter SQL query. If a required filter is missing or
+    cannot be pushed down, the query will be rejected.
+
+4. Start a Presto cluster by running:
+
+    ```bash
+    docker compose up
+    ```
+
+    * To use more than Presto worker, you can use the `--scale` option as follows:
+
+      ```bash
-    * To use more than Presto worker, you can use the `--scale` option as follows:
-
-      ```bash
+* To use more than one Presto worker, you can use the `--scale` option as follows:
-    * To use more than Presto worker, you can use the `--scale` option as follows:
-
-      ```bash
+* To use more than one Presto worker, you can use the `--scale` option as follows:
+      docker compose up --scale presto-worker=<num-workers>
+      ```
+
+      * `<num-workers>` is the number of Presto worker nodes you want to run.
-    * To use more than Presto worker, you can use the `--scale` option as follows:
-
-      ```bash
-      docker compose up --scale presto-worker=<num-workers>
-      ```
-
-      * `<num-workers>` is the number of Presto worker nodes you want to run.
+* To use multiple Presto workers (beneficial for larger datasets or improved query parallelism), you can use the `--scale` option as follows:
-    * To use more than Presto worker, you can use the `--scale` option as follows:
-
-      ```bash
-      docker compose up --scale presto-worker=<num-workers>
-      ```
-
-      * `<num-workers>` is the number of Presto worker nodes you want to run.
+* To use multiple Presto workers (beneficial for larger datasets or improved query parallelism), you can use the `--scale` option as follows:
+
+### Stopping the Presto cluster
+
+To stop the Presto cluster, use CTRL + C.
+
+To clean up the Presto cluster entirely:
+
+```bash
+docker compose rm
+```
+
+## Querying your logs through Presto
+
+To query your logs through Presto, you can use the Presto CLI:
+
+```bash
+docker compose exec presto-coordinator \
+  presto-cli \
+    --catalog clp \
+    --schema default
+```
+
+Each dataset in CLP shows up as a table in Presto. To show all available datasets:
+
+```sql
+SHOW TABLES;
+```
+
+If you didn't specify a dataset when compressing your logs in CLP, your logs will have been stored
+in the `default` dataset. If you also didn't specify any metadata filters, you can query the logs
+in this dataset:
+
+```sql
+SELECT * FROM default LIMIT 1;
+```
+
+All kv-pairs in each log event can be queried directly using dot-notation. For example, if your logs
+contain the field `foo.bar`, you can query it using:
+
+```sql
+SELECT foo.bar FROM default LIMIT 1;
+```
-All kv-pairs in each log event can be queried directly using dot-notation. For example, if your logs
-contain the field `foo.bar`, you can query it using:
-
-```sql
-SELECT foo.bar FROM default LIMIT 1;
-```
+All key-value (KV) pairs in each log event can be queried directly using dot notation. For example, if your logs
+contain the field `foo.bar`, you can query it using:
+
-All kv-pairs in each log event can be queried directly using dot-notation. For example, if your logs
-contain the field `foo.bar`, you can query it using:
-
-```sql
-SELECT foo.bar FROM default LIMIT 1;
-```
+All key-value (KV) pairs in each log event can be queried directly using dot notation. For example, if your logs
+contain the field `foo.bar`, you can query it using:
+
+
+## Limitations
+
+The Presto CLP integration has the following limitations at present:
+
+* Nested fields containing special characters (i.e., any non-alphanumeric characters except `_`;
+see [y-scope/presto#8]). To get around this limitation,you will need to preprocess your logs to
+remove such special characters.
+* Only logs stored on the filesystem, rather than S3, can be queried through Presto.
+
+These limitations will be addressed in a future release of the Presto integration.
+
+[clp-releases]: https://github.com/y-scope/clp/releases
+[docker-compose]: https://docs.docker.com/compose/install/
+[Docker]: https://docs.docker.com/engine/install/
+[postgresql]: https://zenodo.org/records/10516401
+[Presto]: https://prestodb.io/
+[y-scope/presto#8]: https://github.com/y-scope/presto/issues/8
+[yscope-presto]: https://github.com/y-scope/presto
@@ -62,6 +62,7 @@ quick-start/clp-text
 guides-overview
 guides-using-object-storage/index
 guides-multi-node
+guides-using-presto
 :::
 
 :::{toctree}

@@ -103,7 +103,8 @@ tasks:
           components/package-template/src/etc \
           docs \
           taskfile.yaml \
-          taskfiles
+          taskfiles \
+          tools/deployment
 
   check-cpp-format:
     sources: &cpp_source_files
@@ -772,6 +773,7 @@ tasks:
           - "components/clp-py-utils/clp_py_utils"
           - "components/core/tools/scripts/utils"
           - "components/job-orchestration/job_orchestration"
+          - "tools/deployment"
           - "tools/scripts"
           - "docs/conf"
         cmd: |-

@@ -0,0 +1,5 @@
+PRESTO_COORDINATOR_HTTPPORT=8080
+PRESTO_COORDINATOR_SERVICENAME=presto-coordinator
+
+# node.properties
+PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT=production
@@ -0,0 +1,14 @@
+# clp.properties
+PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE=mysql
+PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER=mysql
+
+# config.properties
+PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY=1GB
+PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE=1GB
+
+# jvm.config
+PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE=4G
+PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE=32M
+
+# log.properties
+PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL=INFO
@@ -0,0 +1,9 @@
+connector.name=clp
+clp.metadata-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE}
+clp.metadata-db-url=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL}
+clp.metadata-db-name=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME}
+clp.metadata-db-user=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_USER}
+clp.metadata-db-password=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_PASSWORD}
+clp.metadata-table-prefix=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_TABLE_PREFIX}
+clp.split-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER}
+clp.metadata-filter-config=/opt/presto-server/etc/metadata-filter.json
@@ -0,0 +1,13 @@
+coordinator=true
+node-scheduler.include-coordinator=false
+http-server.http.port=${PRESTO_COORDINATOR_HTTPPORT}
+query.max-memory=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY}
+query.max-memory-per-node=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE}
+discovery-server.enabled=true
+discovery.uri=${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI}
+optimizer.optimize-hash-generation=false
+regex-library=RE2J
+use-alternative-function-signatures=true
+inline-sql-functions=false
+nested-data-serialization-enabled=false
+native-execution-enabled=true
@@ -0,0 +1,9 @@
+-server
+-Xmx${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE}
+-XX:+UseG1GC
+-XX:G1HeapRegionSize=${PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE}
+-XX:+UseGCOverheadLimit
+-XX:+ExplicitGCInvokesConcurrent
+-XX:+HeapDumpOnOutOfMemoryError
+-XX:+ExitOnOutOfMemoryError
+-Djdk.attach.allowAttachSelf=true
@@ -0,0 +1 @@
+com.facebook.presto=${PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL}
@@ -0,0 +1,2 @@
+{
+}
@@ -0,0 +1,2 @@
+node.environment=${PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT}
+node.id=${PRESTO_COORDINATOR_SERVICENAME}
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+
+set -eu
+set -o pipefail
+
+readonly PRESTO_CONFIG_DIR="/opt/presto-server/etc"
+
+# Substitute environment variables in config template
+find /configs -type f | while read -r f; do
+    (
+        echo "cat <<EOF"
+        cat "$f"
+        echo "EOF"
+    ) | sh >"${PRESTO_CONFIG_DIR}/$(basename "$f")"
+done
+
+# Remove existing catalog files that exist in the image and add the CLP catalog
+rm -f "${PRESTO_CONFIG_DIR}/catalog/"*
+mv "${PRESTO_CONFIG_DIR}/clp.properties" "${PRESTO_CONFIG_DIR}/catalog"
@@ -0,0 +1,48 @@
+services:
+  presto-coordinator:
+    image: "ghcr.io/y-scope/presto/coordinator:dev"
+    entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
+    env_file:
+      - ".env"
+      - "coordinator-common.env"
+      - "coordinator.env"
+    volumes:
+      - "./coordinator/config-template:/configs:ro"
+      - "./coordinator/scripts:/scripts:ro"
+      - "coordinator-config:/opt/presto-server/etc"
+    networks:
+      - "presto"
+    healthcheck:
+      test:
+        - "CMD"
+        - "curl"
+        - "-f"
+        - "${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI}/v1/info"
+      interval: "10s"
+      retries: 30
+
+  presto-worker:
+    image: "ghcr.io/y-scope/presto/prestissimo-worker:dev"
+    depends_on:
+      presto-coordinator:
+        condition: "service_healthy"
+    entrypoint: ["/bin/bash", "-c", "/scripts/generate-configs.sh && /opt/entrypoint.sh"]
+    env_file:
+      - ".env"
+      - "coordinator-common.env"
+      - "worker.env"
+    volumes:
+      - "./worker/config-template:/configs:ro"
+      - "./worker/scripts:/scripts:ro"
+      - "${CLP_ARCHIVES_DIR}:${CLP_ARCHIVES_DIR}"
+      - "worker-config:/opt/presto-server/etc"
+    networks:
+      - "presto"
+
+volumes:
+  coordinator-config:
+  worker-config:
+
+networks:
+  presto:
+    driver: "bridge"
@@ -0,0 +1 @@
+/.venv/
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1 @@
		com.facebook.presto=${PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL}
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		{
		}
Comment on lines +1 to +2 Copy link Contributor coderabbitai bot Jul 27, 2025 Choose a reason for hiding this comment The reason will be displayed to describe this comment to others. Learn more. 🧹 Nitpick (assertive) Provide a minimal example or document expected schema An empty JSON object is syntactically valid, but future maintainers may be unsure what keys are supported. A commented exemplar or pointer to docs beside this file would improve clarity without affecting runtime. 🤖 Prompt for AI Agents `In tools/deployment/presto-clp/coordinator/config-template/metadata-filter.json at lines 1 to 2, the JSON file is currently empty, which may confuse future maintainers about the expected keys. Add a minimal example JSON object with typical keys and values or include comments or a reference to documentation explaining the expected schema to improve clarity without impacting runtime behavior.` Comment on lines +1 to +2 Copy link Member kirkrodrigues Jul 28, 2025 Choose a reason for hiding this comment The reason will be displayed to describe this comment to others. Learn more. @anlowee I think we should explain to the user how to configure this for the timestamp field in their logs, right? And also that it may need to be different for each dataset they compress. Copy link Contributor Author anlowee Jul 28, 2025 Choose a reason for hiding this comment The reason will be displayed to describe this comment to others. Learn more. Can we direct them to the related presto-doc section? Copy link Member kirkrodrigues Jul 28, 2025 Choose a reason for hiding this comment The reason will be displayed to describe this comment to others. Learn more. That's not published and is more general than they need, right? We should write a simplified section for them here.
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		node.environment=${PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT}
		node.id=${PRESTO_COORDINATOR_SERVICENAME}