120 changes: 120 additions & 0 deletions presto-docs/src/main/sphinx/connector/clp.rst
@@ -306,6 +306,126 @@ Each JSON log maps to this unified ``ROW`` type, with absent fields represented
``status``, ``thread_num``, ``backtrace``) become fields within the ``ROW``, clearly reflecting the nested and varying
structures of the original JSON logs.

CLP Functions
-------------

In semi-structured logs, the number of potential keys can grow significantly, resulting in extremely wide Presto tables
with many columns. To manage this complexity, the metadata provider may expose only a subset of the full schema,
typically the static fields or those most relevant to expected queries.

To enable access to dynamic or less common fields not present in the exposed schema, CLP provides two sets of functions
that let users query flexible log schemas while keeping the table metadata definition concise. These functions are only
available in the CLP connector and are not part of standard Presto SQL.

- JSON path functions (e.g., ``CLP_GET_STRING``)
- Wildcard column matching functions for use in filter predicates (e.g., ``CLP_WILDCARD_STRING_COLUMN``)

There is **no performance penalty** for using these functions. During query optimization, they are rewritten into
references to actual schema-backed columns or valid symbols in KQL queries. This avoids additional parsing overhead and
delivers performance comparable to querying standard columns.

Path-Based Functions
^^^^^^^^^^^^^^^^^^^^

.. function:: CLP_GET_STRING(varchar) -> varchar

Returns the string value of the given JSON path, where the column type is one of: ``ClpString``, ``VarString``, or
``DateString``. Returns a Presto ``VARCHAR``.

.. function:: CLP_GET_BIGINT(varchar) -> bigint

Returns the integer value of the given JSON path, where the column type is ``Integer``. Returns a Presto ``BIGINT``.

.. function:: CLP_GET_DOUBLE(varchar) -> double

Returns the double value of the given JSON path, where the column type is ``Float``. Returns a Presto ``DOUBLE``.

.. function:: CLP_GET_BOOL(varchar) -> boolean

Returns the double value of the given JSON path, where the column type is ``Boolean``. Returns a Presto ``BOOLEAN``.

Comment on lines +343 to +346

⚠️ Potential issue | 🟡 Minor

Fix return-type description

The doc for CLP_GET_BOOL says it returns “the double value,” which is a copy-paste typo. Update the text to state it returns the boolean value.

🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst around lines 343 to 346, update
the return-type description for CLP_GET_BOOL which currently reads "Returns the
double value..." to instead say it returns the boolean value; change the phrase
to "Returns the boolean value of the given JSON path, where the column type is
``Boolean``. Returns a Presto ``BOOLEAN``." to fix the copy-paste typo.

.. function:: CLP_GET_STRING_ARRAY(varchar) -> array(varchar)

Returns the array value of the given JSON path, where the column type is ``UnstructuredArray``, with each element
converted to a string. Returns a Presto ``ARRAY(VARCHAR)``.

.. note::

- JSON paths must be **constant string literals**; variables are not supported.
- Wildcards (e.g., ``msg.*.ts``) are **not supported**.
- If a path is invalid or missing, the function returns ``NULL`` rather than raising an error.
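
The ``NULL`` behavior described in the note can be handled with ordinary Presto expressions. The following is a minimal
sketch, not a verified query: the paths ``msg.region`` and ``msg.latency`` are hypothetical, and it assumes that
path-based functions may be wrapped in standard functions such as ``COALESCE``, which this page does not state
explicitly.

.. code-block:: sql

    -- Hypothetical paths: rows that lack 'msg.region' yield NULL,
    -- which COALESCE replaces with a placeholder value.
    SELECT COALESCE(CLP_GET_STRING('msg.region'), 'unknown') AS region
    FROM clp.default.table_1
    -- Keep only rows where the (hypothetical) float field is present.
    WHERE CLP_GET_DOUBLE('msg.latency') IS NOT NULL;

Each path is passed as a quoted string literal, as the note requires.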

Examples:

.. code-block:: sql

SELECT CLP_GET_STRING(msg.author) AS author
FROM clp.default.table_1
WHERE CLP_GET_INT('msg.timestamp') > 1620000000;

Comment on lines +362 to +365

⚠️ Potential issue | 🟠 Major

Correct function name in example

CLP_GET_INT is not documented (the integer accessor you describe is CLP_GET_BIGINT). Please update the example to call the published function to avoid sending users on a wild goose chase.

🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst around lines 362 to 365, the
example uses a non-existent function name CLP_GET_INT; replace it with the
documented integer accessor CLP_GET_BIGINT so the example matches the published
API and will work for users; update the WHERE clause to call
CLP_GET_BIGINT('msg.timestamp') and keep the rest of the example unchanged.

SELECT CLP_GET_STRING_ARRAY(msg.tags) AS tags
FROM clp.default.table_2
WHERE CLP_GET_BOOL('msg.is_active') = true;
Comment on lines +362 to +368

⚠️ Potential issue | 🟠 Major

Fix JSON path examples to match literal requirement

Both sample queries pass msg.author, msg.timestamp, and msg.tags without quotes, yet the note just above states JSON paths must be constant string literals. Update the examples to use quoted paths (e.g., CLP_GET_STRING('msg.author')) so readers aren’t misled into syntax errors.

🤖 Prompt for AI Agents
presto-docs/src/main/sphinx/connector/clp.rst around lines 362 to 368: the
example queries use unquoted JSON path identifiers (msg.author, msg.timestamp,
msg.tags) which contradict the note that JSON paths must be constant string
literals; update the examples to pass quoted string literals (e.g.,
CLP_GET_STRING('msg.author'), CLP_GET_INT('msg.timestamp'),
CLP_GET_STRING_ARRAY('msg.tags')) and similarly quote the boolean path
(CLP_GET_BOOL('msg.is_active')) so the examples match the documented literal
requirement.



Wildcard Column Functions
^^^^^^^^^^^^^^^^^^^^^^^^^

These functions are used to apply filter predicates across all columns of a certain type. They are useful for searching
across unknown or dynamic schemas without specifying exact column names. Similar to the path-based functions, these
functions are rewritten during query optimization to a KQL query that matches the appropriate columns.

.. function:: CLP_WILDCARD_STRING_COLUMN() -> varchar

Represents all columns of CLP types: ``ClpString``, ``VarString``, and ``DateString``.

.. function:: CLP_WILDCARD_INT_COLUMN() -> bigint

Represents all columns of CLP type: ``Integer``.

.. function:: CLP_WILDCARD_FLOAT_COLUMN() -> double

Represents all columns of CLP type: ``Float``.

.. function:: CLP_WILDCARD_BOOL_COLUMN() -> boolean

Represents all columns of CLP type: ``Boolean``.

.. note::

- They must appear **only in filter conditions** (``WHERE`` clause). They cannot be selected or passed as arguments
to other functions.
- Supported operators include:

::

= (EQUAL)
!= (NOT_EQUAL)
< (LESS_THAN)
<= (LESS_THAN_OR_EQUAL)
> (GREATER_THAN)
>= (GREATER_THAN_OR_EQUAL)
LIKE
BETWEEN
IN

Use of other operators (e.g., arithmetic or function calls) with wildcard functions is not allowed and will result
in a query error.

Examples:

.. code-block:: sql

-- Matches if any string column contains "Beijing"
SELECT *
FROM clp.default.table_1
WHERE CLP_WILDCARD_STRING_COLUMN() = 'Beijing';

-- Matches if any integer column equals 1
SELECT *
FROM clp.default.table_2
WHERE CLP_WILDCARD_INT_COLUMN() = 1;
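
The other operators listed in the note can be used in the same way. The queries below are a sketch only: the table names
are reused from the examples above, the literal values are illustrative assumptions, and how each predicate is
translated into KQL is not specified on this page.

.. code-block:: sql

    -- Pattern match against any string column (illustrative pattern)
    SELECT *
    FROM clp.default.table_1
    WHERE CLP_WILDCARD_STRING_COLUMN() LIKE '%timeout%';

    -- Range check against any integer column (illustrative bounds)
    SELECT *
    FROM clp.default.table_2
    WHERE CLP_WILDCARD_INT_COLUMN() BETWEEN 500 AND 599;

    -- Membership test against any float column (illustrative values)
    SELECT *
    FROM clp.default.table_2
    WHERE CLP_WILDCARD_FLOAT_COLUMN() IN (0.5, 1.0);

In each query the wildcard function appears only in the ``WHERE`` clause and only with a listed operator, as the note
requires.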

***********
SQL support
***********