Skip to content

Conversation

wraymo
Copy link
Member

@wraymo wraymo commented Jul 9, 2025

Description

This PR introduces new user-defined functions (UDFs) for the CLP connector to improve querying of semi-structured logs:

  • CLP_GET_* functions for retrieving values from JSON paths, targeting specific CLP column types (ClpString, Integer, Float, etc.). These UDFs enable direct access to dynamic fields and return corresponding Presto native types.
  • CLP_WILDCARD_* functions that represent wildcard filters over all columns of a given CLP type. These take no arguments and are designed to be used in filter predicates to simplify querying across dynamic schemas.

Both sets of UDFs are integrated with the query rewriting layer. During query optimization, calls to these functions are rewritten into normal column references or KQL query column symbols, ensuring efficient execution with no additional parsing overhead.

The PR also updates the documentation to clearly describe the semantics and usage of these UDFs, along with examples demonstrating their use.

This PR fixes #29.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

All unit tests passed. End-to-end testing worked.

Summary by CodeRabbit

Summary by CodeRabbit

New Features

  • Introduced new CLP functions for querying semi-structured logs, including path-based and wildcard column functions.
  • Enhanced support for pushdown of filter and projection expressions involving CLP user-defined functions (UDFs).
  • Added support for context-aware processing of expressions and improved handling of dynamic or wide schemas.
  • Extended query plan optimization to handle CLP_GET functions in projections and filters.

Bug Fixes

  • Improved robustness when handling missing or empty metadata filter configuration paths.

Documentation

  • Added comprehensive documentation for new CLP functions, including usage notes and SQL examples.

Tests

  • Introduced new and updated tests to cover CLP UDFs, wildcard functions, and plan optimization scenarios.

Chores

  • Updated dependencies and internal configuration to support new features and testing.

Copy link

coderabbitai bot commented Jul 9, 2025

Walkthrough

This change introduces new CLP-specific scalar functions and extends the Presto CLP connector to support path-based and wildcard column functions for semi-structured log querying. It enhances the optimizer and filter-to-KQL converter to handle these UDFs, updates associated tests and documentation, and adds new dependency declarations.

Changes

File(s) Change Summary
presto-clp/pom.xml Added dependencies: presto-tests, presto-main-base (test scope), and presto-expressions (compile scope).
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java New class defining CLP path-based and wildcard column scalar functions for Presto SQL.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java Refactored to support CLP UDFs and wildcard functions with context-aware rewriting, enhanced visitor signatures, and new logic for pushdown and translation.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java Extended optimizer to support pushdown for CLP_GET functions in projections and filters, added handling for new plan shapes and residual predicates.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlugin.java Added getFunctions() to expose CLP functions via the plugin interface.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpMetadataFilterProvider.java Guard clause added to handle missing/empty metadata filter config path, avoiding unnecessary file reads.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java Added tests for CLP_GET and CLP_WILDCARD UDFs; updated pushdown test helpers to track and verify UDF variables.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpPlanOptimizer.java New test class verifying plan optimization for CLP_GET and wildcard function projections and filters.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java Registered CLP functions for tests; restructured RowType for schema columns.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/ClpMetadataDbSetUp.java Added getter for dbPath in static class DbHandle.
presto-docs/src/main/sphinx/connector/clp.rst Added documentation for CLP path-based and wildcard column functions, with usage notes and SQL examples.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PrestoSQL
    participant ClpPlugin
    participant ClpFunctions
    participant ClpPlanOptimizer
    participant ClpFilterToKqlConverter

    User->>PrestoSQL: Submit query with CLP_GET_* or CLP_WILDCARD_* functions
    PrestoSQL->>ClpPlugin: Load connector and functions
    PrestoSQL->>ClpPlanOptimizer: Optimize query plan
    ClpPlanOptimizer->>ClpFilterToKqlConverter: Rewrite and pushdown filters/projections
    ClpFilterToKqlConverter->>ClpFunctions: Map UDFs to variables/columns or KQL symbols
    ClpPlanOptimizer-->>PrestoSQL: Return optimized plan
    PrestoSQL-->>User: Execute and return results
Loading

Assessment against linked issues

Objective Addressed Explanation
Make the metadata filter optional. [#29] -

Possibly related PRs

  • y-scope/presto#16: Introduced the original ClpFilterToKqlConverter and initial filter pushdown optimizer; this PR builds directly on and extends that logic.
  • y-scope/presto#19: Also modifies ClpFilterToKqlConverter, focusing on metadata SQL filters and BETWEEN support; related through complementary filter enhancements.

Suggested reviewers

  • anlowee
  • kirkrodrigues

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5211850 and 725e810.

📒 Files selected for processing (1)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: test (17.0.13, :presto-main-base)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-local-queries)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-resource-manager)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-aggregation-queries)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-distributed-non-hash-gen)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-resource-manager)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-tpch-distributed-queries)
  • GitHub Check: test (17.0.13, :presto-tests -P presto-tests-general)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-distributed-queries)
  • GitHub Check: test (8.0.442, :presto-main)
  • GitHub Check: test (17.0.13, :presto-tests -P presto-tests-execution-memory)
  • GitHub Check: test (8.0.442, :presto-main-base)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-local-queries)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-distributed-non-hash-gen)
  • GitHub Check: test (8.0.442, :presto-tests -P presto-tests-execution-memory)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-aggregation-queries)
  • GitHub Check: test (8.0.442, :presto-tests -P presto-tests-general)
  • GitHub Check: test (:presto-docs)
  • GitHub Check: maven-checks (17.0.13)
  • GitHub Check: maven-checks (8.0.442)
🔇 Additional comments (3)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java (3)

33-72: Consistent implementation of CLP_GET_ functions.*

The CLP_GET_* functions are well-designed with consistent signatures accepting JSON path parameters and returning appropriate default values. The non-deterministic annotation is correctly applied since these get rewritten during optimization.


74-104: Well-implemented wildcard column functions.

The wildcard functions correctly take no parameters as intended and return appropriate default values for their respective types. The descriptions clearly explain their purpose in filter expressions.


33-104: Confirm Intentional Removal of CLP_GET_JSON_STRING

I performed a repository-wide search for any occurrences of the CLP_GET_JSON_STRING function and related identifiers, but found no results. Please verify whether this function was intentionally removed or if it still needs to be implemented:

  • Searched for “CLP_GET_JSON_STRING” across all files – no matches found
  • Searched for “CLP_GET_JSON” to catch any related identifiers – no matches found
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🔭 Outside diff range comments (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java (1)

870-901: Consider optimizing string building for non-pushdownable expressions.

The logic correctly handles partial pushdown, but unnecessarily builds query strings even when allPushedDown becomes false. Consider checking pushdownability first before building strings.

     private ClpExpression handleOr(SpecialFormExpression node, Set<VariableReferenceExpression> context)
     {
+        // First pass: check if all children can be pushed down
+        List<ClpExpression> childExpressions = new ArrayList<>();
+        for (RowExpression argument : node.getArguments()) {
+            ClpExpression expression = argument.accept(this, context);
+            if (expression.getRemainingExpression().isPresent() || !expression.getPushDownExpression().isPresent()) {
+                return new ClpExpression(node);
+            }
+            childExpressions.add(expression);
+        }
+        
+        // Second pass: build the query
         StringBuilder metadataQueryBuilder = new StringBuilder();
         metadataQueryBuilder.append("(");
         StringBuilder queryBuilder = new StringBuilder();
         queryBuilder.append("(");
-        boolean allPushedDown = true;
         boolean hasAllMetadataSql = true;
-        for (RowExpression argument : node.getArguments()) {
-            ClpExpression expression = argument.accept(this, context);
-            if (expression.getRemainingExpression().isPresent() || !expression.getPushDownExpression().isPresent()) {
-                allPushedDown = false;
-                continue;
-            }
+        for (ClpExpression expression : childExpressions) {
             queryBuilder.append(expression.getPushDownExpression().get());
             queryBuilder.append(" OR ");
             if (hasAllMetadataSql && expression.getMetadataSqlQuery().isPresent()) {
                 metadataQueryBuilder.append(expression.getMetadataSqlQuery().get());
                 metadataQueryBuilder.append(" OR ");
             }
             else {
                 hasAllMetadataSql = false;
             }
         }
-        if (allPushedDown) {
-            // Remove the last " OR " from the query
-            return new ClpExpression(
-                    queryBuilder.substring(0, queryBuilder.length() - 4) + ")",
-                    hasAllMetadataSql ? metadataQueryBuilder.substring(0, metadataQueryBuilder.length() - 4) + ")" : null);
-        }
-        return new ClpExpression(node);
+        // Remove the last " OR " from the query
+        return new ClpExpression(
+                queryBuilder.substring(0, queryBuilder.length() - 4) + ")",
+                hasAllMetadataSql ? metadataQueryBuilder.substring(0, metadataQueryBuilder.length() - 4) + ")" : null);
     }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c86114 and 5211850.

📒 Files selected for processing (11)
  • presto-clp/pom.xml (1 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java (25 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java (1 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpMetadataFilterProvider.java (1 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java (2 hunks)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlugin.java (1 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/ClpMetadataDbSetUp.java (1 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java (4 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpPlanOptimizer.java (1 hunks)
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java (3 hunks)
  • presto-docs/src/main/sphinx/connector/clp.rst (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlugin.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpMetadataFilterProvider.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpPlanOptimizer.java (1)
Learnt from: wraymo
PR: y-scope/presto#15
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/metadata/ClpMetadataProvider.java:22-33
Timestamp: 2025-06-13T12:56:06.325Z
Learning: `ClpMetadataProvider` is instantiated only once and used solely by the Presto coordinator, so concurrency/thread-safety guarantees are unnecessary.
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java (1)
Learnt from: anlowee
PR: y-scope/presto#0
File: :0-0
Timestamp: 2025-06-19T08:41:46.288Z
Learning: In the y-scope/presto repository, prefer using Guava's Immutable collections over regular collections when possible.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: test (17.0.13, :presto-main-base)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-distributed-non-hash-gen)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-plan-determinism)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-tpch-distributed-queries)
  • GitHub Check: test (17.0.13, :presto-tests -P ci-only-distributed-queries)
  • GitHub Check: test (17.0.13, :presto-tests -P presto-tests-general)
  • GitHub Check: test (8.0.442, :presto-main-base)
  • GitHub Check: test (17.0.13, :presto-tests -P presto-tests-execution-memory)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-aggregation-queries)
  • GitHub Check: test (8.0.442, :presto-main)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-resource-manager)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-plan-determinism)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-local-queries)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-distributed-queries)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-tpch-distributed-queries)
  • GitHub Check: test (8.0.442, :presto-tests -P presto-tests-execution-memory)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-distributed-non-hash-gen)
  • GitHub Check: test (:presto-docs)
  • GitHub Check: maven-checks (17.0.13)
  • GitHub Check: maven-checks (8.0.442)
🔇 Additional comments (10)
presto-clp/src/test/java/com/facebook/presto/plugin/clp/ClpMetadataDbSetUp.java (1)

230-233: LGTM! Clean getter implementation.

The getter method is well-implemented and follows standard Java conventions. This addition properly supports the enhanced testing infrastructure for the new CLP UDF functionality.

presto-clp/pom.xml (1)

143-159: LGTM! Dependencies appropriately support the new CLP UDF functionality.

The added dependencies are well-chosen and properly scoped:

  • presto-tests and presto-main-base (test-jar) are correctly test-scoped
  • presto-expressions is needed for the new expression rewriting capabilities

These additions align perfectly with the PR objectives to add CLP UDFs with query rewriting support.

presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlugin.java (1)

26-30: LGTM! Proper function registration implementation.

The getFunctions() method correctly implements the standard Presto plugin pattern for registering scalar functions. The use of ImmutableSet and proper return type ensures the CLP UDFs are properly exposed to the Presto runtime.

presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpMetadataFilterProvider.java (1)

87-91: LGTM! Excellent defensive programming.

The guard clause properly handles null/empty configuration paths by returning an empty filter map early. This prevents unnecessary file I/O operations and provides graceful fallback behaviour when metadata filter configuration is not provided.

presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpQueryBase.java (2)

67-69: LGTM! Proper function registration for testing.

The static block correctly registers the CLP UDFs using FunctionExtractor.extractFunctions(), ensuring these functions are available during test execution. This is the standard approach for testing custom functions in Presto.


84-87: LGTM! Table structure updates align with schema enhancements.

The reordering and simplification of the nested RowType fields appears to align with the enhanced plan optimization and function pushdown logic introduced in this PR. The structure remains logically consistent.

presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpPlanOptimizer.java (1)

1-309: Well-structured test implementation

The test class provides comprehensive coverage for the CLP plan optimizer with proper setup/teardown and clear test scenarios covering projection and filter pushdown cases.

presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java (1)

238-240: Use parameterized logging for better security

Direct string concatenation in log statements could be a security concern if the data contains sensitive information.

-                metadataSqlQuery = Optional.of(metadataFilterProvider.remapFilterSql(scope, metadataSqlQuery.get()));
-                log.debug("Metadata SQL query: %s", metadataSqlQuery);
+                String remappedQuery = metadataFilterProvider.remapFilterSql(scope, metadataSqlQuery.get());
+                metadataSqlQuery = Optional.of(remappedQuery);
+                log.debug("Metadata SQL query: %s", remappedQuery);
⛔ Skipped due to learnings
Learnt from: anlowee
PR: y-scope/presto#0
File: :0-0
Timestamp: 2025-06-19T08:41:46.288Z
Learning: In the y-scope/presto repository, prefer static imports for commonly used methods like String.format(). Instead of writing `String.format("%s", value)`, use static import `import static java.lang.String.format;` and call `format("%s", value)` directly.
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java (1)

272-308: Comprehensive test coverage for CLP UDFs

Excellent test coverage for both CLP_GET_* and CLP_WILDCARD_* functions, including edge cases and combinations with other predicates.

presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java (1)

739-796: Correct implementation of CLP wildcard interpretation.

The method properly handles CLP_WILDCARD UDFs with appropriate validation and error handling. Good use of the Optional pattern and descriptive error messages.


.. code-block:: sql
SELECT CLP_GET_STRING(msg.author) AS author
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add quotes to the JSON path argument

According to line 335, JSON paths must be constant string literals. The example should use quotes.

-    SELECT CLP_GET_STRING(msg.author) AS author
+    SELECT CLP_GET_STRING('msg.author') AS author
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SELECT CLP_GET_STRING(msg.author) AS author
SELECT CLP_GET_STRING('msg.author') AS author
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 343, the JSON path
argument in the function call lacks quotes, which is required as per line 335
stating JSON paths must be constant string literals. Fix this by adding quotes
around the JSON path argument in the SELECT statement to ensure it is a valid
string literal.

Comment on lines +347 to +349
SELECT CLP_GET_ARRAY(msg.tags) AS tags
FROM clp.default.table_2
WHERE CLP_GET_BOOL('msg.is_active') = true;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix function name in the example

The function CLP_GET_ARRAY doesn't exist. It should be CLP_GET_STRING_ARRAY.

-    SELECT CLP_GET_ARRAY(msg.tags) AS tags
+    SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SELECT CLP_GET_ARRAY(msg.tags) AS tags
FROM clp.default.table_2
WHERE CLP_GET_BOOL('msg.is_active') = true;
SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags
FROM clp.default.table_2
WHERE CLP_GET_BOOL('msg.is_active') = true;
🤖 Prompt for AI Agents
In the presto-docs/src/main/sphinx/connector/clp.rst file, around lines 347 to
349, the example query uses the incorrect function name `CLP_GET_ARRAY`. This
function does not exist. Instead, the correct function to use is
`CLP_GET_STRING_ARRAY` to retrieve an array of strings from the `msg.tags`
field. Update the query to use the correct `CLP_GET_STRING_ARRAY` function.

Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``.

.. function:: CLP_GET_FLOAT(varchar) -> double

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix the function description for CLP_GET_FLOAT

The description states "Returns the boolean value" but the function returns a DOUBLE, not a boolean.

-    Returns the boolean value of the given JSON path, where the column type is ``Float``.  Returns a Presto ``DOUBLE``.
+    Returns the float value of the given JSON path, where the column type is ``Float``. Returns a Presto ``DOUBLE``.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Returns the float value of the given JSON path, where the column type is ``Float``. Returns a Presto ``DOUBLE``.
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 320, the function
description for CLP_GET_FLOAT incorrectly states that it returns a boolean
value. Update the description to correctly state that the function returns a
DOUBLE value instead of a boolean.


.. function:: CLP_GET_INT(varchar) -> bigint

Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix the function description for CLP_GET_INT

The description states "Returns the string value" but the function returns a BIGINT, not a string.

-    Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``.
+    Returns the integer value of the given JSON path, where the column type is ``Integer``. Returns a Presto ``BIGINT``.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Returns the string value of the given JSON path, where the column type is ``Integer``, Returns a Presto ``BIGINT``.
Returns the integer value of the given JSON path, where the column type is ``Integer``. Returns a Presto ``BIGINT``.
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 317, the function
description for CLP_GET_INT incorrectly states it returns a string value. Update
the description to correctly state that the function returns a Presto BIGINT
value instead of a string.


.. function:: CLP_GET_BOOL(varchar) -> boolean

Returns the double value of the given JSON path, where the column type is ``Boolean``. Returns a Presto
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix the function description for CLP_GET_BOOL

The description states "Returns the double value" but the function returns a BOOLEAN, not a double.

-    Returns the double value of the given JSON path, where the column type is ``Boolean``. Returns a Presto
+    Returns the boolean value of the given JSON path, where the column type is ``Boolean``. Returns a Presto
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Returns the double value of the given JSON path, where the column type is ``Boolean``. Returns a Presto
Returns the boolean value of the given JSON path, where the column type is ``Boolean``. Returns a Presto
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst at line 325, the function
description incorrectly states that CLP_GET_BOOL returns a double value, but it
actually returns a BOOLEAN. Update the description to accurately reflect that
the function returns a BOOLEAN value instead of a double.

Comment on lines 106 to 112
@ScalarFunction(value = "CLP_GET_JSON_STRING", deterministic = false)
@Description("Converts an entire log record into a JSON string.")
@SqlType(StandardTypes.VARCHAR)
public static Slice clpGetJSONString()
{
return Slices.EMPTY_SLICE;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Consider adding a parameter to CLP_GET_JSON_STRING

Unlike other CLP_GET functions, this one takes no parameters. Should it accept a JSON path parameter for consistency, or is it intended to return the entire record?


🌐 Web query:

What is the intended behavior of CLP_GET_JSON_STRING function in Presto CLP connector - should it take a JSON path parameter or return the entire log record?

💡 Result:

The intended behavior of the CLP_GET_JSON_STRING function in the Presto CLP connector is not documented in detail within the provided search results. However, based on best practices for similar JSON extraction functions in SQL engines and the available context around JSON path extraction and CLP encoding, the following evidence-supported conclusions can be drawn:

  • The typical approach for JSON extraction functions (such as JSONPATH in Apache Pinot or field mapping in the Presto Kafka connector) is to allow the user to specify a JSON path or field mapping[1][6][8]. The path points to the element or elements you want to extract from the JSON record rather than always returning the entire record.
  • For instance, the Presto Kafka connector's JSON decoder lets users map columns to “slash-separated list of field names to select a field from the JSON object”, which serves as a path to the desired property[1].
  • Similarly, other data platforms (like Pinot) provide functions such as JSONPATH(jsonField, 'jsonPath') where the second argument is a path expression that targets a specific field in the JSON structure[6].

Extracts the object value from jsonField based on 'jsonPath', the result type is inferred based on JSON value[6].

  • The CLP-specific stream ingestion feature in Pinot also makes it clear that fields are individually targeted and encoded/extracted based on configuration, not by returning the whole JSON record by default[8].

Given these conventions, the intended behavior of a function named CLP_GET_JSON_STRING in Presto’s CLP connector is most likely to accept a JSON path parameter and extract (or return) the value at that path from the log record—not just return the entire JSON log record. Requiring a path parameter also aligns with SQL and analytic platform norms for working with nested or semi-structured data.

If CLP_GET_JSON_STRING were designed to return the whole log record, there would be little benefit in calling it specifically, as fetching the entire record typically does not require a specialized function or parameterization.

In summary:
CLP_GET_JSON_STRING should be expected to take a JSON path parameter and return only the requested field or value, not the entire log record, following standard design patterns for JSON extraction in Presto and related systems[1][6][8]. If official documentation becomes available, it should be referenced for precise usage.

Citations:


Add a JSON path parameter to CLP_GET_JSON_STRING for consistency

Based on common SQL connector patterns and Presto’s own JSON extraction functions, CLP_GET_JSON_STRING should accept a JSON-path argument rather than always returning the full record.

Please update accordingly:

• File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java
Lines: 106–112

• Change the signature from:

public static Slice clpGetJSONString()

to:

public static Slice clpGetJSONString(
    @SqlType(StandardTypes.VARCHAR) Slice jsonPath)

• Adjust the annotations:

@ScalarFunction(value = "CLP_GET_JSON_STRING", deterministic = false)
@Description("Extracts the JSON string at the given path from a log record.")
@SqlType(StandardTypes.VARCHAR)

• Implement lookup of the specified path inside the record (or throw a clear error if the path is invalid).

This aligns with Presto’s Kafka connector (JSONPATH(jsonField, 'path')) and other CLP-based extractors.

🤖 Prompt for AI Agents
In presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java
around lines 106 to 112, update the clpGetJSONString method to accept a Slice
parameter annotated with @SqlType(StandardTypes.VARCHAR) representing the JSON
path. Modify the annotations to reflect that the function extracts the JSON
string at the given path from a log record. Implement logic inside the method to
parse the log record and extract the JSON value at the specified path, throwing
a clear error if the path is invalid or not found.

Comment on lines +156 to +157
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
"Unsupported plan shape for CLP pushdown: " + childNode.getClass().getSimpleName());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Improve error message for unsupported plan shapes

The error message could be more descriptive about which plan shapes are supported.

-            throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
-                    "Unsupported plan shape for CLP pushdown: " + childNode.getClass().getSimpleName());
+            throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
+                    String.format("Unsupported plan shape for CLP pushdown: %s. Only Project->TableScan and Project->Filter->TableScan are supported.",
+                            childNode.getClass().getSimpleName()));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
"Unsupported plan shape for CLP pushdown: " + childNode.getClass().getSimpleName());
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
String.format("Unsupported plan shape for CLP pushdown: %s. Only Project->TableScan and Project->Filter->TableScan are supported.",
childNode.getClass().getSimpleName()));
🤖 Prompt for AI Agents
In presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpPlanOptimizer.java
at lines 156-157, the error message for unsupported plan shapes is vague.
Enhance the message by listing or describing the supported plan shapes to
provide clearer guidance on what is expected, improving the clarity of the
exception thrown.

this.standardFunctionResolution = requireNonNull(standardFunctionResolution, "standardFunctionResolution is null");
this.functionMetadataManager = requireNonNull(functionMetadataManager, "function metadata manager is null");
this.assignments = requireNonNull(assignments, "assignments is null");
this.assignments = new HashMap<>(requireNonNull(assignments, "assignments is null"));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider using ImmutableMap.Builder for better immutability management.

Creating a mutable HashMap copy allows modifications but doesn't align with the repository's preference for Guava's Immutable collections. Since modifications are needed for CLP UDF processing, consider using ImmutableMap.Builder to build the final map after all transformations.

-        this.assignments = new HashMap<>(requireNonNull(assignments, "assignments is null"));
+        // Use a builder pattern to maintain immutability preference while allowing modifications
+        ImmutableMap.Builder<VariableReferenceExpression, ColumnHandle> builder = ImmutableMap.builder();
+        builder.putAll(requireNonNull(assignments, "assignments is null"));
+        // Store builder for use during processing, then build final immutable map when needed
+        this.assignmentsBuilder = builder;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java
at line 118, replace the mutable HashMap copy of assignments with an
ImmutableMap.Builder to construct the map immutably. Initialize an
ImmutableMap.Builder, add all entries from the input assignments, perform any
necessary modifications on the builder, and then build the final ImmutableMap to
assign to this.assignments, ensuring immutability while allowing controlled
modifications.

Comment on lines +210 to +216
Object handle = assignments.get(variable);
if (!(handle instanceof ClpColumnHandle)) {
return null;
}
return ((ClpColumnHandle) handle).getOriginalColumnName();
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid returning null to prevent potential NullPointerException.

The method returns null when the handle is not a ClpColumnHandle, which could lead to NPEs or unexpected "null" string concatenation in callers. Consider returning Optional<String> or throwing an exception for invalid handles.

-    private String getVariableName(VariableReferenceExpression variable)
+    private Optional<String> getVariableName(VariableReferenceExpression variable)
     {
         Object handle = assignments.get(variable);
         if (!(handle instanceof ClpColumnHandle)) {
-            return null;
+            return Optional.empty();
         }
-        return ((ClpColumnHandle) handle).getOriginalColumnName();
+        return Optional.of(((ClpColumnHandle) handle).getOriginalColumnName());
     }

Then update callers to handle the Optional appropriately:

-    return new ClpExpression(getVariableName(node));
+    return getVariableName(node)
+            .map(ClpExpression::new)
+            .orElse(new ClpExpression(node));
🤖 Prompt for AI Agents
In the file
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java
around lines 210 to 216, the getOriginalColumnName() method returns null when
the handle is not a ClpColumnHandle. This could lead to potential
NullPointerExceptions or unexpected "null" string concatenation in the calling
code. Refactor the method to return an Optional<String> instead of null, and
update the callers to handle the Optional appropriately. This will provide
better error handling and prevent null-related issues.

Comment on lines +217 to 281
/**
* Rewrites CLP UDFs (e.g., CLP_GET_*) in a RowExpression tree into VariableReferenceExpressions,
* enabling them to be used as pushdown filters.
* <p></p>
* Traverses the expression tree recursively, replacing supported CLP UDFs with uniquely named
* variables and tracking these variables in the assignments and context. If the CLP UDF takes
* a constant string argument, that string is used as the new variable name. Unsupported
* argument types (non-constants) or invalid expressions will throw an exception.
* <p></p>
* Examples:
* <ul>
* <li><code>CLP_GET_STRING('field')</code> → <code>field</code> (as a VariableReferenceExpression)</li>
* <li><code>CLP_GET_INT('field')</code> → <code>field</code> (not mapped to a KQL column)</li>
* </ul>
*
* @param rowExpression the input expression to analyze and possibly rewrite
* @param context a set of VariableReferenceExpressions used for pushdown; newly created ones
* will be added here
* @return a possibly rewritten RowExpression with CLP UDFs replaced
*/
private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context)
{
requireNonNull(context);
if (!(rowExpression instanceof CallExpression)) {
return rowExpression;
}

CallExpression callExpression = (CallExpression) rowExpression;

// Recursively process the arguments of this CallExpression
List<RowExpression> newArgs = callExpression.getArguments().stream()
.map(childArg -> maybeReplaceClpUdfArgument(childArg, context))
.collect(ImmutableList.toImmutableList());

FunctionMetadata metadata = functionMetadataManager.getFunctionMetadata(callExpression.getFunctionHandle());
String functionName = metadata.getName().getObjectName().toUpperCase();

if (functionName.startsWith("CLP_GET")) {
// Replace CLP UDF with VariableReferenceExpression
int numArguments = callExpression.getArguments().size();
if (numArguments == 1) {
RowExpression argument = callExpression.getArguments().get(0);
if (!(argument instanceof ConstantExpression)) {
throw new PrestoException(
CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
"The argument of " + functionName + " must be a ConstantExpression");
}
Optional<String> definition = argument.accept(this, context).getPushDownExpression();
if (definition.isPresent()) {
VariableReferenceExpression newVar = new VariableReferenceExpression(
Optional.empty(),
definition.get(),
callExpression.getType());
context.add(newVar);
assignments.put(newVar, new ClpColumnHandle(definition.get(), callExpression.getType(), true));
return newVar;
}
else {
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, "Unrecognized parameter in " + functionName);
}
}
}

return new CallExpression(callExpression.getDisplayName(), callExpression.getFunctionHandle(), callExpression.getType(), newArgs);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Well-implemented CLP UDF rewriting logic.

The method correctly handles recursive rewriting of CLP UDFs with proper error handling and validation. Good use of ImmutableList for collecting transformed arguments.

Consider extracting the magic string "CLP_GET" as a constant for better maintainability:

+    private static final String CLP_GET_PREFIX = "CLP_GET";
+
     private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context)
     {
         // ... existing code ...
-        if (functionName.startsWith("CLP_GET")) {
+        if (functionName.startsWith(CLP_GET_PREFIX)) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/**
* Rewrites CLP UDFs (e.g., CLP_GET_*) in a RowExpression tree into VariableReferenceExpressions,
* enabling them to be used as pushdown filters.
* <p></p>
* Traverses the expression tree recursively, replacing supported CLP UDFs with uniquely named
* variables and tracking these variables in the assignments and context. If the CLP UDF takes
* a constant string argument, that string is used as the new variable name. Unsupported
* argument types (non-constants) or invalid expressions will throw an exception.
* <p></p>
* Examples:
* <ul>
* <li><code>CLP_GET_STRING('field')</code> → <code>field</code> (as a VariableReferenceExpression)</li>
* <li><code>CLP_GET_INT('field')</code> → <code>field</code> (not mapped to a KQL column)</li>
* </ul>
*
* @param rowExpression the input expression to analyze and possibly rewrite
* @param context a set of VariableReferenceExpressions used for pushdown; newly created ones
* will be added here
* @return a possibly rewritten RowExpression with CLP UDFs replaced
*/
private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context)
{
requireNonNull(context);
if (!(rowExpression instanceof CallExpression)) {
return rowExpression;
}
CallExpression callExpression = (CallExpression) rowExpression;
// Recursively process the arguments of this CallExpression
List<RowExpression> newArgs = callExpression.getArguments().stream()
.map(childArg -> maybeReplaceClpUdfArgument(childArg, context))
.collect(ImmutableList.toImmutableList());
FunctionMetadata metadata = functionMetadataManager.getFunctionMetadata(callExpression.getFunctionHandle());
String functionName = metadata.getName().getObjectName().toUpperCase();
if (functionName.startsWith("CLP_GET")) {
// Replace CLP UDF with VariableReferenceExpression
int numArguments = callExpression.getArguments().size();
if (numArguments == 1) {
RowExpression argument = callExpression.getArguments().get(0);
if (!(argument instanceof ConstantExpression)) {
throw new PrestoException(
CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION,
"The argument of " + functionName + " must be a ConstantExpression");
}
Optional<String> definition = argument.accept(this, context).getPushDownExpression();
if (definition.isPresent()) {
VariableReferenceExpression newVar = new VariableReferenceExpression(
Optional.empty(),
definition.get(),
callExpression.getType());
context.add(newVar);
assignments.put(newVar, new ClpColumnHandle(definition.get(), callExpression.getType(), true));
return newVar;
}
else {
throw new PrestoException(CLP_PUSHDOWN_UNSUPPORTED_EXPRESSION, "Unrecognized parameter in " + functionName);
}
}
}
return new CallExpression(callExpression.getDisplayName(), callExpression.getFunctionHandle(), callExpression.getType(), newArgs);
}
// extract magic prefix into a constant
+ private static final String CLP_GET_PREFIX = "CLP_GET";
private RowExpression maybeReplaceClpUdfArgument(RowExpression rowExpression, Set<VariableReferenceExpression> context)
{
requireNonNull(context);
if (!(rowExpression instanceof CallExpression)) {
return rowExpression;
}
CallExpression callExpression = (CallExpression) rowExpression;
// Recursively process the arguments of this CallExpression
List<RowExpression> newArgs = callExpression.getArguments().stream()
.map(childArg -> maybeReplaceClpUdfArgument(childArg, context))
.collect(ImmutableList.toImmutableList());
FunctionMetadata metadata = functionMetadataManager.getFunctionMetadata(callExpression.getFunctionHandle());
String functionName = metadata.getName().getObjectName().toUpperCase();
- if (functionName.startsWith("CLP_GET")) {
+ if (functionName.startsWith(CLP_GET_PREFIX)) {
// Replace CLP UDF with VariableReferenceExpression
int numArguments = callExpression.getArguments().size();
if (numArguments == 1) {
RowExpression argument = callExpression.getArguments().get(0);
🤖 Prompt for AI Agents
In
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFilterToKqlConverter.java
between lines 217 and 281, the string "CLP_GET" is used directly in the code to
identify CLP UDF functions. Extract this magic string into a private static
final constant at the class level, then replace all occurrences of the literal
"CLP_GET" with this constant to improve maintainability and avoid duplication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make the metadata filter optional.

1 participant