Skip to content

Conversation

wraymo
Copy link
Member

@wraymo wraymo commented Oct 18, 2025

Description

This PR introduces the clp_get_json_string UDF, which is rewritten into a special column named __json_string during UDF rewriting. With Velox support, this UDF enables fetching a serialized JSON string representation of a row.

Note: clp_get_json_string is currently supported only in projection fields.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Did an end-to-end testing with a sample dataset

presto:default> select timestamp, clp_get_json_string() from default where clp_wildcard_string_column() in ('profile_update', 'logout');
      timestamp       |                                                                          _col1                                                                           
----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------
 2025-05-13T15:00:20Z | {"timestamp":"2025-05-13T15:00:20Z","level":"INFO","service":"user-service","event":{"action":"logout"},"user":{"id":"12345"}}                           
 2025-05-13T15:00:06Z | {"timestamp":"2025-05-13T15:00:06Z","level":"INFO","service":"user-service","event":{"action":"profile_update"},"user":{"id":"67890","name":"Jane Doe"}} 
 2025-05-13T15:00:20Z | {"timestamp":"2025-05-13T15:00:20Z","level":"INFO","service":"user-service","event":{"action":"logout"},"user":{"id":"12345"}}                           
 2025-05-13T15:00:06Z | {"timestamp":"2025-05-13T15:00:06Z","level":"INFO","service":"user-service","event":{"action":"profile_update"},"user":{"id":"67890","name":"Jane Doe"}} 
(4 rows)

Query 20251018_182705_00001_ikv76, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
[Latency: client-side: 225ms, server-side: 191ms] [6 rows, 0B] [31 rows/s, 0B/s]

Summary by CodeRabbit

Release Notes

  • New Features

    • Added CLP_GET_JSON_STRING scalar function to extract JSON string data from CLP-compressed columns in Presto queries.
  • Improvements

    • Enhanced query optimization logic for CLP functions to improve performance of JSON string retrieval operations.
  • Tests

    • Added and updated comprehensive tests for CLP function handling and various query optimization scenarios.

Copy link

coderabbitai bot commented Oct 18, 2025

Walkthrough

Adds a new CLP_GET_JSON_STRING scalar function to the Presto CLP plugin and extends the UDF rewriter to handle it specially within projection nodes by generating placeholder variables and maintaining column mappings.

Changes

Cohort / File(s) Summary
New Function Definition
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpFunctions.java
Added new public scalar function clpGetJSONString() with @ScalarFunction annotation, returning VARCHAR (Slice). Body throws UnsupportedOperationException.
UDF Rewriting Logic
presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpUdfRewriter.java
Added JSON_STRING_PLACEHOLDER constant. Extended rewriteClpUdfs() with new inProjectNode boolean parameter. Implemented special handling for CLP_GET_JSON_STRING when inside projection nodes, generating placeholder variables and maintaining synthetic column mappings. Parameter is propagated through recursive calls and passed as false for filter predicates.
Test Coverage
presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpUdfRewriter.java
Renamed three test methods for consistency: testScanFiltertestClpGetScanFilter, testScanProjecttestClpGetScanProject, testScanProjectFiltertestClpGetScanProjectFilter. Added new test testClpGetJsonString() to verify CLP_GET_JSON_STRING() rewriting in projection contexts.

Sequence Diagram

sequenceDiagram
    participant Query as Presto Query
    participant Rewriter as ClpUdfRewriter
    participant Scan as TableScanNode
    participant Proj as ProjectNode
    
    Query->>Rewriter: rewriteClpUdfs(plan, inProjectNode=false)
    
    alt In ProjectNode
        Rewriter->>Proj: Extract project assignments
        Note over Rewriter: inProjectNode = true
        Rewriter->>Rewriter: Process CLP_GET_JSON_STRING()
        Rewriter->>Rewriter: Generate placeholder variable
        Rewriter->>Scan: Map synthetic ClpColumnHandle
        Rewriter-->>Query: Updated plan with placeholder projection
    else In FilterNode
        Note over Rewriter: inProjectNode = false
        Rewriter->>Rewriter: Standard CLP_GET_* handling
        Rewriter-->>Query: Rewritten filter predicate
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description Check ⚠️ Warning The PR description provides a clear explanation of what is being introduced (the clp_get_json_string UDF and its rewriting mechanism) and includes validation with an end-to-end test example. However, the description is missing or incompletely addresses several required template sections: the "Motivation and Context" section does not explain why this UDF is needed or what problem it solves (merely noting Velox support as context), the "Impact" section is absent as a distinct section, the "Test Plan" section is replaced with "Validation performed" (different naming), and critically, the "Release Notes" section is entirely missing—the template explicitly requires either release notes content or a "NO RELEASE NOTE" statement. These omissions represent significant gaps against the specified template requirements. Add a properly structured "Motivation and Context" section that explains the underlying problem this UDF solves and references any related GitHub issues. Add an explicit "Impact" section describing public API changes and any performance implications. Rename "Validation performed" to "Test Plan" to match the template. Most importantly, add a "Release Notes" section following the template guidelines—either include properly formatted release notes or explicitly state "== NO RELEASE NOTE ==" to indicate that none are required for this change.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title "feat: Add clp_get_json_string UDF" is fully related to the main changes in the pull request. The changeset introduces a new public scalar function CLP_GET_JSON_STRING in the ClpFunctions class, extends the UDF rewriter to handle this function, and adds corresponding tests. The title is specific, concise, and follows Conventional Commits format, making it clear that a new UDF feature is being added. A developer scanning the history would understand the primary change without ambiguity.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@wraymo wraymo marked this pull request as ready for review October 18, 2025 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant