Skip to content

Conversation

wraymo
Copy link
Member

@wraymo wraymo commented Oct 17, 2025

Description

This PR adds support for CLP JSON string serialization for archives, enabling the clp_get_json_string UDF to return serialized JSON strings in Presto.

Specifically:

  1. Introduces __json_string to represent columns that require serialized JSON output.
  2. Excludes such columns from the projected columns list while recording their column indices for later processing.
  3. Adds a dedicated ClpArchiveJsonStringVectorLoader to initialize the JSON serializer and perform on-demand serialization when creating vectors for these columns.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Did an end-to-end testing with a sample dataset

presto:default> select timestamp, clp_get_json_string() from default where clp_wildcard_string_column() in ('profile_update', 'logout');
      timestamp       |                                                                          _col1                                                                           
----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------
 2025-05-13T15:00:20Z | {"timestamp":"2025-05-13T15:00:20Z","level":"INFO","service":"user-service","event":{"action":"logout"},"user":{"id":"12345"}}                           
 2025-05-13T15:00:06Z | {"timestamp":"2025-05-13T15:00:06Z","level":"INFO","service":"user-service","event":{"action":"profile_update"},"user":{"id":"67890","name":"Jane Doe"}} 
 2025-05-13T15:00:20Z | {"timestamp":"2025-05-13T15:00:20Z","level":"INFO","service":"user-service","event":{"action":"logout"},"user":{"id":"12345"}}                           
 2025-05-13T15:00:06Z | {"timestamp":"2025-05-13T15:00:06Z","level":"INFO","service":"user-service","event":{"action":"profile_update"},"user":{"id":"67890","name":"Jane Doe"}} 
(4 rows)

Query 20251018_182705_00001_ikv76, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
[Latency: client-side: 225ms, server-side: 191ms] [6 rows, 0B] [31 rows/s, 0B/s]

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for JSON string column projection in CLP archive searches with efficient lazy-loading capabilities.
  • Bug Fixes

    • Improved column projection validation and handling during vector creation.
  • Tests

    • Added comprehensive test coverage for JSON string column functionality.
  • Chores

    • Updated CLP dependency to a newer version.

Copy link

coderabbitai bot commented Oct 17, 2025

Walkthrough

This pull request introduces support for JSON string columns in the CLP connector. It refactors column projection handling to detect and lazily load __json_string columns via a new ClpArchiveJsonStringVectorLoader class, updates the CLP dependency to a newer commit, adjusts public header dependencies in CMake, and adds test cases for JSON string functionality.

Changes

Cohort / File(s) Summary
Dependency Update
CMake/resolve_dependency_modules/clp.cmake
Updates CLP FetchContent GIT_TAG from 67276c09acbd48dd502454288f40072c44628726 to 9e991ab49681faff0aea259a37b489c3f416e06e to fetch a newer revision.
Build Configuration
velox/connectors/clp/search_lib/archive/CMakeLists.txt
Removes public header dependencies (ClpArchiveCursor.h, ClpArchiveVectorLoader.h, ClpQueryRunner.h) from target sources and adds ClpArchiveJsonStringVectorLoader.cpp as a new source file.
Cursor Refactoring
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h, velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
Introduces per-column indexing (columnIndex_, projectedColumnIndex_) replacing single reader index; adds schemaReader_ member and jsonStringColumnIndices_ tracking; detects __json_string columns during projection setup and implements lazy loading path via createVectorHelper; reworks projection validation and vector creation logic.
JSON String Loader
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h, velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp
Adds new VectorLoader class that generates JSON strings on-demand from a SchemaReader for filtered row indices; constructor accepts SchemaReader pointer and filtered row indices; loadInternal method iterates rows, generates JSON per message index, and populates string vector.
Test Coverage
velox/connectors/clp/tests/ClpConnectorTest.cpp
Adds test1JsonString and test2JsonSring test cases validating JSON string projection, filtering, and result verification against test archives.

Sequence Diagram

sequenceDiagram
    participant User as Query Executor
    participant Cursor as ClpArchiveCursor
    participant Loader as ClpArchiveJsonStringVectorLoader
    participant Reader as SchemaReader

    User->>Cursor: loadSplit(projection)
    activate Cursor
    Cursor->>Cursor: detect __json_string columns
    Cursor->>Cursor: record indices in jsonStringColumnIndices_
    Note over Cursor: Track per-column indices<br/>(columnIndex_, projectedColumnIndex_)
    deactivate Cursor

    User->>Cursor: createVectorHelper(columnIndex_)
    activate Cursor
    Cursor->>Cursor: check if columnIndex_ in jsonStringColumnIndices_
    alt __json_string column detected
        Cursor->>Loader: new ClpArchiveJsonStringVectorLoader(schemaReader_, filteredRowIndices)
        activate Loader
        Cursor->>Cursor: return LazyVector(loader)
        deactivate Loader
    else regular column
        Cursor->>Reader: retrieve column data
        Cursor->>Cursor: return populated Vector
    end
    deactivate Cursor

    User->>Loader: loadInternal(rowSet)
    activate Loader
    loop for each filtered row index
        Loader->>Reader: generateJsonString(messageIndex)
        Reader-->>Loader: json_string
        Loader->>Loader: assign to StringVector position
    end
    deactivate Loader
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes span multiple interdependent files with new logic for JSON string detection and lazy loading, but follow a cohesive pattern. The refactoring of column indexing in ClpArchiveCursor and introduction of the new ClpArchiveJsonStringVectorLoader class require careful review of the control flow and integration points, though the overall scope is focused on a single feature.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.53% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat: Add support for CLP JSON string serialization for archives" directly aligns with the main objective and all substantive changes in the changeset. The changes implement a complete feature for supporting JSON string serialization through the introduction of a new ClpArchiveJsonStringVectorLoader class, modifications to projection handling in ClpArchiveCursor to track and exclude __json_string marker columns, and test cases validating the new functionality. The title is specific, clear, and accurately conveys the primary change without vagueness or misleading information.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@wraymo wraymo changed the title add clp_get_json_string support feat: Add support for CLP JSON string serialization Oct 18, 2025
@wraymo wraymo changed the title feat: Add support for CLP JSON string serialization feat: Add support for CLP JSON string serialization for archives Oct 18, 2025
@wraymo wraymo marked this pull request as ready for review October 18, 2025 18:27
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf89596 and f4632fd.

📒 Files selected for processing (7)
  • CMake/resolve_dependency_modules/clp.cmake (1 hunks)
  • velox/connectors/clp/search_lib/archive/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (7 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h (1 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
🧬 Code graph analysis (5)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h (2)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp (3)
  • ClpArchiveJsonStringVectorLoader (25-30)
  • loadInternal (32-47)
  • loadInternal (32-36)
velox/vector/LazyVector.h (1)
  • VectorLoader (177-210)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (5)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
  • vector (76-78)
  • ClpQueryRunner (43-88)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)
velox/python/type/type.pyi (1)
  • VARCHAR (29-29)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp (1)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h (1)
  • ClpArchiveJsonStringVectorLoader (29-44)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
  • projectedColumnIndex_ (69-69)
  • columnIndex_ (68-68)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (1)
  • vector (76-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (18)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)

19-19: LGTM!

The <unordered_set> include is necessary for the jsonStringColumnIndices_ member.


25-25: LGTM!

Forward declaration is appropriate. Based on learnings, the type is available through transitive inclusion when needed.


60-70: LGTM!

The new members properly support the per-column indexing scheme for JSON string handling. The renaming from readerIndex_ to columnIndex_ improves clarity.

velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)

18-18: LGTM!

The new JSON string vector loader source is correctly added to the build.

velox/connectors/clp/tests/ClpConnectorTest.cpp (2)

210-261: LGTM!

The test comprehensively validates JSON string serialization with proper assertions for the __json_string column alongside regular fields.


441-475: LGTM!

The test properly validates JSON string serialization with KQL pushdown filters and verifies the complete JSON payload structure.

velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp (2)

25-30: LGTM!

The constructor properly initializes members and sets up the serializer.


41-46: Since clp_s is an external library, let me search for documentation about it and the return type of generate_json_string:

Let me search specifically for the clp_s library source code and the SchemaReader class:

Based on my investigation, I was unable to locate the exact source code for clp_s::SchemaReader::generate_json_string to verify the return type and lifetime guarantees. While I confirmed that clp_s is part of YScope's open-source Compressed Log Processor library, the specific method definition is not accessible in the web search results or the repository files I checked.

The original review comment raises a legitimate concern about StringView lifetime safety that requires verification from someone with access to the clp_s library documentation or source code.

StringView lifetime safety must be verified.

Line 44 constructs a StringView from the return value of schemaReader_->generate_json_string(messageIndex). Confirm that this method returns a string with appropriate lifetime (managed buffer in schemaReader_) rather than a temporary or reference, to avoid dangling pointers when StringView is stored in the vector.

velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h (1)

31-43: LGTM!

The class interface is well-designed with appropriate encapsulation. The constructor signature matches the implementation, and the loadInternal override follows the VectorLoader contract.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (8)

24-24: LGTM!

Including the new JSON string vector loader header is necessary for the lazy loading implementation below.


49-50: LGTM!

Resetting both column indices at the start of fetchNext ensures proper state for the next batch.


74-76: LGTM!

The refactored approach using schemaReader_ and initialize_filter_with_column_map properly integrates the schema reader for JSON string generation.


104-104: LGTM!

The validation correctly accounts for JSON string columns that are handled separately from the projected columns.


154-159: LGTM!

The detection of __json_string columns and recording their indices is clean and correctly skips further processing for these special columns.


247-256: LGTM!

Pre-sizing the children vector with reserve and using explicit numChildren improves efficiency by avoiding reallocations during vector construction.


261-271: LGTM!

The lazy loading path for JSON string columns is well-implemented, using the new ClpArchiveJsonStringVectorLoader to generate JSON strings on demand.


273-286: LGTM!

The updated column access pattern correctly uses projectedColumnIndex_ to index into projectedColumns and retrieves the corresponding type from outputColumns_.

CMake/resolve_dependency_modules/clp.cmake (1)

19-19: Commit 9e991ab49681faff0aea259a37b489c3f416e06e exists and is accessible in the CLP repository.

Verification confirms the commit is available and publicly accessible via both direct git fetch and GitHub API. The commit has been successfully integrated into the y-scope/clp repository.


#pragma once

#include <simdjson.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Remove unused include.

The <simdjson.h> header is included but doesn't appear to be used in this file. Remove it to reduce unnecessary dependencies.

Apply this diff:

-#include <simdjson.h>
-
 #include "clp_s/SchemaReader.hpp"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <simdjson.h>
#include "clp_s/SchemaReader.hpp"
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h
around line 19, the header <simdjson.h> is included but unused; remove that
include line to eliminate the unnecessary dependency and any related
compile-time overhead, then run a build or include-graph check to confirm no
compilation breakage.

test::assertEqualVectors(expected, irOutput);
}

TEST_F(ClpConnectorTest, test2JsonSring) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix the typo in the test name.

The test name has a typo: "test2JsonSring" should be "test2JsonString" for consistency with the other test.

Apply this diff:

-TEST_F(ClpConnectorTest, test2JsonSring) {
+TEST_F(ClpConnectorTest, test2JsonString) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
TEST_F(ClpConnectorTest, test2JsonSring) {
TEST_F(ClpConnectorTest, test2JsonString) {
🤖 Prompt for AI Agents
In velox/connectors/clp/tests/ClpConnectorTest.cpp around line 440, rename the
test identifier from TEST_F(ClpConnectorTest, test2JsonSring) to
TEST_F(ClpConnectorTest, test2JsonString) to fix the typo so the test name
matches the intended "test2JsonString"; update only the test name token and
leave the rest of the test body unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant