-
Notifications
You must be signed in to change notification settings - Fork 0
feat: Add support for CLP JSON string serialization for archives #40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: presto-0.293-clp-connector
Are you sure you want to change the base?
feat: Add support for CLP JSON string serialization for archives #40
Conversation
WalkthroughThis pull request introduces support for JSON string columns in the CLP connector. It refactors column projection handling to detect and lazily load Changes
Sequence DiagramsequenceDiagram
participant User as Query Executor
participant Cursor as ClpArchiveCursor
participant Loader as ClpArchiveJsonStringVectorLoader
participant Reader as SchemaReader
User->>Cursor: loadSplit(projection)
activate Cursor
Cursor->>Cursor: detect __json_string columns
Cursor->>Cursor: record indices in jsonStringColumnIndices_
Note over Cursor: Track per-column indices<br/>(columnIndex_, projectedColumnIndex_)
deactivate Cursor
User->>Cursor: createVectorHelper(columnIndex_)
activate Cursor
Cursor->>Cursor: check if columnIndex_ in jsonStringColumnIndices_
alt __json_string column detected
Cursor->>Loader: new ClpArchiveJsonStringVectorLoader(schemaReader_, filteredRowIndices)
activate Loader
Cursor->>Cursor: return LazyVector(loader)
deactivate Loader
else regular column
Cursor->>Reader: retrieve column data
Cursor->>Cursor: return populated Vector
end
deactivate Cursor
User->>Loader: loadInternal(rowSet)
activate Loader
loop for each filtered row index
Loader->>Reader: generateJsonString(messageIndex)
Reader-->>Loader: json_string
Loader->>Loader: assign to StringVector position
end
deactivate Loader
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes The changes span multiple interdependent files with new logic for JSON string detection and lazy loading, but follow a cohesive pattern. The refactoring of column indexing in ClpArchiveCursor and introduction of the new ClpArchiveJsonStringVectorLoader class require careful review of the control flow and integration points, though the overall scope is focused on a single feature. Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
CMake/resolve_dependency_modules/clp.cmake
(1 hunks)velox/connectors/clp/search_lib/archive/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
(7 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
(2 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h
(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp
(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
🧬 Code graph analysis (5)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h (2)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp (3)
ClpArchiveJsonStringVectorLoader
(25-30)loadInternal
(32-47)loadInternal
(32-36)velox/vector/LazyVector.h (1)
VectorLoader
(177-210)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (5)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)vector
(76-78)ClpQueryRunner
(43-88)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)
velox/python/type/type.pyi (1)
VARCHAR
(29-29)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp (1)
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h (1)
ClpArchiveJsonStringVectorLoader
(29-44)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
projectedColumnIndex_
(69-69)columnIndex_
(68-68)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (1)
vector
(76-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (18)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)
19-19
: LGTM!The
<unordered_set>
include is necessary for thejsonStringColumnIndices_
member.
25-25
: LGTM!Forward declaration is appropriate. Based on learnings, the type is available through transitive inclusion when needed.
60-70
: LGTM!The new members properly support the per-column indexing scheme for JSON string handling. The renaming from
readerIndex_
tocolumnIndex_
improves clarity.velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)
18-18
: LGTM!The new JSON string vector loader source is correctly added to the build.
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
210-261
: LGTM!The test comprehensively validates JSON string serialization with proper assertions for the
__json_string
column alongside regular fields.
441-475
: LGTM!The test properly validates JSON string serialization with KQL pushdown filters and verifies the complete JSON payload structure.
velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.cpp (2)
25-30
: LGTM!The constructor properly initializes members and sets up the serializer.
41-46
: Since clp_s is an external library, let me search for documentation about it and the return type ofgenerate_json_string
:Let me search specifically for the clp_s library source code and the SchemaReader class:
Based on my investigation, I was unable to locate the exact source code for
clp_s::SchemaReader::generate_json_string
to verify the return type and lifetime guarantees. While I confirmed that clp_s is part of YScope's open-source Compressed Log Processor library, the specific method definition is not accessible in the web search results or the repository files I checked.The original review comment raises a legitimate concern about StringView lifetime safety that requires verification from someone with access to the clp_s library documentation or source code.
StringView lifetime safety must be verified.
Line 44 constructs a
StringView
from the return value ofschemaReader_->generate_json_string(messageIndex)
. Confirm that this method returns a string with appropriate lifetime (managed buffer inschemaReader_
) rather than a temporary or reference, to avoid dangling pointers whenStringView
is stored in the vector.velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h (1)
31-43
: LGTM!The class interface is well-designed with appropriate encapsulation. The constructor signature matches the implementation, and the
loadInternal
override follows theVectorLoader
contract.velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (8)
24-24
: LGTM!Including the new JSON string vector loader header is necessary for the lazy loading implementation below.
49-50
: LGTM!Resetting both column indices at the start of
fetchNext
ensures proper state for the next batch.
74-76
: LGTM!The refactored approach using
schemaReader_
andinitialize_filter_with_column_map
properly integrates the schema reader for JSON string generation.
104-104
: LGTM!The validation correctly accounts for JSON string columns that are handled separately from the projected columns.
154-159
: LGTM!The detection of
__json_string
columns and recording their indices is clean and correctly skips further processing for these special columns.
247-256
: LGTM!Pre-sizing the children vector with
reserve
and using explicitnumChildren
improves efficiency by avoiding reallocations during vector construction.
261-271
: LGTM!The lazy loading path for JSON string columns is well-implemented, using the new
ClpArchiveJsonStringVectorLoader
to generate JSON strings on demand.
273-286
: LGTM!The updated column access pattern correctly uses
projectedColumnIndex_
to index intoprojectedColumns
and retrieves the corresponding type fromoutputColumns_
.CMake/resolve_dependency_modules/clp.cmake (1)
19-19
: Commit 9e991ab49681faff0aea259a37b489c3f416e06e exists and is accessible in the CLP repository.Verification confirms the commit is available and publicly accessible via both direct git fetch and GitHub API. The commit has been successfully integrated into the y-scope/clp repository.
|
||
#pragma once | ||
|
||
#include <simdjson.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick | 🔵 Trivial
Remove unused include.
The <simdjson.h>
header is included but doesn't appear to be used in this file. Remove it to reduce unnecessary dependencies.
Apply this diff:
-#include <simdjson.h>
-
#include "clp_s/SchemaReader.hpp"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
#include <simdjson.h> | |
#include "clp_s/SchemaReader.hpp" |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/archive/ClpArchiveJsonStringVectorLoader.h
around line 19, the header <simdjson.h> is included but unused; remove that
include line to eliminate the unnecessary dependency and any related
compile-time overhead, then run a build or include-graph check to confirm no
compilation breakage.
test::assertEqualVectors(expected, irOutput); | ||
} | ||
|
||
TEST_F(ClpConnectorTest, test2JsonSring) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the typo in the test name.
The test name has a typo: "test2JsonSring" should be "test2JsonString" for consistency with the other test.
Apply this diff:
-TEST_F(ClpConnectorTest, test2JsonSring) {
+TEST_F(ClpConnectorTest, test2JsonString) {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
TEST_F(ClpConnectorTest, test2JsonSring) { | |
TEST_F(ClpConnectorTest, test2JsonString) { |
🤖 Prompt for AI Agents
In velox/connectors/clp/tests/ClpConnectorTest.cpp around line 440, rename the
test identifier from TEST_F(ClpConnectorTest, test2JsonSring) to
TEST_F(ClpConnectorTest, test2JsonString) to fix the typo so the test name
matches the intended "test2JsonString"; update only the test name token and
leave the rest of the test body unchanged.
Description
This PR adds support for CLP JSON string serialization for archives, enabling the
clp_get_json_string
UDF to return serialized JSON strings in Presto.Specifically:
__json_string
to represent columns that require serialized JSON output.ClpArchiveJsonStringVectorLoader
to initialize the JSON serializer and perform on-demand serialization when creating vectors for these columns.Checklist
breaking change.
Validation performed
Did an end-to-end testing with a sample dataset
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Chores