Skip to content

Conversation

wraymo
Copy link
Member

@wraymo wraymo commented Oct 10, 2025

Description

This PR exposes two previously private JSON serialization methods by making them public and updates generate_json_string to accept a message_index parameter and return the serialized string. The internal serialization logic has been modified to pass this index, enabling external callers to generate JSON strings for specific messages.

The main goal is to support CLP_GET_JSON_STRING() in Presto/Velox integration, allowing users to retrieve serialized JSON representations of messages.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

  • New Features

    • Generate JSON for a specific message by index, improving precision when exporting or inspecting individual entries.
    • JSON is now returned directly from the generation call, simplifying retrieval and integration with downstream tools.
  • Refactor

    • Streamlined JSON generation flow for consistent behaviour across numbers, booleans, strings, arrays and formatted values.
    • Serializer initialization is now exposed for clearer setup control.

@wraymo wraymo requested review from a team and gibber9809 as code owners October 10, 2025 17:16
Copy link
Contributor

coderabbitai bot commented Oct 10, 2025

Walkthrough

Refactors SchemaReader JSON generation to accept a message_index and return a std::string; propagates message_index through all value extraction and serialization calls; exposes initialize_serializer() as public and removes the prior private no-arg variants; updates call sites to consume the returned string.

Changes

Cohort / File(s) Summary
SchemaReader API & implementation
components/core/src/clp_s/SchemaReader.hpp, components/core/src/clp_s/SchemaReader.cpp
Change SchemaReader::generate_json_string() to auto generate_json_string(uint64_t message_index) -> std::string; update internal logic to use message_index instead of m_cur_message; return the serialized JSON string instead of reading it from the serializer afterwards; expose initialize_serializer() as public and remove the previous private no-arg variants; update call sites to pass message_index and use the returned string.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant SR as SchemaReader
  participant JS as JsonSerializer

  Caller->>SR: initialize_serializer()
  Note right of SR #DDEEFF: Serializer setup (public)

  Caller->>SR: generate_json_string(message_index)
  activate SR
  SR->>SR: Extract values using message_index
  SR->>JS: Add fields/values for message_index
  JS-->>SR: Serialized string
  SR-->>Caller: std::string (JSON)
  deactivate SR
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title Check ✅ Passed The pull request title "feat(clp-s): Expose JSON serialization in SchemaReader." directly and accurately reflects the main architectural change in the changeset. The summary confirms that the PR exposes two previously private JSON serialization methods (generate_json_string and initialize_serializer) by making them public, along with refactoring generate_json_string to accept a message_index parameter and return a serialized JSON string. The title is concise, uses clear action-oriented language (Expose), and is specific enough that a teammate scanning the repository history would understand the primary change without ambiguity.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e5ab12 and 4a8fbfb.

📒 Files selected for processing (2)
  • components/core/src/clp_s/SchemaReader.cpp (8 hunks)
  • components/core/src/clp_s/SchemaReader.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/SchemaReader.cpp
  • components/core/src/clp_s/SchemaReader.hpp
🧬 Code graph analysis (1)
components/core/src/clp_s/SchemaReader.cpp (2)
components/core/src/clp_s/SchemaReader.hpp (4)
  • message_index (162-162)
  • message (169-169)
  • message (177-177)
  • message (187-192)
components/core/src/clp_s/JsonSerializer.hpp (4)
  • column (133-136)
  • column (133-133)
  • column (138-143)
  • column (139-139)
🔇 Additional comments (6)
components/core/src/clp_s/SchemaReader.hpp (2)

157-162: LGTM! Well-documented public API.

The new signature with message_index parameter and std::string return type enables external callers to generate JSON for specific messages. The documentation clearly describes the parameter and return value.


207-210: LGTM! Public initializer exposed.

Making initialize_serializer public allows external callers to prepare the serialization state before calling generate_json_string.

components/core/src/clp_s/SchemaReader.cpp (4)

103-177: LGTM! Consistent parameter usage.

All value extractions correctly use message_index instead of m_cur_message, enabling external callers to serialize arbitrary message indices.


193-193: LGTM! Proper return value.

Returning the serialized string enables external callers to retrieve the JSON without direct access to internal state.


204-204: LGTM! Call sites correctly updated.

All internal call sites properly assign the returned string and pass m_cur_message as the message index argument.

Also applies to: 225-225, 257-257


66-194: Thread-safety: Protect shared serializer.
m_json_serializer is a mutable member used across calls. Confirm whether generate_json_string may be invoked concurrently; if so, serialize access with a mutex or switch to a thread-local serializer.

}

void SchemaReader::generate_json_string() {
std::string SchemaReader::generate_json_string(uint64_t message_index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add bounds validation for message_index.

The function doesn't validate that message_index < m_num_messages. Out-of-bounds access in column readers could cause undefined behaviour or crashes.

Apply this diff to add bounds checking:

 std::string SchemaReader::generate_json_string(uint64_t message_index) {
+    if (message_index >= m_num_messages) {
+        throw OperationFailed(ErrorCodeBadParam, __FILENAME__, __LINE__);
+    }
     m_json_serializer.reset();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::string SchemaReader::generate_json_string(uint64_t message_index) {
std::string SchemaReader::generate_json_string(uint64_t message_index) {
if (message_index >= m_num_messages) {
throw OperationFailed(ErrorCodeBadParam, __FILENAME__, __LINE__);
}
m_json_serializer.reset();
//
}
🤖 Prompt for AI Agents
In components/core/src/clp_s/SchemaReader.cpp around line 66, add a guard that
validates message_index is within bounds (message_index < m_num_messages) before
using it; if the check fails, throw a std::out_of_range with a clear message (or
otherwise return an error) to prevent out-of-bounds access in column readers,
and add #include <stdexcept> to the file if needed.

Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments.

Also, could we do a quick benchmark to confirm there aren't performance regressions? I don't expect this change to affect performance, but still worth checking.

Comment on lines 158 to 162
* Generates a json string from the extracted values
* @param message_index the index of the message to generate the json string for
* @return the generated json string
*/
std::string generate_json_string(uint64_t message_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Generates a json string from the extracted values
* @param message_index the index of the message to generate the json string for
* @return the generated json string
*/
std::string generate_json_string(uint64_t message_index);
* Generates a json string from the encoded columns.
* @param message_index The index of the message to generate the JSON string for.
* @return The generated JSON string.
*/
[[nodiscard]] auto generate_json_string(uint64_t message_index) -> std::string;

Updating the comment a bit since I'm pretty sure that extracted values line refers to a very old implementation + updating the prototype to comply with our current standard since we're touching this anyway.

}

void SchemaReader::generate_json_string() {
std::string SchemaReader::generate_json_string(uint64_t message_index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::string SchemaReader::generate_json_string(uint64_t message_index) {
auto SchemaReader::generate_json_string(uint64_t message_index) -> std::string {

Same as other comment.

@wraymo wraymo requested a review from gibber9809 October 15, 2025 12:47
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/core/src/clp_s/SchemaReader.cpp (1)

66-66: Add bounds validation for message_index.

The function doesn't validate that message_index < m_num_messages. Out-of-bounds access in column readers could cause undefined behaviour or crashes.

Apply this diff to add bounds checking:

 auto SchemaReader::generate_json_string(uint64_t message_index) -> std::string {
+    if (message_index >= m_num_messages) {
+        throw OperationFailed(ErrorCodeBadParam, __FILENAME__, __LINE__);
+    }
     m_json_serializer.reset();
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8fbfb and 8096c93.

📒 Files selected for processing (2)
  • components/core/src/clp_s/SchemaReader.cpp (8 hunks)
  • components/core/src/clp_s/SchemaReader.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/SchemaReader.cpp
  • components/core/src/clp_s/SchemaReader.hpp
🧬 Code graph analysis (1)
components/core/src/clp_s/SchemaReader.cpp (2)
components/core/src/clp_s/SchemaReader.hpp (5)
  • message_index (162-162)
  • message (169-169)
  • message (177-177)
  • message (187-192)
  • m_cur_message (248-248)
components/core/src/clp_s/JsonSerializer.hpp (4)
  • column (133-136)
  • column (133-133)
  • column (138-143)
  • column (139-139)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
🔇 Additional comments (7)
components/core/src/clp_s/SchemaReader.hpp (1)

157-162: LGTM!

The method signature and documentation are clear and appropriate. The [[nodiscard]] attribute correctly ensures the returned string isn't ignored, and the trailing return type syntax aligns with modern C++ standards.

components/core/src/clp_s/SchemaReader.cpp (6)

98-158: LGTM!

All value extractions correctly use the message_index parameter instead of m_cur_message. The changes are consistent across integer, float, and boolean operations.


130-180: LGTM!

All serializer append operations correctly propagate the message_index parameter to the column readers. The changes maintain consistency across formatted floats, strings, and arrays.


192-194: LGTM!

The return statement correctly returns the serialized JSON string, making the API more intuitive and explicit.


196-212: LGTM!

The call site correctly uses the returned string from generate_json_string(m_cur_message), simplifying the code and improving readability.


214-237: LGTM!

The filtered message retrieval correctly uses the returned string from generate_json_string(m_cur_message), maintaining consistency with other call sites.


239-272: LGTM!

The message retrieval with metadata correctly uses the returned string from generate_json_string(m_cur_message), completing the refactor across all call sites.

void initialize_filter_with_column_map(FilterClass* filter);

/**
* Initializes all internal data structured required to serialize records.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Initializes all internal data structured required to serialize records.
* Initializes all internal data structures required to serialize records.

Just explicitly adding my other comment as an edit suggestion. After this should be good to merge.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8096c93 and 378b418.

📒 Files selected for processing (1)
  • components/core/src/clp_s/SchemaReader.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/SchemaReader.hpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (1)
components/core/src/clp_s/SchemaReader.hpp (1)

207-210: LGTM!

The typo fix from "structured" to "structures" has been applied as requested in previous reviews. Making this method public aligns with the PR objective of supporting external JSON serialization.

Comment on lines +157 to +162
/**
* Generates a JSON string from the encoded columns
* @param message_index The index of the message to generate the JSON string for.
* @return The generated JSON string
*/
[[nodiscard]] auto generate_json_string(uint64_t message_index) -> std::string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify bounds checking and consider documenting preconditions.

The method signature looks correct and uses appropriate modern C++ idioms ([[nodiscard]], trailing return type). However, since this is now a public API that accepts an arbitrary message_index, consider the following:

  1. The implementation should validate that message_index < m_num_messages to prevent out-of-bounds access.
  2. The documentation could be enhanced to specify:
    • Precondition: message_index must be less than get_num_messages()
    • What happens on invalid input (throws exception, returns empty string, etc.)
    • Whether the serializer must be initialized first via initialize_serializer()

Run the following script to verify bounds checking is implemented:


🏁 Script executed:

#!/bin/bash
# Description: Verify that generate_json_string performs bounds checking on message_index

# Search for the implementation and check for bounds validation
ast-grep --pattern $'auto SchemaReader::generate_json_string($_, message_index) -> std::string {
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Locate generate_json_string implementation and check for bounds validation
rg -nP "SchemaReader::generate_json_string\s*\(.*\)\s*->\s*std::string" -n -C5 --type=cpp

Length of output: 878


🏁 Script executed:

#!/bin/bash
# Display the full body of generate_json_string to check for bounds validation
rg -nP "auto SchemaReader::generate_json_string" -n -A100 components/core/src/clp_s/SchemaReader.cpp

Length of output: 5090


Validate message_index in generate_json_string and document preconditions

  • In SchemaReader.cpp (line 66), add a check if (message_index >= get_num_messages()) throw OperationFailed(ErrorCodeInvalidArgument, __FILENAME__, __LINE__); to prevent out-of-bounds access.
  • Update the header doc (SchemaReader.hpp lines 157–162) to specify that message_index must be < get_num_messages(), describe the exception on invalid input, and note that initialize_serializer() must be called first.
🤖 Prompt for AI Agents
In components/core/src/clp_s/SchemaReader.hpp around lines 157–162, update the
method documentation for generate_json_string to state that message_index must
be < get_num_messages(), that initialize_serializer() must have been called
before invoking this method, and that the method throws
OperationFailed(ErrorCodeInvalidArgument, __FILENAME__, __LINE__) on invalid
message_index; and in components/core/src/clp_s/SchemaReader.cpp at ~line 66,
add a runtime check: if message_index >= get_num_messages() then throw
OperationFailed(ErrorCodeInvalidArgument, __FILENAME__, __LINE__) to prevent
out-of-bounds access before proceeding.

@wraymo wraymo requested a review from gibber9809 October 16, 2025 15:33
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Did some quick benchmarking, and this seems to slow down decompression by ~1.5%, but that should be okay for now.

For title how about feat(clp-s): Expose JSON serialization in `SchemaReader`.

@wraymo wraymo changed the title feat: Add external JSON serialization support feat(clp-s): Expose JSON serialization in SchemaReader. Oct 16, 2025
@jackluo923 jackluo923 merged commit 9e991ab into y-scope:main Oct 16, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants