Skip to content

Conversation

anlowee
Copy link

@anlowee anlowee commented Sep 10, 2025

Description

This PR is the successor of #29 to handle the timestamp type. We first extracted the timestamp convertion utils from ClpArchiveVectorLoader to share with IR cases for integer and float timestamps. Then we use clp-s's TimestampDictionaryWriter to convert string timestamps to integer then use the util to convert to Velox's timestamp.

This PR also merged some unit test cases which were separated because the timestamp type had not been supported.

This PR also added the docs for IR in velox/docs/develop/connectors.rst.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Passed the CI.

Summary by CodeRabbit

  • New Features

    • IR execution now supports TIMESTAMP columns with robust parsing of ISO 8601 and numeric timestamps across varying precisions.
    • Projected columns can map to multiple source fields, improving flexibility with complex schemas.
  • Documentation

    • Connector docs updated to introduce a base cursor with specialised archive/IR implementations and split-type descriptions.
  • Tests

    • Added IR timestamp tests (with and without pushdown) and new NDJSON test data; IR-path tests streamlined.
  • Chores

    • Timestamp utilities exposed as a shared helper for reuse.

anlowee added 30 commits August 19, 2025 17:56
Copy link

coderabbitai bot commented Sep 10, 2025

Walkthrough

Extracts timestamp utilities into a new public header, expands IR projection to support multiple node IDs per projected column, updates ClpIrVectorLoader constructor and lookup logic, adds timestamp handling in IR loader, updates CMake and docs, and adjusts tests and test data for IR timestamp coverage.

Changes

Cohort / File(s) Summary of Changes
Timestamp utilities extraction
velox/connectors/clp/search_lib/ClpTimestampsUtils.h, velox/connectors/clp/search_lib/CMakeLists.txt, velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
Adds ClpTimestampsUtils.h (precision estimation and conversion APIs) and exposes it via CMake; replaces in-file timestamp helpers in archive loader with the new header.
IR projection: single → multi node IDs; callback update
velox/connectors/clp/search_lib/ir/ClpIrCursor.h, velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp, velox/connectors/clp/search_lib/ir/CMakeLists.txt
Changes projected-column map type from single Node::id_t to std::vector<Node::id_t>; projectionResolutionCallback now accumulates multiple node IDs per key; adjusts related aliases and adds clp_s::search::ast to PRIVATE link libs.
IR loader: constructor/signature & multi-key lookup
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h, velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp
Reorders and expands constructor to accept filtered events first, then isResolved, vector of nodeIds, nodeName, and nodeType; replaces single-node lookup with searching candidate nodeIds[] and selecting the first present value; internal state now uses nodeIds_ and nodeName_.
IR loader: Timestamp column support
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp, velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
Adds explicit handling for Timestamp column type in IR loader, converting double/int64_t inputs to Velox Timestamp via the new timestamp utils; archive loader includes the utils header to remove inline helpers.
Tests: IR path simplification and timestamp coverage
velox/connectors/clp/tests/ClpConnectorTest.cpp, velox/connectors/clp/tests/examples/test_4.ndjson
Removes duplicated IR plan construction in tests and reuses existing plans for IR execution; adds two IR timestamp tests (with and without pushdown); adds NDJSON test data containing ISO8601 and numeric timestamps.
Docs: cursor architecture update
velox/docs/develop/connectors.rst
Documents new BaseClpCursor concept with concrete subclasses (ClpArchiveCursor, ClpIrCursor) and adds a type discriminator to ClpConnectorSplit; updates CLP data source flow wording to reflect the base/specialized cursor architecture.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant S as Split
  participant DS as ClpDataSource
  participant C as BaseClpCursor<br/>(ClpIrCursor)
  participant L as ClpIrVectorLoader
  participant E as FilteredLogEvents

  Note over S,DS: IR split selected
  S->>DS: Provide split (type=IR)
  DS->>C: Create ClpIrCursor
  C->>C: Resolve projected columns<br/>to node ID vectors
  C->>L: ClpIrVectorLoader(filteredEvents,<br/>isResolved, nodeIds[], columnName, columnType)
  L->>E: Iterate events
  rect rgba(200,220,255,0.25)
    Note right of L: Multi-ID lookup
    L->>L: For nodeIds[] find first<br/>present value
  end
  alt Value found
    L-->>C: Append to column vector<br/>(incl. Timestamp via utils)
  else None found
    L-->>C: Skip row
  end
  C-->>DS: Vectors for projected columns
  DS-->>S: Return results
Loading
sequenceDiagram
  autonumber
  participant V as Value
  participant U as ClpTimestampsUtils
  participant T as Velox::Timestamp

  Note over V,U: Timestamp conversion
  V->>U: convertToVeloxTimestamp(value)
  alt double
    U->>U: estimatePrecision(double)
    U->>T: Build seconds + nanos
  else int64_t
    U->>U: estimatePrecision(int64_t)
    U->>T: Normalize to seconds + nanos
  else other
    U-->>V: Error (unsupported)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title “feat: Add timestamp type support for kv-ir splits.” succinctly and accurately captures the core enhancement of introducing timestamp handling into key-value IR splits, mirroring the utility extraction, code updates, tests, and documentation changes in this PR. It is focused, clear, and omits extraneous details or vague terminology, allowing reviewers to immediately grasp the main change.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@anlowee anlowee marked this pull request as ready for review September 15, 2025 20:11
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae5119 and 10e222a.

📒 Files selected for processing (10)
  • velox/connectors/clp/search_lib/BaseClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (0 hunks)
  • velox/connectors/clp/search_lib/ir/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrCursor.h (2 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (3 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (4 hunks)
  • velox/connectors/clp/tests/examples/test_4.ndjson (1 hunks)
💤 Files with no reviewable changes (1)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp
  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp
🧬 Code graph analysis (3)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
velox/type/Timestamp.h (1)
  • kMillisecondsInSecond (80-157)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (4)
  • convertToVeloxTimestamp (51-69)
  • convertToVeloxTimestamp (51-51)
  • convertToVeloxTimestamp (71-98)
  • convertToVeloxTimestamp (71-71)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)
velox/connectors/clp/search_lib/ir/ClpIrCursor.h (1)
  • readerIndex_ (82-82)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (10)
velox/connectors/clp/search_lib/ir/CMakeLists.txt (2)

25-25: Confirm PUBLIC vs PRIVATE for archive_writer on a STATIC target

Because clp-s-ir-search is STATIC, PUBLIC is often necessary to propagate link dependencies to final executables. If no downstream target needs to directly reference archive_writer symbols (only this static lib does), PRIVATE can reduce leakage. Please confirm transitive link needs and keep PUBLIC only if consumers require it.


32-32: Validate necessity of linking clp_s::search::ast

If this target’s object files don’t reference symbols from clp_s::search::ast (headers only), you can drop this to avoid unnecessary linkage. If it is required indirectly via ClpIrCursor.* types, keep it.

velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)

71-98: LGTM: int64_t→Timestamp covers seconds/ms/µs/ns and negatives correctly

Division/modulo with post‑adjustment for negatives yields canonical (sec, nanos). Nice.

velox/connectors/clp/tests/ClpConnectorTest.cpp (4)

261-267: IR parity assertion added — good

Great to assert IR produces the same result as archive for this plan.


318-324: IR parity assertion added — good

Same here; validates pushdown path parity.


378-385: IR parity for hybrid plan — good

Covers mixed projection with computed filter. Nice.


418-445: Extend test4IrTimestampNoPushdown to cover µs/ns and string timestamps

test_4_ir.clps contains an ISO8601 string timestamp and additional encoded timestamp representations; update velox/connectors/clp/tests/ClpConnectorTest.cpp (test4IrTimestampNoPushdown, ~lines 418–445) to assert all qualifying rows for second/microsecond/nanosecond precisions and add a separate assertion (or test) that exercises the ISO8601 string (dictionary) path.
Shall I implement the test changes to assert sec/µs/ns and add the string-timestamp case?

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)

103-123: Prefer std::string_view for dictionary ingestion; avoid raw .data()

Passing .data() drops length and risks embedded NULs — prefer std::string_view (or const std::string&) if the ingest_entry overload supports it.

File: velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (around lines 113–119)

-        } else if (value->is<std::string>()) {
-          auto stringValue = value->get_immutable_view<std::string>().data();
-          uint64_t encodingId{};
-          auto const timestamp = timestampDict_.ingest_entry(
-              nodeName_, nodeId, stringValue, encodingId);
-          timestampVector->set(vectorIndex, convertToVeloxTimestamp(timestamp));
+        } else if (value->is<std::string>()) {
+          const auto& s = value->get_immutable_view<std::string>();
+          std::string_view sv{s.data(), s.size()};
+          uint64_t encodingId{0};
+          auto ts = timestampDict_.ingest_entry(nodeName_, nodeId, sv, encodingId);
+          timestampVector->set(vectorIndex, convertToVeloxTimestamp(ts));
         } else {

I could not find clp_s/TimestampDictionaryWriter.hpp in the repo to confirm ingest_entry's signature — verify it accepts std::string_view; if not, pass the string with length (s, s.size() or s.c_str() plus length) per the actual API.

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (2)

62-66: LGTM: size invariant check

The check aligns projected node‑id maps with output columns.


193-198: LGTM: multi‑ID lookup wired correctly

Resolution check and capture of per‑column node IDs look correct.

Please confirm projectionResolutionCallback_ always inserts a map entry per output column (even if unresolved) to satisfy the size check above.

@anlowee
Copy link
Author

anlowee commented Sep 15, 2025

Going to fix some of the coderabbiai comments first.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10e222a and 1928748.

📒 Files selected for processing (1)
  • velox/docs/develop/connectors.rst (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
  • GitHub Check: Build

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)

19-27: Missing headers for std::floor and fixed-width integers

This header uses std::floor, int64_t, uint64_t, and uint8_t but doesn’t include /. Add explicit includes to avoid brittle transitive dependencies.

 #include <memory>
 #include <string>
 #include <string_view>
 #include <vector>
+#include <cmath>
+#include <cstdint>
 
 #include "clp_s/InputConfig.hpp"
 #include "velox/connectors/clp/ClpConnectorSplit.h"
 #include "velox/type/Timestamp.h"
♻️ Duplicate comments (6)
velox/connectors/clp/tests/examples/test_4.ndjson (1)

1-3: Add one more ISO‑8601 row to fully exercise the string‑dictionary path

Great start. To validate uniqueness/order and hit the dictionary more than once, add a second string timestamp (later than Line 1) and update the test to assert both are ingested correctly.

Example to append:
{"timestamp": "2025-04-30T08:50:06.500Z"}

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)

21-21: Reduce transitive include and avoid eager construction of TimestampDictionaryWriter

Rebuild surface: including TimestampDictionaryWriter in a public header + holding it by value expands transitive dependencies and alloc/cost for non‑Timestamp columns. Forward‑declare and hold it lazily.

Minimal change using std::optional (keeps current usage pattern):

+#include <optional>
-#include "clp_s/TimestampDictionaryWriter.hpp"
+#include "clp_s/TimestampDictionaryWriter.hpp" // keep for now if optional stays in header
...
-  ColumnType nodeType_;
-  clp_s::TimestampDictionaryWriter timestampDict_;
+  ColumnType nodeType_;
+  std::optional<clp_s::TimestampDictionaryWriter> timestampDict_;

Pair with the .cpp change to emplace on first use (see suggested diff in ClpIrVectorLoader.cpp).

If compile times regress, consider a pimpl: forward‑declare the writer here and store std::unique_ptr<...>; define the destructor and all uses in the .cpp.

Also applies to: 50-55

velox/docs/develop/connectors.rst (2)

157-161: Tighten wording and mention the timestamp conversion flow

Clarify concrete subclass instantiation and note the timestamp path added in this PR.

-When a split is added, a ``BaseClpCursor`` instance is created with the split path and input source, which
-may be either a ``ClpArchiveCursor`` or a ``ClpIrCursor``. The query is parsed and simplified into an AST.
+When a split is added, a concrete cursor (``ClpArchiveCursor`` or ``ClpIrCursor``) is created with the split path and input source; both derive from ``BaseClpCursor``. The query is parsed and simplified into an AST. For timestamp columns, string timestamps may be converted to integers via CLP‑s’ ``TimestampDictionaryWriter`` and then to Velox timestamps via shared conversion utilities.

137-140: Spell out allowed values for the new split type

Make the accepted tokens explicit (and case‑sensitive if applicable).

- includes a ``type`` property that specifies whether the split is an archive or an IR (Intermediate Representation)
- stream.
+ includes a ``type`` property that specifies whether the split is an archive or an IR (Intermediate Representation)
+ stream. Allowed values: ``archive`` or ``ir`` (case‑sensitive).
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (2)

205-209: Good: moved node IDs to avoid a copy

std::move(projectedColumnNodeIds) here is the right optimisation and matches the constructor.


136-140: Include string literal types for timestamp projection

Without VarStringT|ClpStringT, string‑encoded timestamps won’t resolve via the dictionary path.

Apply:

-      case ColumnType::Timestamp:
-        // TODO: IR timestamp support pending; constrain to Unknown to avoid
-        // mismatched projections.
-        literalType = search::ast::LiteralType::FloatT |
-            search::ast::LiteralType::IntegerT;
+      case ColumnType::Timestamp:
+        // IR timestamp supported: accept numeric and string literal forms.
+        literalType = search::ast::LiteralType::FloatT |
+            search::ast::LiteralType::IntegerT |
+            search::ast::LiteralType::VarStringT |
+            search::ast::LiteralType::ClpStringT;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1928748 and 1982c3b.

📒 Files selected for processing (6)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (2 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)
  • velox/connectors/clp/tests/examples/test_4.ndjson (1 hunks)
  • velox/docs/develop/connectors.rst (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp
🧬 Code graph analysis (2)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
velox/type/Timestamp.h (2)
  • TimestampPrecision (33-458)
  • kMillisecondsInSecond (80-157)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)
velox/connectors/clp/search_lib/ir/ClpIrCursor.h (1)
  • readerIndex_ (82-82)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (2)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)

36-45: LGTM: iterator/nodeId initialisation fixed

Initialising valueIt to end() and default‑initialising nodeId eliminates UB if no nodeId matches. Good fix.

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)

31-44: Move nodeIds into the member (good), keep call sites passing std::move

Using std::move(nodeIds) avoids an extra copy; ensure call sites pass std::move(projectedColumnNodeIds) as done in ClpIrCursor.cpp.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)

19-27: Missing headers for uint8_t and std::floor.

Add for uint8_t and for std::floor to avoid fragile transitive includes.

 #pragma once
 
 #include <memory>
 #include <string>
 #include <string_view>
 #include <vector>
+#include <cstdint>
+#include <cmath>
 
 #include "clp_s/InputConfig.hpp"
 #include "velox/connectors/clp/ClpConnectorSplit.h"
 #include "velox/type/Timestamp.h"
♻️ Duplicate comments (4)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (2)

136-140: Include string literal types for Timestamp projections and update stale TODO.

IR can carry timestamps as strings; limit to Float|Integer drops them.

-      case ColumnType::Timestamp:
-        // TODO: IR timestamp support pending; constrain to Unknown to avoid
-        // mismatched projections.
-        literalType = search::ast::LiteralType::FloatT |
-            search::ast::LiteralType::IntegerT;
+      case ColumnType::Timestamp:
+        literalType = search::ast::LiteralType::FloatT |
+            search::ast::LiteralType::IntegerT |
+            search::ast::LiteralType::VarStringT |
+            search::ast::LiteralType::ClpStringT;

62-66: Relax the equality check — allow unresolved columns.

Equality trips when some projections intentionally have no node IDs.

-  VELOX_CHECK_EQ(
-      projectedColumnIdxNodeIdsMap_.size(),
-      outputColumns_.size(),
-      "Resolved node-id map size ({}) must not exceed projected columns ({})",
-      projectedColumnIdxNodeIdsMap_.size(),
-      outputColumns_.size());
+  VELOX_CHECK_LE(
+      projectedColumnIdxNodeIdsMap_.size(),
+      outputColumns_.size(),
+      "Resolved node-id map size ({}) must not exceed projected columns ({})",
+      projectedColumnIdxNodeIdsMap_.size(),
+      outputColumns_.size());
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)

100-114: Timestamp branch leaves null bit set and lacks string‑timestamp support.

  • You set a value but never clear the null flag.
  • PR intent mentions string → integer conversion via TimestampDictionaryWriter; current code rejects strings.

Apply this diff to clear the null and handle strings (pairs with header change below):

       case ColumnType::Timestamp: {
         auto timestampVector = vector->asFlatVector<Timestamp>();
-        if (value->is<double>()) {
+        if (value->is<double>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<double>()));
-        } else if (value->is<int64_t>()) {
+        } else if (value->is<int64_t>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<int64_t>()));
+        } else if (value->is<std::string>()) {
+          // Lazily create the dictionary the first time we see a string ts.
+          if (!timestampDict_) {
+            timestampDict_ = std::make_unique<clp_s::TimestampDictionaryWriter>();
+          }
+          auto const& stringValue = value->get_immutable_view<std::string>();
+          uint64_t encodingId{};
+          auto const tsInt = timestampDict_->ingest_entry(
+              nodeName_, nodeId, stringValue, encodingId);
+          timestampVector->set(vectorIndex, convertToVeloxTimestamp(tsInt));
         } else {
           VELOX_FAIL("Unsupported timestamp type");
         }
+        vector->setNull(vectorIndex, false);
         break;
       }

And in this .cpp top‑section, include the writer and define the out‑of‑line dtor:

+#include "clp_s/TimestampDictionaryWriter.hpp"
+
+ClpIrVectorLoader::~ClpIrVectorLoader() = default;
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)

49-53: Add a writer handle for string timestamps (forward‑declared) to avoid header bloat.

Forward‑declare the writer and store a unique_ptr; define the destructor out‑of‑line in the .cpp that includes the writer header (see paired .cpp diff above).

 namespace facebook::velox::connector::clp::search_lib {
 
 class ClpIrVectorLoader : public VectorLoader {
  public:
+  ~ClpIrVectorLoader() override;
@@
  private:
   simdjson::ondemand::parser arrayParser_;
   std::shared_ptr<
       const std::vector<std::unique_ptr<::clp::ffi::KeyValuePairLogEvent>>>
       filteredLogEvents_;
   bool isResolved_;
   std::vector<::clp::ffi::SchemaTree::Node::id_t> nodeIds_;
   std::string nodeName_;
   ColumnType nodeType_;
+  // Lazily initialised in Timestamp branch.
+  std::unique_ptr<clp_s::TimestampDictionaryWriter> timestampDict_;

And add a forward declaration at the top of this header:

 #include "velox/connectors/clp/search_lib/BaseClpCursor.h"
 #include "velox/vector/FlatVector.h"
 #include "velox/vector/LazyVector.h"
 
+namespace clp_s { class TimestampDictionaryWriter; }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1982c3b and 947fded.

📒 Files selected for processing (5)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (2 hunks)
  • velox/connectors/clp/search_lib/ir/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/BaseClpCursor.h
🧬 Code graph analysis (2)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)
velox/connectors/clp/search_lib/ir/ClpIrCursor.h (1)
  • readerIndex_ (82-82)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
velox/type/Timestamp.h (1)
  • kMillisecondsInSecond (80-157)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (7)
velox/connectors/clp/search_lib/ir/CMakeLists.txt (1)

32-32: Add of clp_s::search::ast looks correct (PRIVATE).

Linking the AST lib is appropriate given IR files reference search::ast types.

Please confirm there are no public headers in this target exposing search::ast symbols; if there are, flip to PUBLIC.

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2)

36-44: Good fix: iterator and nodeId initialisation eliminates UB.

Initialising valueIt to end() and nodeId to {} avoids dereferencing garbage when no node matches.


45-47: Selection precedence relies on nodeIds_ order — document/verify.

You pick the first present nodeId; behaviour depends on nodeIds_ order. Document this precedence and add a test asserting deterministic selection.

Also applies to: 49-49

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

64-70: Timestamp precision utilities look solid.

Naming avoids clashes (InputTimestampPrecision) and the heuristic is reasonable given current formats.

Also applies to: 86-103

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)

193-199: Accurate resolved‑state and move semantics — LGTM.

Treating empty vectors as unresolved and moving node IDs into the loader avoids ambiguity and copies.

Also applies to: 206-210

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2)

22-22: Include path correction — LGTM.

Using the absolute project include keeps headers consistent.


33-43: Constructor/state wiring looks good; avoids copies.

Passing nodeIds by value and moving into the member is acceptable given the call site moves it.

Copy link

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Just a few minor comments.

It also might make sense to move the timestamp heuristic stuff into its own header instead of keeping it in BaseClpCursor.h, but I'll leave that up to you.

Comment on lines 136 to 137
// TODO: IR timestamp support pending; constrain to Unknown to avoid
// mismatched projections.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be able to delete this comment and/or update it to something mentioning future string timestamp support?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that Presto might only store the timestamp type data either in milliseconds or microseconds, and both of them are numeric actually. If that is the case and we only use velox for Presto, then we might don't need to support this. Let me double check.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirmed. Presto stores the TIMESTAMP as integer or a packed number for TIMESTAMP WITH TIMEZONE. So I expect that the timestamp pushdown passed to Velox will never contain date string. Therefore, the issue here is if the timestamp field in the IR file is actually a string, then it will have issues.

I will just remove this comments when we have a concrete idea to address this difference between IR and SFA. I have an issue to track this: #32

Comment on lines 38 to 44
for (size_t i{0}; i < nodeIds_.size(); ++i) {
valueIt = userGenNodeIdValueMap.find(nodeIds_[i]);
if (valueIt != userGenNodeIdValueMap.end()) {
nodeId = nodeIds_[i];
break;
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (size_t i{0}; i < nodeIds_.size(); ++i) {
valueIt = userGenNodeIdValueMap.find(nodeIds_[i]);
if (valueIt != userGenNodeIdValueMap.end()) {
nodeId = nodeIds_[i];
break;
}
}
for (auto const candidateNodeId : nodeIds_) {
valueIt = userGenNodeIdValueMap.find(candidateNodeId);
if (valueIt != userGenNodeIdValueMap.end()) {
nodeId = candidateNodeId;
break;
}
}

Sort of up to taste, but I think this is a bit more readable.

std::shared_ptr<
const std::vector<std::unique_ptr<::clp::ffi::KeyValuePairLogEvent>>>
filteredLogEvents_;
bool isResolved_;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool isResolved_;
bool isResolved_{};

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)

179-186: Pre‑reserve children to avoid reallocations.

Micro‑optimisation: reserve space before pushing back.

Apply:

     std::vector<VectorPtr> children;
     auto& rowType = vectorType->as<TypeKind::ROW>();
+    children.reserve(rowType.size());
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 947fded and 03d251d.

📒 Files selected for processing (6)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (4 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (3 hunks)
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp
  • velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h
  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp
🧬 Code graph analysis (1)
velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1)
velox/type/Timestamp.h (1)
  • kMillisecondsInSecond (80-157)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (9)
velox/connectors/clp/search_lib/CMakeLists.txt (1)

14-23: LGTM!

The addition of ClpTimestampsUtils.h to the public headers is appropriate for the timestamp utility refactoring.

velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)

17-24: LGTM!

Good refactoring to extract timestamp utilities into a shared header. This promotes code reuse between the archive and IR components.

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2)

38-46: Initialize iterator and nodeId to safe defaults

The variables valueIt and nodeId should be initialized to safe values to prevent undefined behaviour if no matching node ID is found in the loop.

Apply this fix:

-    auto valueIt = userGenNodeIdValueMap.end();
-    ::clp::ffi::SchemaTree::Node::id_t nodeId{};
+    auto valueIt = userGenNodeIdValueMap.end();
+    ::clp::ffi::SchemaTree::Node::id_t nodeId{};

Actually, the code already initializes these correctly. The initialization is appropriate.


102-116: Clear null flag after setting timestamp value

The timestamp case sets values but never clears the null flag that was set at Line 31, leaving all timestamp entries marked as null despite having values.

Apply this fix:

       case ColumnType::Timestamp: {
         auto timestampVector = vector->asFlatVector<Timestamp>();
         if (value->is<double>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<double>()));
         } else if (value->is<int64_t>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<int64_t>()));
         } else {
           VELOX_FAIL("Unsupported timestamp type");
         }
+        vector->setNull(vectorIndex, false);
         break;
       }
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)

30-42: Move nodeIds vector to avoid unnecessary copy

The constructor takes nodeIds by value and copies it into the member. Consider moving it instead to avoid the allocation and copy.

Apply this change:

   ClpIrVectorLoader(
       const std::shared_ptr<
           const std::vector<std::unique_ptr<::clp::ffi::KeyValuePairLogEvent>>>&
           filteredLogEvents,
       bool isResolved,
-      std::vector<::clp::ffi::SchemaTree::Node::id_t> nodeIds,
+      std::vector<::clp::ffi::SchemaTree::Node::id_t>&& nodeIds,
       std::string_view nodeName,
       ColumnType nodeType)
       : filteredLogEvents_(filteredLogEvents),
         isResolved_(isResolved),
         nodeIds_(std::move(nodeIds)),
         nodeName_(nodeName),
         nodeType_(nodeType) {}

Also update the call sites to pass std::move(nodeIds).

velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1)

92-118: Integer division incorrect for determining seconds in nanosecond precision

When the precision is nanoseconds, Line 109 calculates seconds = timestamp / precisionDifference where precisionDifference = Timestamp::kNanosInSecond. This is correct.

However, when precision is seconds, Line 106 sets precisionDifference = Timestamp::kNanosInSecond / Timestamp::kNanosInSecond = 1, making Line 109 compute seconds = timestamp / 1 = timestamp, which is correct.

But for microseconds and milliseconds, the calculation appears inverted. The code divides by a smaller number instead of multiplying, which would give incorrect results.

The logic should be:

  • For nanoseconds: divide by 1e9 to get seconds
  • For microseconds: divide by 1e6 to get seconds
  • For milliseconds: divide by 1e3 to get seconds
  • For seconds: value is already in seconds

Apply this fix:

 inline auto convertToVeloxTimestamp(int64_t timestamp) -> Timestamp {
-  int64_t precisionDifference{Timestamp::kNanosInSecond};
+  int64_t divisor{1};
   switch (estimatePrecision(timestamp)) {
     case InputTimestampPrecision::Nanoseconds:
+      divisor = Timestamp::kNanosInSecond;
       break;
     case InputTimestampPrecision::Microseconds:
-      precisionDifference =
-          Timestamp::kNanosInSecond / Timestamp::kNanosecondsInMicrosecond;
+      divisor = Timestamp::kMicrosecondsInSecond;
       break;
     case InputTimestampPrecision::Milliseconds:
-      precisionDifference =
-          Timestamp::kNanosInSecond / Timestamp::kNanosecondsInMillisecond;
+      divisor = Timestamp::kMillisecondsInSecond;
       break;
     case InputTimestampPrecision::Seconds:
-      precisionDifference =
-          Timestamp::kNanosInSecond / Timestamp::kNanosInSecond;
+      divisor = 1;
       break;
   }
-  int64_t seconds{timestamp / precisionDifference};
+  int64_t seconds{timestamp / divisor};
   int64_t nanoseconds{
-      (timestamp % precisionDifference) *
-      (Timestamp::kNanosInSecond / precisionDifference)};
+      (timestamp % divisor) * (Timestamp::kNanosInSecond / divisor)};
   if (nanoseconds < 0) {
     seconds -= 1;
     nanoseconds += Timestamp::kNanosInSecond;
   }
   return Timestamp(seconds, static_cast<uint64_t>(nanoseconds));
 }

Likely an incorrect or invalid review comment.

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3)

194-201: Correctly gate resolution on non‑empty node‑id lists — LGTM.

The isResolved logic now requires a non‑empty vector. This fixes the earlier ambiguity.


62-68: Fix assertion: allow unresolved projected columns (use <=, not ==).

The equality check contradicts the later handling of unresolved columns and will fail legitimately. Use VELOX_CHECK_LE.

Apply:

-  VELOX_CHECK_EQ(
+  VELOX_CHECK_LE(
       projectedColumnIdxNodeIdsMap_.size(),
       outputColumns_.size(),
       "Resolved node-id map size ({}) must not exceed projected columns ({})",
       projectedColumnIdxNodeIdsMap_.size(),
       outputColumns_.size());

137-141: Include string literal types for timestamps (VarStringT, ClpStringT).

IR can carry timestamps as strings; restricting to Float|Integer drops those paths and defeats the dictionary+utility conversion.

Apply:

       case ColumnType::Timestamp:
-        // TODO: IR timestamp support pending; constrain to Unknown to avoid
-        // mismatched projections.
-        literalType = search::ast::LiteralType::FloatT |
-            search::ast::LiteralType::IntegerT;
+        // IR may encode timestamps as numeric (epoch millis/micros) or as strings;
+        // accept both numeric and string literal types so conversion can occur.
+        literalType = search::ast::LiteralType::FloatT |
+            search::ast::LiteralType::IntegerT |
+            search::ast::LiteralType::VarStringT |
+            search::ast::LiteralType::ClpStringT;
         break;

Comment on lines +46 to +61
auto estimatePrecision(T timestamp) -> InputTimestampPrecision {
constexpr int64_t kEpochMilliseconds1971{31536000000};
constexpr int64_t kEpochMicroseconds1971{31536000000000};
constexpr int64_t kEpochNanoseconds1971{31536000000000000};
auto absTimestamp = timestamp >= 0 ? timestamp : -timestamp;

if (absTimestamp > kEpochNanoseconds1971) {
return InputTimestampPrecision::Nanoseconds;
} else if (absTimestamp > kEpochMicroseconds1971) {
return InputTimestampPrecision::Microseconds;
} else if (absTimestamp > kEpochMilliseconds1971) {
return InputTimestampPrecision::Milliseconds;
} else {
return InputTimestampPrecision::Seconds;
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Precision estimation heuristic may misclassify timestamps near epoch

The heuristic for determining timestamp precision relies on magnitude thresholds that can fail for timestamps close to the Unix epoch (1970). For example, a timestamp of 1000 seconds since epoch (16 minutes 40 seconds after midnight on January 1, 1970) would be misclassified as milliseconds since it's less than kEpochMilliseconds1971.

This could lead to incorrect timestamp conversions, especially for historical data or test scenarios using epoch-relative timestamps.

Consider adding explicit precision metadata or using a more robust heuristic that accounts for the current time range. For example:

 template <typename T>
 auto estimatePrecision(T timestamp) -> InputTimestampPrecision {
   constexpr int64_t kEpochMilliseconds1971{31536000000};
   constexpr int64_t kEpochMicroseconds1971{31536000000000};
   constexpr int64_t kEpochNanoseconds1971{31536000000000000};
+  // Current time bounds for validation (approximate)
+  constexpr int64_t kCurrentEpochSeconds{1700000000};  // ~2023
+  constexpr int64_t kFutureEpochSeconds{2000000000};   // ~2033
+  
   auto absTimestamp = timestamp >= 0 ? timestamp : -timestamp;

+  // Additional check: if value is reasonable as seconds (1970-2033 range)
+  if (absTimestamp > kCurrentEpochSeconds && absTimestamp < kFutureEpochSeconds) {
+    return InputTimestampPrecision::Seconds;
+  }
+
   if (absTimestamp > kEpochNanoseconds1971) {
     return InputTimestampPrecision::Nanoseconds;
   } else if (absTimestamp > kEpochMicroseconds1971) {
     return InputTimestampPrecision::Microseconds;
   } else if (absTimestamp > kEpochMilliseconds1971) {
     return InputTimestampPrecision::Milliseconds;
   } else {
     return InputTimestampPrecision::Seconds;
   }
 }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 190 to +194
auto projectedColumn = outputColumns_[readerIndex_];
auto projectedColumnType = projectedColumn.type;
auto it = projectedColumnIdxNodeIdMap_.find(readerIndex_);
bool isResolved = it != projectedColumnIdxNodeIdMap_.end();
::clp::ffi::SchemaTree::Node::id_t projectedColumnNodeId;
auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Avoid a copy of the projected column.

Bind by const reference; outputColumns_ outlives this scope.

Apply:

-  auto projectedColumn = outputColumns_[readerIndex_];
+  const auto& projectedColumn = outputColumns_[readerIndex_];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto projectedColumn = outputColumns_[readerIndex_];
auto projectedColumnType = projectedColumn.type;
auto it = projectedColumnIdxNodeIdMap_.find(readerIndex_);
bool isResolved = it != projectedColumnIdxNodeIdMap_.end();
::clp::ffi::SchemaTree::Node::id_t projectedColumnNodeId;
auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);
const auto& projectedColumn = outputColumns_[readerIndex_];
auto projectedColumnType = projectedColumn.type;
auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp around lines 192-194, the
code currently copies outputColumns_[readerIndex_] into projectedColumn which is
unnecessary because outputColumns_ outlives this scope; change the binding to a
const reference (e.g., const auto& projectedColumn =
outputColumns_[readerIndex_]) and also make projectedColumnType a reference if
you need to avoid copying its underlying type (e.g., const auto&
projectedColumnType = projectedColumn.type) so no unnecessary copies occur.

Comment on lines 206 to 212
std::make_unique<ClpIrVectorLoader>(
irDeserializer_->get_ir_unit_handler().getFilteredLogEvents(),
isResolved,
projectedColumnType,
projectedColumnNodeId,
irDeserializer_->get_ir_unit_handler().getFilteredLogEvents()),
std::move(projectedColumnNodeIds),
projectedColumn.name,
projectedColumnType),
std::move(vector));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Move of node IDs into loader — LGTM; consider avoiding the prior copy.

Passing std::move(projectedColumnNodeIds) matches the constructor. If the map entry isn’t needed later, you can move directly from the map to skip the intermediate copy.

Option if safe:

-  if (isResolved) {
-    projectedColumnNodeIds = it->second;
-  }
+  if (isResolved) {
+    projectedColumnNodeIds = std::move(it->second); // only if map isn’t used later
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::make_unique<ClpIrVectorLoader>(
irDeserializer_->get_ir_unit_handler().getFilteredLogEvents(),
isResolved,
projectedColumnType,
projectedColumnNodeId,
irDeserializer_->get_ir_unit_handler().getFilteredLogEvents()),
std::move(projectedColumnNodeIds),
projectedColumn.name,
projectedColumnType),
std::move(vector));
const auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);
bool isResolved = it != projectedColumnIdxNodeIdsMap_.end();
std::vector<::clp::ffi::SchemaTree::Node::id_t> projectedColumnNodeIds;
if (isResolved) {
projectedColumnNodeIds = std::move(it->second); // only if map isn’t used later
}
std::make_unique<ClpIrVectorLoader>(
irDeserializer_->get_ir_unit_handler().getFilteredLogEvents(),
isResolved,
std::move(projectedColumnNodeIds),
projectedColumn.name,
projectedColumnType),
std::move(vector));
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp around lines 206–212, the
code currently passes std::move(projectedColumnNodeIds) into the
ClpIrVectorLoader but creates projectedColumnNodeIds as an intermediate copy
beforehand; instead avoid that extra copy by moving the map entry directly into
the loader (e.g., std::move(it->second>) where you have the iterator/key) when
constructing the loader, ensure the map entry is not used afterwards, remove the
intermediate projectedColumnNodeIds variable and any of its uses, and verify
iterator validity and const-correctness so the move is safe.

@anlowee anlowee requested a review from gibber9809 September 25, 2025 18:19
gibber9809
gibber9809 previously approved these changes Sep 25, 2025
Copy link

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. For title how about

feat: Add timestamp type support for kv-ir splits.

@anlowee anlowee changed the title feat: Add timestamp type support for IR stream feat: Add timestamp type support for kv-ir splits. Sep 25, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor edits.

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6a9e49 and 4c3e4e8.

📒 Files selected for processing (1)
  • velox/docs/develop/connectors.rst (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/docs/develop/connectors.rst
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.

Applied to files:

  • velox/docs/develop/connectors.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (2)
velox/docs/develop/connectors.rst (2)

141-146: Add See‑also cross‑references to concrete cursors.

Link readers to the concrete types that implement the two split kinds.

Apply:

-``BaseClpCursor`` is responsible for preparing pushdown operations, loading splits, filtering data, and returning
-results. Each split type (archive or IR stream) has its own corresponding subclass (``ClpArchiveCursor`` and
-``ClpIrCursor``).
+``BaseClpCursor`` is responsible for preparing pushdown operations, loading splits, filtering data, and returning
+results. Each split type (archive or IR stream) has its own corresponding subclass (``ClpArchiveCursor`` and
+``ClpIrCursor``). See also: ``ClpArchiveCursor``, ``ClpIrCursor``.

157-161: Tighten wording: instantiate a concrete cursor; mention timestamp conversion path introduced in this PR.

Clarify that a concrete subclass is created (not the abstract base), and briefly document the timestamp conversion flow.

Apply:

-When a split is added, a ``BaseClpCursor`` instance is created with the split path and input source (which
-may be either a ``ClpArchiveCursor`` or a ``ClpIrCursor``). The query is parsed and simplified into an AST.
+When a split is added, a concrete cursor (``ClpArchiveCursor`` or ``ClpIrCursor``) is created with the split path and
+input source; both derive from ``BaseClpCursor``. The query is parsed and simplified into an AST. For timestamp
+columns, CLP‑s’ ``TimestampDictionaryWriter`` may convert string timestamps to integers, which are then converted to
+Velox timestamps via ``ClpTimestampsUtils`` (handles precision normalization).

Please confirm the exact utility/function names you want to cite from ClpTimestampsUtils (e.g., precision handling entry points), and whether the flow always uses TimestampDictionaryWriter or only for string timestamps. Based on learnings.

Comment on lines +138 to +140
``ClpConnectorSplit`` also includes a ``type`` property that specifies whether the split is an archive or an IR
(Internal Representation) stream.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Fix IR expansion and document allowed type tokens (case‑sensitive).

“IR” is expanded as “Internal Representation”; it should be “Intermediate Representation”. Also, please list the exact literal values accepted by the type property and whether they’re case‑sensitive.

Apply:

-``ClpConnectorSplit`` also includes a ``type`` property that specifies whether the split is an archive or an IR
-(Internal Representation) stream.
+``ClpConnectorSplit`` also includes a ``type`` property that specifies whether the split is an archive or an IR
+(Intermediate Representation) stream. Allowed values: ``archive`` or ``ir`` (case‑sensitive).

To verify the exact tokens/casing against the implementation, run:


🏁 Script executed:

#!/bin/bash
# Find split type parsing/validation and accepted literals.
rg -nP -C3 '(ClpConnectorSplit|Split).*type' --glob 'velox/**'
rg -nP -C3 '\b(archive|ir)\b' --type=cpp --type=cc --type=cxx --type=h --glob 'velox/**'
rg -nP -C3 '(enum|class)\s+.*Split.*Type|k[A-Z][A-Za-z]*Split' --glob 'velox/**'

Length of output: 31892


Fix IR expansion and document exact enum values for type.

Update “IR (Internal Representation)” to “IR (Intermediate Representation)” and clarify that the type property takes the SplitType enum values: kArchive (0) or kIr (1).

🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 138 to 140, update the
description of IR and the `type` property: change “IR (Internal Representation)”
to “IR (Intermediate Representation)” and explicitly state that the `type`
property uses the SplitType enum with values `kArchive` (0) and `kIr` (1);
modify the sentence to read something like: "ClpConnectorSplit also includes a
`type` property that specifies whether the split is an archive or an IR
(Intermediate Representation) stream, using the SplitType enum: `kArchive` (0)
or `kIr` (1)." Ensure punctuation and code formatting match surrounding docs.

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving docs.

@anlowee anlowee merged commit 12521a1 into y-scope:presto-0.293-clp-connector Sep 29, 2025
7 checks passed
@anlowee anlowee deleted the xwei/ir-integration-phase2-part3 branch September 29, 2025 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants