feat: Add timestamp type support for kv-ir splits. #30

anlowee · 2025-09-10T18:45:49Z

Description

This PR is the successor of #29 to handle the timestamp type. We first extracted the timestamp convertion utils from ClpArchiveVectorLoader to share with IR cases for integer and float timestamps. Then we use clp-s's TimestampDictionaryWriter to convert string timestamps to integer then use the util to convert to Velox's timestamp.

This PR also merged some unit test cases which were separated because the timestamp type had not been supported.

This PR also added the docs for IR in velox/docs/develop/connectors.rst.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Passed the CI.

Summary by CodeRabbit

New Features
- IR execution now supports TIMESTAMP columns with robust parsing of ISO 8601 and numeric timestamps across varying precisions.
- Projected columns can map to multiple source fields, improving flexibility with complex schemas.
Documentation
- Connector docs updated to introduce a base cursor with specialised archive/IR implementations and split-type descriptions.
Tests
- Added IR timestamp tests (with and without pushdown) and new NDJSON test data; IR-path tests streamlined.
Chores
- Timestamp utilities exposed as a shared helper for reuse.

…ementation to ClpArchiveCursor

…n of ClpIrCursor

…VectorLoader

…ch related code

…ion-phase2-part2

…ase2-part1

…ion-phase2-part2

coderabbitai · 2025-09-10T18:45:56Z

Walkthrough

Extracts timestamp utilities into a new public header, expands IR projection to support multiple node IDs per projected column, updates ClpIrVectorLoader constructor and lookup logic, adds timestamp handling in IR loader, updates CMake and docs, and adjusts tests and test data for IR timestamp coverage.

Changes

Cohort / File(s)	Summary of Changes
Timestamp utilities extraction `velox/connectors/clp/search_lib/ClpTimestampsUtils.h`, `velox/connectors/clp/search_lib/CMakeLists.txt`, `velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp`	Adds `ClpTimestampsUtils.h` (precision estimation and conversion APIs) and exposes it via CMake; replaces in-file timestamp helpers in archive loader with the new header.
IR projection: single → multi node IDs; callback update `velox/connectors/clp/search_lib/ir/ClpIrCursor.h`, `velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp`, `velox/connectors/clp/search_lib/ir/CMakeLists.txt`	Changes projected-column map type from single `Node::id_t` to `std::vector<Node::id_t>`; projectionResolutionCallback now accumulates multiple node IDs per key; adjusts related aliases and adds `clp_s::search::ast` to PRIVATE link libs.
IR loader: constructor/signature & multi-key lookup `velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h`, `velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp`	Reorders and expands constructor to accept filtered events first, then isResolved, vector of nodeIds, nodeName, and nodeType; replaces single-node lookup with searching candidate `nodeIds[]` and selecting the first present value; internal state now uses `nodeIds_` and `nodeName_`.
IR loader: Timestamp column support `velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp`, `velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp`	Adds explicit handling for `Timestamp` column type in IR loader, converting `double`/`int64_t` inputs to Velox `Timestamp` via the new timestamp utils; archive loader includes the utils header to remove inline helpers.
Tests: IR path simplification and timestamp coverage `velox/connectors/clp/tests/ClpConnectorTest.cpp`, `velox/connectors/clp/tests/examples/test_4.ndjson`	Removes duplicated IR plan construction in tests and reuses existing plans for IR execution; adds two IR timestamp tests (with and without pushdown); adds NDJSON test data containing ISO8601 and numeric timestamps.
Docs: cursor architecture update `velox/docs/develop/connectors.rst`	Documents new `BaseClpCursor` concept with concrete subclasses (`ClpArchiveCursor`, `ClpIrCursor`) and adds a `type` discriminator to `ClpConnectorSplit`; updates CLP data source flow wording to reflect the base/specialized cursor architecture.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant S as Split
  participant DS as ClpDataSource
  participant C as BaseClpCursor<br/>(ClpIrCursor)
  participant L as ClpIrVectorLoader
  participant E as FilteredLogEvents

  Note over S,DS: IR split selected
  S->>DS: Provide split (type=IR)
  DS->>C: Create ClpIrCursor
  C->>C: Resolve projected columns<br/>to node ID vectors
  C->>L: ClpIrVectorLoader(filteredEvents,<br/>isResolved, nodeIds[], columnName, columnType)
  L->>E: Iterate events
  rect rgba(200,220,255,0.25)
    Note right of L: Multi-ID lookup
    L->>L: For nodeIds[] find first<br/>present value
  end
  alt Value found
    L-->>C: Append to column vector<br/>(incl. Timestamp via utils)
  else None found
    L-->>C: Skip row
  end
  C-->>DS: Vectors for projected columns
  DS-->>S: Return results

sequenceDiagram
  autonumber
  participant V as Value
  participant U as ClpTimestampsUtils
  participant T as Velox::Timestamp

  Note over V,U: Timestamp conversion
  V->>U: convertToVeloxTimestamp(value)
  alt double
    U->>U: estimatePrecision(double)
    U->>T: Build seconds + nanos
  else int64_t
    U->>U: estimatePrecision(int64_t)
    U->>T: Normalize to seconds + nanos
  else other
    U-->>V: Error (unsupported)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title “feat: Add timestamp type support for kv-ir splits.” succinctly and accurately captures the core enhancement of introducing timestamp handling into key-value IR splits, mirroring the utility extraction, code updates, tests, and documentation changes in this PR. It is focused, clear, and omits extraneous details or vague terminology, allowing reviewers to immediately grasp the main change.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.

✨ Finishing touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 16

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae5119 and 10e222a.

📒 Files selected for processing (10)

velox/connectors/clp/search_lib/BaseClpCursor.cpp (1 hunks)
velox/connectors/clp/search_lib/BaseClpCursor.h (1 hunks)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (0 hunks)
velox/connectors/clp/search_lib/ir/CMakeLists.txt (1 hunks)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3 hunks)
velox/connectors/clp/search_lib/ir/ClpIrCursor.h (2 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (3 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)
velox/connectors/clp/tests/ClpConnectorTest.cpp (4 hunks)
velox/connectors/clp/tests/examples/test_4.ndjson (1 hunks)

💤 Files with no reviewable changes (1)

velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-08-20T18:13:32.442Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

🧬 Code graph analysis (3)

velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)

velox/type/Timestamp.h (1)

kMillisecondsInSecond (80-157)

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)

velox/connectors/clp/search_lib/BaseClpCursor.cpp (4)

convertToVeloxTimestamp (51-69)

convertToVeloxTimestamp (51-51)

convertToVeloxTimestamp (71-98)

convertToVeloxTimestamp (71-71)

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)

velox/connectors/clp/search_lib/ir/ClpIrCursor.h (1)

readerIndex_ (82-82)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build with GCC / Ubuntu release with resolve_dependency

🔇 Additional comments (10)

velox/connectors/clp/search_lib/ir/CMakeLists.txt (2)

25-25: Confirm PUBLIC vs PRIVATE for archive_writer on a STATIC target

Because clp-s-ir-search is STATIC, PUBLIC is often necessary to propagate link dependencies to final executables. If no downstream target needs to directly reference archive_writer symbols (only this static lib does), PRIVATE can reduce leakage. Please confirm transitive link needs and keep PUBLIC only if consumers require it.

32-32: Validate necessity of linking clp_s::search::ast

If this target’s object files don’t reference symbols from clp_s::search::ast (headers only), you can drop this to avoid unnecessary linkage. If it is required indirectly via ClpIrCursor.* types, keep it.

velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)

71-98: LGTM: int64_t→Timestamp covers seconds/ms/µs/ns and negatives correctly

Division/modulo with post‑adjustment for negatives yields canonical (sec, nanos). Nice.

velox/connectors/clp/tests/ClpConnectorTest.cpp (4)

261-267: IR parity assertion added — good

Great to assert IR produces the same result as archive for this plan.

318-324: IR parity assertion added — good

Same here; validates pushdown path parity.

378-385: IR parity for hybrid plan — good

Covers mixed projection with computed filter. Nice.

418-445: Extend test4IrTimestampNoPushdown to cover µs/ns and string timestamps

test_4_ir.clps contains an ISO8601 string timestamp and additional encoded timestamp representations; update velox/connectors/clp/tests/ClpConnectorTest.cpp (test4IrTimestampNoPushdown, ~lines 418–445) to assert all qualifying rows for second/microsecond/nanosecond precisions and add a separate assertion (or test) that exercises the ISO8601 string (dictionary) path.
Shall I implement the test changes to assert sec/µs/ns and add the string-timestamp case?
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)
103-123: Prefer std::string_view for dictionary ingestion; avoid raw .data()

Passing .data() drops length and risks embedded NULs — prefer std::string_view (or const std::string&) if the ingest_entry overload supports it.

File: velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (around lines 113–119)
-        } else if (value->is<std::string>()) {
-          auto stringValue = value->get_immutable_view<std::string>().data();
-          uint64_t encodingId{};
-          auto const timestamp = timestampDict_.ingest_entry(
-              nodeName_, nodeId, stringValue, encodingId);
-          timestampVector->set(vectorIndex, convertToVeloxTimestamp(timestamp));
+        } else if (value->is<std::string>()) {
+          const auto& s = value->get_immutable_view<std::string>();
+          std::string_view sv{s.data(), s.size()};
+          uint64_t encodingId{0};
+          auto ts = timestampDict_.ingest_entry(nodeName_, nodeId, sv, encodingId);
+          timestampVector->set(vectorIndex, convertToVeloxTimestamp(ts));
         } else {
I could not find clp_s/TimestampDictionaryWriter.hpp in the repo to confirm ingest_entry's signature — verify it accepts std::string_view; if not, pass the string with length (s, s.size() or s.c_str() plus length) per the actual API.
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (2)

62-66: LGTM: size invariant check

The check aligns projected node‑id maps with output columns.

193-198: LGTM: multi‑ID lookup wired correctly

Resolution check and capture of per‑column node IDs look correct.

Please confirm projectionResolutionCallback_ always inserts a map entry per output column (even if unresolved) to satisfy the size check above.

velox/connectors/clp/search_lib/BaseClpCursor.cpp

velox/connectors/clp/search_lib/BaseClpCursor.h

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h

velox/connectors/clp/tests/ClpConnectorTest.cpp

velox/connectors/clp/tests/examples/test_4.ndjson

anlowee · 2025-09-15T20:45:37Z

Going to fix some of the coderabbiai comments first.

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10e222a and 1928748.

📒 Files selected for processing (1)

velox/docs/develop/connectors.rst (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
GitHub Check: Build

velox/docs/develop/connectors.rst

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

velox/connectors/clp/search_lib/BaseClpCursor.h (1)
19-27: Missing headers for std::floor and fixed-width integers

This header uses std::floor, int64_t, uint64_t, and uint8_t but doesn’t include /. Add explicit includes to avoid brittle transitive dependencies.
 #include <memory>
 #include <string>
 #include <string_view>
 #include <vector>
+#include <cmath>
+#include <cstdint>
 
 #include "clp_s/InputConfig.hpp"
 #include "velox/connectors/clp/ClpConnectorSplit.h"
 #include "velox/type/Timestamp.h"

♻️ Duplicate comments (6)

velox/connectors/clp/tests/examples/test_4.ndjson (1)

1-3: Add one more ISO‑8601 row to fully exercise the string‑dictionary path

Great start. To validate uniqueness/order and hit the dictionary more than once, add a second string timestamp (later than Line 1) and update the test to assert both are ingested correctly.

Example to append:
{"timestamp": "2025-04-30T08:50:06.500Z"}
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)
21-21: Reduce transitive include and avoid eager construction of TimestampDictionaryWriter

Rebuild surface: including TimestampDictionaryWriter in a public header + holding it by value expands transitive dependencies and alloc/cost for non‑Timestamp columns. Forward‑declare and hold it lazily.

Minimal change using std::optional (keeps current usage pattern):
+#include <optional>
-#include "clp_s/TimestampDictionaryWriter.hpp"
+#include "clp_s/TimestampDictionaryWriter.hpp" // keep for now if optional stays in header
...
-  ColumnType nodeType_;
-  clp_s::TimestampDictionaryWriter timestampDict_;
+  ColumnType nodeType_;
+  std::optional<clp_s::TimestampDictionaryWriter> timestampDict_;
Pair with the .cpp change to emplace on first use (see suggested diff in ClpIrVectorLoader.cpp).

If compile times regress, consider a pimpl: forward‑declare the writer here and store std::unique_ptr<...>; define the destructor and all uses in the .cpp.

Also applies to: 50-55
velox/docs/develop/connectors.rst (2)
157-161: Tighten wording and mention the timestamp conversion flow

Clarify concrete subclass instantiation and note the timestamp path added in this PR.
-When a split is added, a ``BaseClpCursor`` instance is created with the split path and input source, which
-may be either a ``ClpArchiveCursor`` or a ``ClpIrCursor``. The query is parsed and simplified into an AST.
+When a split is added, a concrete cursor (``ClpArchiveCursor`` or ``ClpIrCursor``) is created with the split path and input source; both derive from ``BaseClpCursor``. The query is parsed and simplified into an AST. For timestamp columns, string timestamps may be converted to integers via CLP‑s’ ``TimestampDictionaryWriter`` and then to Velox timestamps via shared conversion utilities.
137-140: Spell out allowed values for the new split type

Make the accepted tokens explicit (and case‑sensitive if applicable).
- includes a ``type`` property that specifies whether the split is an archive or an IR (Intermediate Representation)
- stream.
+ includes a ``type`` property that specifies whether the split is an archive or an IR (Intermediate Representation)
+ stream. Allowed values: ``archive`` or ``ir`` (case‑sensitive).
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (2)
205-209: Good: moved node IDs to avoid a copy

std::move(projectedColumnNodeIds) here is the right optimisation and matches the constructor.

136-140: Include string literal types for timestamp projection

Without VarStringT|ClpStringT, string‑encoded timestamps won’t resolve via the dictionary path.

Apply:
-      case ColumnType::Timestamp:
-        // TODO: IR timestamp support pending; constrain to Unknown to avoid
-        // mismatched projections.
-        literalType = search::ast::LiteralType::FloatT |
-            search::ast::LiteralType::IntegerT;
+      case ColumnType::Timestamp:
+        // IR timestamp supported: accept numeric and string literal forms.
+        literalType = search::ast::LiteralType::FloatT |
+            search::ast::LiteralType::IntegerT |
+            search::ast::LiteralType::VarStringT |
+            search::ast::LiteralType::ClpStringT;

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1928748 and 1982c3b.

📒 Files selected for processing (6)

velox/connectors/clp/search_lib/BaseClpCursor.h (2 hunks)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)
velox/connectors/clp/tests/examples/test_4.ndjson (1 hunks)
velox/docs/develop/connectors.rst (2 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-08-20T18:00:51.241Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

velox/connectors/clp/search_lib/BaseClpCursor.h

📚 Learning: 2025-08-20T18:13:32.442Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

🧬 Code graph analysis (2)

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

velox/type/Timestamp.h (2)

TimestampPrecision (33-458)

kMillisecondsInSecond (80-157)

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)

velox/connectors/clp/search_lib/ir/ClpIrCursor.h (1)

readerIndex_ (82-82)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build with GCC / Ubuntu release with resolve_dependency

🔇 Additional comments (2)

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)

36-45: LGTM: iterator/nodeId initialisation fixed

Initialising valueIt to end() and default‑initialising nodeId eliminates UB if no nodeId matches. Good fix.

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)

31-44: Move nodeIds into the member (good), keep call sites passing std::move

Using std::move(nodeIds) avoids an extra copy; ensure call sites pass std::move(projectedColumnNodeIds) as done in ClpIrCursor.cpp.

velox/connectors/clp/search_lib/BaseClpCursor.h

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

velox/connectors/clp/search_lib/BaseClpCursor.h (1)
19-27: Missing headers for uint8_t and std::floor.

Add for uint8_t and for std::floor to avoid fragile transitive includes.
 #pragma once
 
 #include <memory>
 #include <string>
 #include <string_view>
 #include <vector>
+#include <cstdint>
+#include <cmath>
 
 #include "clp_s/InputConfig.hpp"
 #include "velox/connectors/clp/ClpConnectorSplit.h"
 #include "velox/type/Timestamp.h"

♻️ Duplicate comments (4)

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (2)

136-140: Include string literal types for Timestamp projections and update stale TODO.

IR can carry timestamps as strings; limit to Float|Integer drops them.

-      case ColumnType::Timestamp:
-        // TODO: IR timestamp support pending; constrain to Unknown to avoid
-        // mismatched projections.
-        literalType = search::ast::LiteralType::FloatT |
-            search::ast::LiteralType::IntegerT;
+      case ColumnType::Timestamp:
+        literalType = search::ast::LiteralType::FloatT |
+            search::ast::LiteralType::IntegerT |
+            search::ast::LiteralType::VarStringT |
+            search::ast::LiteralType::ClpStringT;

62-66: Relax the equality check — allow unresolved columns.

Equality trips when some projections intentionally have no node IDs.

-  VELOX_CHECK_EQ(
-      projectedColumnIdxNodeIdsMap_.size(),
-      outputColumns_.size(),
-      "Resolved node-id map size ({}) must not exceed projected columns ({})",
-      projectedColumnIdxNodeIdsMap_.size(),
-      outputColumns_.size());
+  VELOX_CHECK_LE(
+      projectedColumnIdxNodeIdsMap_.size(),
+      outputColumns_.size(),
+      "Resolved node-id map size ({}) must not exceed projected columns ({})",
+      projectedColumnIdxNodeIdsMap_.size(),
+      outputColumns_.size());

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (1)

100-114: Timestamp branch leaves null bit set and lacks string‑timestamp support.

You set a value but never clear the null flag.
PR intent mentions string → integer conversion via TimestampDictionaryWriter; current code rejects strings.

Apply this diff to clear the null and handle strings (pairs with header change below):

       case ColumnType::Timestamp: {
         auto timestampVector = vector->asFlatVector<Timestamp>();
-        if (value->is<double>()) {
+        if (value->is<double>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<double>()));
-        } else if (value->is<int64_t>()) {
+        } else if (value->is<int64_t>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<int64_t>()));
+        } else if (value->is<std::string>()) {
+          // Lazily create the dictionary the first time we see a string ts.
+          if (!timestampDict_) {
+            timestampDict_ = std::make_unique<clp_s::TimestampDictionaryWriter>();
+          }
+          auto const& stringValue = value->get_immutable_view<std::string>();
+          uint64_t encodingId{};
+          auto const tsInt = timestampDict_->ingest_entry(
+              nodeName_, nodeId, stringValue, encodingId);
+          timestampVector->set(vectorIndex, convertToVeloxTimestamp(tsInt));
         } else {
           VELOX_FAIL("Unsupported timestamp type");
         }
+        vector->setNull(vectorIndex, false);
         break;
       }

And in this .cpp top‑section, include the writer and define the out‑of‑line dtor:

+#include "clp_s/TimestampDictionaryWriter.hpp"
+
+ClpIrVectorLoader::~ClpIrVectorLoader() = default;

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)

49-53: Add a writer handle for string timestamps (forward‑declared) to avoid header bloat.

Forward‑declare the writer and store a unique_ptr; define the destructor out‑of‑line in the .cpp that includes the writer header (see paired .cpp diff above).

 namespace facebook::velox::connector::clp::search_lib {
 
 class ClpIrVectorLoader : public VectorLoader {
  public:
+  ~ClpIrVectorLoader() override;
@@
  private:
   simdjson::ondemand::parser arrayParser_;
   std::shared_ptr<
       const std::vector<std::unique_ptr<::clp::ffi::KeyValuePairLogEvent>>>
       filteredLogEvents_;
   bool isResolved_;
   std::vector<::clp::ffi::SchemaTree::Node::id_t> nodeIds_;
   std::string nodeName_;
   ColumnType nodeType_;
+  // Lazily initialised in Timestamp branch.
+  std::unique_ptr<clp_s::TimestampDictionaryWriter> timestampDict_;

And add a forward declaration at the top of this header:

 #include "velox/connectors/clp/search_lib/BaseClpCursor.h"
 #include "velox/vector/FlatVector.h"
 #include "velox/vector/LazyVector.h"
 
+namespace clp_s { class TimestampDictionaryWriter; }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1982c3b and 947fded.

📒 Files selected for processing (5)

velox/connectors/clp/search_lib/BaseClpCursor.h (2 hunks)
velox/connectors/clp/search_lib/ir/CMakeLists.txt (1 hunks)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-08-20T18:13:32.442Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h

📚 Learning: 2025-08-20T18:00:51.241Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

velox/connectors/clp/search_lib/BaseClpCursor.h

🧬 Code graph analysis (2)

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)

velox/connectors/clp/search_lib/ir/ClpIrCursor.h (1)

readerIndex_ (82-82)

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

velox/type/Timestamp.h (1)

kMillisecondsInSecond (80-157)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build

🔇 Additional comments (7)

velox/connectors/clp/search_lib/ir/CMakeLists.txt (1)

32-32: Add of clp_s::search::ast looks correct (PRIVATE).

Linking the AST lib is appropriate given IR files reference search::ast types.

Please confirm there are no public headers in this target exposing search::ast symbols; if there are, flip to PUBLIC.

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2)

36-44: Good fix: iterator and nodeId initialisation eliminates UB.

Initialising valueIt to end() and nodeId to {} avoids dereferencing garbage when no node matches.

45-47: Selection precedence relies on nodeIds_ order — document/verify.

You pick the first present nodeId; behaviour depends on nodeIds_ order. Document this precedence and add a test asserting deterministic selection.

Also applies to: 49-49

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

64-70: Timestamp precision utilities look solid.

Naming avoids clashes (InputTimestampPrecision) and the heuristic is reasonable given current formats.

Also applies to: 86-103

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)

193-199: Accurate resolved‑state and move semantics — LGTM.

Treating empty vectors as unresolved and moving node IDs into the loader avoids ambiguity and copies.

Also applies to: 206-210

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2)

22-22: Include path correction — LGTM.

Using the absolute project include keeps headers consistent.

33-43: Constructor/state wiring looks good; avoids copies.

Passing nodeIds by value and moving into the member is acceptable given the call site moves it.

velox/connectors/clp/search_lib/BaseClpCursor.h

gibber9809

Nice work! Just a few minor comments.

It also might make sense to move the timestamp heuristic stuff into its own header instead of keeping it in BaseClpCursor.h, but I'll leave that up to you.

gibber9809 · 2025-09-19T15:01:49Z

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

        // TODO: IR timestamp support pending; constrain to Unknown to avoid
        // mismatched projections.


Should be able to delete this comment and/or update it to something mentioning future string timestamp support?

I just realized that Presto might only store the timestamp type data either in milliseconds or microseconds, and both of them are numeric actually. If that is the case and we only use velox for Presto, then we might don't need to support this. Let me double check.

Just confirmed. Presto stores the TIMESTAMP as integer or a packed number for TIMESTAMP WITH TIMEZONE. So I expect that the timestamp pushdown passed to Velox will never contain date string. Therefore, the issue here is if the timestamp field in the IR file is actually a string, then it will have issues.

I will just remove this comments when we have a concrete idea to address this difference between IR and SFA. I have an issue to track this: #32

gibber9809 · 2025-09-19T15:05:10Z

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp

+    for (size_t i{0}; i < nodeIds_.size(); ++i) {
+      valueIt = userGenNodeIdValueMap.find(nodeIds_[i]);
+      if (valueIt != userGenNodeIdValueMap.end()) {
+        nodeId = nodeIds_[i];
+        break;
+      }
+    }


Suggested change

for (size_t i{0}; i < nodeIds_.size(); ++i) {

valueIt = userGenNodeIdValueMap.find(nodeIds_[i]);

if (valueIt != userGenNodeIdValueMap.end()) {

nodeId = nodeIds_[i];

break;

}

}

for (auto const candidateNodeId : nodeIds_) {

valueIt = userGenNodeIdValueMap.find(candidateNodeId);

if (valueIt != userGenNodeIdValueMap.end()) {

nodeId = candidateNodeId;

break;

}

}

Sort of up to taste, but I think this is a bit more readable.

gibber9809 · 2025-09-19T15:06:26Z

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h

  std::shared_ptr<
      const std::vector<std::unique_ptr<::clp::ffi::KeyValuePairLogEvent>>>
      filteredLogEvents_;
+  bool isResolved_;


Suggested change

bool isResolved_;

bool isResolved_{};

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (1)
179-186: Pre‑reserve children to avoid reallocations.

Micro‑optimisation: reserve space before pushing back.

Apply:
     std::vector<VectorPtr> children;
     auto& rowType = vectorType->as<TypeKind::ROW>();
+    children.reserve(rowType.size());

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 947fded and 03d251d.

📒 Files selected for processing (6)

velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1 hunks)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1 hunks)
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (4 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (3 hunks)
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (2 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-08-20T18:13:32.442Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

📚 Learning: 2025-08-20T18:00:51.241Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

🧬 Code graph analysis (1)

velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1)

velox/type/Timestamp.h (1)

kMillisecondsInSecond (80-157)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build with GCC / Ubuntu release with resolve_dependency

🔇 Additional comments (9)

velox/connectors/clp/search_lib/CMakeLists.txt (1)

14-23: LGTM!

The addition of ClpTimestampsUtils.h to the public headers is appropriate for the timestamp utility refactoring.

velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)

17-24: LGTM!

Good refactoring to extract timestamp utilities into a shared header. This promotes code reuse between the archive and IR components.
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.cpp (2)
38-46: Initialize iterator and nodeId to safe defaults

The variables valueIt and nodeId should be initialized to safe values to prevent undefined behaviour if no matching node ID is found in the loop.

Apply this fix:
-    auto valueIt = userGenNodeIdValueMap.end();
-    ::clp::ffi::SchemaTree::Node::id_t nodeId{};
+    auto valueIt = userGenNodeIdValueMap.end();
+    ::clp::ffi::SchemaTree::Node::id_t nodeId{};
Actually, the code already initializes these correctly. The initialization is appropriate.

102-116: Clear null flag after setting timestamp value

The timestamp case sets values but never clears the null flag that was set at Line 31, leaving all timestamp entries marked as null despite having values.

Apply this fix:
       case ColumnType::Timestamp: {
         auto timestampVector = vector->asFlatVector<Timestamp>();
         if (value->is<double>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<double>()));
         } else if (value->is<int64_t>()) {
           timestampVector->set(
               vectorIndex,
               convertToVeloxTimestamp(value->get_immutable_view<int64_t>()));
         } else {
           VELOX_FAIL("Unsupported timestamp type");
         }
+        vector->setNull(vectorIndex, false);
         break;
       }
velox/connectors/clp/search_lib/ir/ClpIrVectorLoader.h (1)
30-42: Move nodeIds vector to avoid unnecessary copy

The constructor takes nodeIds by value and copies it into the member. Consider moving it instead to avoid the allocation and copy.

Apply this change:
   ClpIrVectorLoader(
       const std::shared_ptr<
           const std::vector<std::unique_ptr<::clp::ffi::KeyValuePairLogEvent>>>&
           filteredLogEvents,
       bool isResolved,
-      std::vector<::clp::ffi::SchemaTree::Node::id_t> nodeIds,
+      std::vector<::clp::ffi::SchemaTree::Node::id_t>&& nodeIds,
       std::string_view nodeName,
       ColumnType nodeType)
       : filteredLogEvents_(filteredLogEvents),
         isResolved_(isResolved),
         nodeIds_(std::move(nodeIds)),
         nodeName_(nodeName),
         nodeType_(nodeType) {}
Also update the call sites to pass std::move(nodeIds).
velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1)
92-118: Integer division incorrect for determining seconds in nanosecond precision

When the precision is nanoseconds, Line 109 calculates seconds = timestamp / precisionDifference where precisionDifference = Timestamp::kNanosInSecond. This is correct.

However, when precision is seconds, Line 106 sets precisionDifference = Timestamp::kNanosInSecond / Timestamp::kNanosInSecond = 1, making Line 109 compute seconds = timestamp / 1 = timestamp, which is correct.

But for microseconds and milliseconds, the calculation appears inverted. The code divides by a smaller number instead of multiplying, which would give incorrect results.

The logic should be:

For nanoseconds: divide by 1e9 to get seconds

For microseconds: divide by 1e6 to get seconds

For milliseconds: divide by 1e3 to get seconds

For seconds: value is already in seconds

Apply this fix:
 inline auto convertToVeloxTimestamp(int64_t timestamp) -> Timestamp {
-  int64_t precisionDifference{Timestamp::kNanosInSecond};
+  int64_t divisor{1};
   switch (estimatePrecision(timestamp)) {
     case InputTimestampPrecision::Nanoseconds:
+      divisor = Timestamp::kNanosInSecond;
       break;
     case InputTimestampPrecision::Microseconds:
-      precisionDifference =
-          Timestamp::kNanosInSecond / Timestamp::kNanosecondsInMicrosecond;
+      divisor = Timestamp::kMicrosecondsInSecond;
       break;
     case InputTimestampPrecision::Milliseconds:
-      precisionDifference =
-          Timestamp::kNanosInSecond / Timestamp::kNanosecondsInMillisecond;
+      divisor = Timestamp::kMillisecondsInSecond;
       break;
     case InputTimestampPrecision::Seconds:
-      precisionDifference =
-          Timestamp::kNanosInSecond / Timestamp::kNanosInSecond;
+      divisor = 1;
       break;
   }
-  int64_t seconds{timestamp / precisionDifference};
+  int64_t seconds{timestamp / divisor};
   int64_t nanoseconds{
-      (timestamp % precisionDifference) *
-      (Timestamp::kNanosInSecond / precisionDifference)};
+      (timestamp % divisor) * (Timestamp::kNanosInSecond / divisor)};
   if (nanoseconds < 0) {
     seconds -= 1;
     nanoseconds += Timestamp::kNanosInSecond;
   }
   return Timestamp(seconds, static_cast<uint64_t>(nanoseconds));
 }
Likely an incorrect or invalid review comment.
velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp (3)
194-201: Correctly gate resolution on non‑empty node‑id lists — LGTM.

The isResolved logic now requires a non‑empty vector. This fixes the earlier ambiguity.

62-68: Fix assertion: allow unresolved projected columns (use <=, not ==).

The equality check contradicts the later handling of unresolved columns and will fail legitimately. Use VELOX_CHECK_LE.

Apply:
-  VELOX_CHECK_EQ(
+  VELOX_CHECK_LE(
       projectedColumnIdxNodeIdsMap_.size(),
       outputColumns_.size(),
       "Resolved node-id map size ({}) must not exceed projected columns ({})",
       projectedColumnIdxNodeIdsMap_.size(),
       outputColumns_.size());
137-141: Include string literal types for timestamps (VarStringT, ClpStringT).

IR can carry timestamps as strings; restricting to Float|Integer drops those paths and defeats the dictionary+utility conversion.

Apply:
       case ColumnType::Timestamp:
-        // TODO: IR timestamp support pending; constrain to Unknown to avoid
-        // mismatched projections.
-        literalType = search::ast::LiteralType::FloatT |
-            search::ast::LiteralType::IntegerT;
+        // IR may encode timestamps as numeric (epoch millis/micros) or as strings;
+        // accept both numeric and string literal types so conversion can occur.
+        literalType = search::ast::LiteralType::FloatT |
+            search::ast::LiteralType::IntegerT |
+            search::ast::LiteralType::VarStringT |
+            search::ast::LiteralType::ClpStringT;
         break;

coderabbitai · 2025-09-22T15:29:53Z

velox/connectors/clp/search_lib/ClpTimestampsUtils.h

+auto estimatePrecision(T timestamp) -> InputTimestampPrecision {
+  constexpr int64_t kEpochMilliseconds1971{31536000000};
+  constexpr int64_t kEpochMicroseconds1971{31536000000000};
+  constexpr int64_t kEpochNanoseconds1971{31536000000000000};
+  auto absTimestamp = timestamp >= 0 ? timestamp : -timestamp;
+
+  if (absTimestamp > kEpochNanoseconds1971) {
+    return InputTimestampPrecision::Nanoseconds;
+  } else if (absTimestamp > kEpochMicroseconds1971) {
+    return InputTimestampPrecision::Microseconds;
+  } else if (absTimestamp > kEpochMilliseconds1971) {
+    return InputTimestampPrecision::Milliseconds;
+  } else {
+    return InputTimestampPrecision::Seconds;
+  }
+}


⚠️ Potential issue

Precision estimation heuristic may misclassify timestamps near epoch

The heuristic for determining timestamp precision relies on magnitude thresholds that can fail for timestamps close to the Unix epoch (1970). For example, a timestamp of 1000 seconds since epoch (16 minutes 40 seconds after midnight on January 1, 1970) would be misclassified as milliseconds since it's less than kEpochMilliseconds1971.

This could lead to incorrect timestamp conversions, especially for historical data or test scenarios using epoch-relative timestamps.

Consider adding explicit precision metadata or using a more robust heuristic that accounts for the current time range. For example:

template <typename T> auto estimatePrecision(T timestamp) -> InputTimestampPrecision { constexpr int64_t kEpochMilliseconds1971{31536000000}; constexpr int64_t kEpochMicroseconds1971{31536000000000}; constexpr int64_t kEpochNanoseconds1971{31536000000000000}; + // Current time bounds for validation (approximate) + constexpr int64_t kCurrentEpochSeconds{1700000000}; // ~2023 + constexpr int64_t kFutureEpochSeconds{2000000000}; // ~2033 + auto absTimestamp = timestamp >= 0 ? timestamp : -timestamp; + // Additional check: if value is reasonable as seconds (1970-2033 range) + if (absTimestamp > kCurrentEpochSeconds && absTimestamp < kFutureEpochSeconds) { + return InputTimestampPrecision::Seconds; + } + if (absTimestamp > kEpochNanoseconds1971) { return InputTimestampPrecision::Nanoseconds; } else if (absTimestamp > kEpochMicroseconds1971) { return InputTimestampPrecision::Microseconds; } else if (absTimestamp > kEpochMilliseconds1971) { return InputTimestampPrecision::Milliseconds; } else { return InputTimestampPrecision::Seconds; } }

Committable suggestion skipped: line range outside the PR's diff.

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

coderabbitai · 2025-09-22T15:29:54Z

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

  auto projectedColumn = outputColumns_[readerIndex_];
  auto projectedColumnType = projectedColumn.type;
-  auto it = projectedColumnIdxNodeIdMap_.find(readerIndex_);
-  bool isResolved = it != projectedColumnIdxNodeIdMap_.end();
-  ::clp::ffi::SchemaTree::Node::id_t projectedColumnNodeId;
+  auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);


🧹 Nitpick

Avoid a copy of the projected column.

Bind by const reference; outputColumns_ outlives this scope.

Apply:

- auto projectedColumn = outputColumns_[readerIndex_]; + const auto& projectedColumn = outputColumns_[readerIndex_];

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

auto projectedColumn = outputColumns_[readerIndex_];

auto projectedColumnType = projectedColumn.type;

auto it = projectedColumnIdxNodeIdMap_.find(readerIndex_);

bool isResolved = it != projectedColumnIdxNodeIdMap_.end();

::clp::ffi::SchemaTree::Node::id_t projectedColumnNodeId;

auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);

const auto& projectedColumn = outputColumns_[readerIndex_];

auto projectedColumnType = projectedColumn.type;

auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);

🤖 Prompt for AI Agents

In velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp around lines 192-194, the code currently copies outputColumns_[readerIndex_] into projectedColumn which is unnecessary because outputColumns_ outlives this scope; change the binding to a const reference (e.g., const auto& projectedColumn = outputColumns_[readerIndex_]) and also make projectedColumnType a reference if you need to avoid copying its underlying type (e.g., const auto& projectedColumnType = projectedColumn.type) so no unnecessary copies occur.

coderabbitai · 2025-09-22T15:29:54Z

velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp

      std::make_unique<ClpIrVectorLoader>(
+          irDeserializer_->get_ir_unit_handler().getFilteredLogEvents(),
          isResolved,
-          projectedColumnType,
-          projectedColumnNodeId,
-          irDeserializer_->get_ir_unit_handler().getFilteredLogEvents()),
+          std::move(projectedColumnNodeIds),
+          projectedColumn.name,
+          projectedColumnType),
      std::move(vector));


🧹 Nitpick

Move of node IDs into loader — LGTM; consider avoiding the prior copy.

Passing std::move(projectedColumnNodeIds) matches the constructor. If the map entry isn’t needed later, you can move directly from the map to skip the intermediate copy.

Option if safe:

- if (isResolved) { - projectedColumnNodeIds = it->second; - } + if (isResolved) { + projectedColumnNodeIds = std::move(it->second); // only if map isn’t used later + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

std::make_unique<ClpIrVectorLoader>(

irDeserializer_->get_ir_unit_handler().getFilteredLogEvents(),

isResolved,

projectedColumnType,

projectedColumnNodeId,

irDeserializer_->get_ir_unit_handler().getFilteredLogEvents()),

std::move(projectedColumnNodeIds),

projectedColumn.name,

projectedColumnType),

std::move(vector));

const auto it = projectedColumnIdxNodeIdsMap_.find(readerIndex_);

bool isResolved = it != projectedColumnIdxNodeIdsMap_.end();

std::vector<::clp::ffi::SchemaTree::Node::id_t> projectedColumnNodeIds;

if (isResolved) {

projectedColumnNodeIds = std::move(it->second); // only if map isn’t used later

}

std::make_unique<ClpIrVectorLoader>(

irDeserializer_->get_ir_unit_handler().getFilteredLogEvents(),

isResolved,

std::move(projectedColumnNodeIds),

projectedColumn.name,

projectedColumnType),

std::move(vector));

🤖 Prompt for AI Agents

In velox/connectors/clp/search_lib/ir/ClpIrCursor.cpp around lines 206–212, the code currently passes std::move(projectedColumnNodeIds) into the ClpIrVectorLoader but creates projectedColumnNodeIds as an intermediate copy beforehand; instead avoid that extra copy by moving the map entry directly into the loader (e.g., std::move(it->second>) where you have the iterator/key) when constructing the loader, ensure the map entry is not used afterwards, remove the intermediate projectedColumnNodeIds variable and any of its uses, and verify iterator validity and const-correctness so the move is safe.

gibber9809

LGTM. For title how about

feat: Add timestamp type support for kv-ir splits.

…ase2-part3

kirkrodrigues

Minor edits.

velox/docs/develop/connectors.rst

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6a9e49 and 4c3e4e8.

📒 Files selected for processing (1)

velox/docs/develop/connectors.rst (2 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-08-20T18:00:51.241Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

velox/docs/develop/connectors.rst

📚 Learning: 2025-08-20T16:58:56.256Z

Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.

Applied to files:

velox/docs/develop/connectors.rst

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build with GCC / Ubuntu release with resolve_dependency

🔇 Additional comments (2)

velox/docs/develop/connectors.rst (2)
141-146: Add See‑also cross‑references to concrete cursors.

Link readers to the concrete types that implement the two split kinds.

Apply:
-``BaseClpCursor`` is responsible for preparing pushdown operations, loading splits, filtering data, and returning
-results. Each split type (archive or IR stream) has its own corresponding subclass (``ClpArchiveCursor`` and
-``ClpIrCursor``).
+``BaseClpCursor`` is responsible for preparing pushdown operations, loading splits, filtering data, and returning
+results. Each split type (archive or IR stream) has its own corresponding subclass (``ClpArchiveCursor`` and
+``ClpIrCursor``). See also: ``ClpArchiveCursor``, ``ClpIrCursor``.
157-161: Tighten wording: instantiate a concrete cursor; mention timestamp conversion path introduced in this PR.

Clarify that a concrete subclass is created (not the abstract base), and briefly document the timestamp conversion flow.

Apply:
-When a split is added, a ``BaseClpCursor`` instance is created with the split path and input source (which
-may be either a ``ClpArchiveCursor`` or a ``ClpIrCursor``). The query is parsed and simplified into an AST.
+When a split is added, a concrete cursor (``ClpArchiveCursor`` or ``ClpIrCursor``) is created with the split path and
+input source; both derive from ``BaseClpCursor``. The query is parsed and simplified into an AST. For timestamp
+columns, CLP‑s’ ``TimestampDictionaryWriter`` may convert string timestamps to integers, which are then converted to
+Velox timestamps via ``ClpTimestampsUtils`` (handles precision normalization).
Please confirm the exact utility/function names you want to cite from ClpTimestampsUtils (e.g., precision handling entry points), and whether the flow always uses TimestampDictionaryWriter or only for string timestamps. Based on learnings.

coderabbitai · 2025-09-29T14:10:17Z

velox/docs/develop/connectors.rst

+``ClpConnectorSplit`` also includes a ``type`` property that specifies whether the split is an archive or an IR
+(Internal Representation) stream.
+


🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Fix IR expansion and document allowed type tokens (case‑sensitive).

“IR” is expanded as “Internal Representation”; it should be “Intermediate Representation”. Also, please list the exact literal values accepted by the type property and whether they’re case‑sensitive.

Apply:

-``ClpConnectorSplit`` also includes a ``type`` property that specifies whether the split is an archive or an IR -(Internal Representation) stream. +``ClpConnectorSplit`` also includes a ``type`` property that specifies whether the split is an archive or an IR +(Intermediate Representation) stream. Allowed values: ``archive`` or ``ir`` (case‑sensitive).

To verify the exact tokens/casing against the implementation, run:

🏁 Script executed:

#!/bin/bash # Find split type parsing/validation and accepted literals. rg -nP -C3 '(ClpConnectorSplit|Split).*type' --glob 'velox/**' rg -nP -C3 '\b(archive|ir)\b' --type=cpp --type=cc --type=cxx --type=h --glob 'velox/**' rg -nP -C3 '(enum|class)\s+.*Split.*Type|k[A-Z][A-Za-z]*Split' --glob 'velox/**'

Length of output: 31892

Fix IR expansion and document exact enum values for type.

Update “IR (Internal Representation)” to “IR (Intermediate Representation)” and clarify that the type property takes the SplitType enum values: kArchive (0) or kIr (1).

🤖 Prompt for AI Agents

In velox/docs/develop/connectors.rst around lines 138 to 140, update the description of IR and the `type` property: change “IR (Internal Representation)” to “IR (Intermediate Representation)” and explicitly state that the `type` property uses the SplitType enum with values `kArchive` (0) and `kIr` (1); modify the sentence to read something like: "ClpConnectorSplit also includes a `type` property that specifies whether the split is an archive or an IR (Intermediate Representation) stream, using the SplitType enum: `kArchive` (0) or `kIr` (1)." Ensure punctuation and code formatting match surrounding docs.

kirkrodrigues

Approving docs.

anlowee added 30 commits August 19, 2025 17:56

Extract ClpCursor abstract class and move the original ClpCursor impl…

5af7ec3

…ementation to ClpArchiveCursor

Merge the code

fe8a130

Rename the extracrted ClpCursor to BaseClpCursor, and add the skeleto…

cc82dd7

…n of ClpIrCursor

WIP: implement the ClpIrCursor, one todo is to marshal the row in Clp…

65d3bd1

…VectorLoader

Remove the IR related code and add a subpackage to store archive sear…

d45baf0

…ch related code

Address coderabbitai comments

7f40b7e

Address comments

c0cac3d

Update from part1 branch

c7bd83b

WIP

27ba897

Move the current VectorLoader to Archive-specific

19bad61

Merge branch 'xwei/ir-integration-phase2-part1' into xwei/ir-integrat…

d3fafde

…ion-phase2-part2

Rename and WIP

f53c6c7

Put more archive specific thing into Archive code

e466f68

Merge branch 'xwei/ir-integration-phase2-part1' into xwei/ir-integrat…

8be8eb7

…ion-phase2-part2

WIP

d6ba0da

WIP

221bec6

Merge branch 'presto-0.293-clp-connector' into xwei/ir-integration-ph…

679d338

…ase2-part1

Merge branch 'xwei/ir-integration-phase2-part1' into xwei/ir-integrat…

6e1462d

…ion-phase2-part2

Get it work

f8f75c3

Code clean up

1c091c7

Merge branch 'xwei/ir-integration-phase2-part1' into xwei/ir-integrat…

8ad9b9c

…ion-phase2-part2

Address coderabbitai comments

7f25210

Merge branch 'xwei/ir-integration-phase2-part1' into xwei/ir-integrat…

28e9a6f

…ion-phase2-part2

Minor fix

595e74b

Fix unitest

1b12402

Fix building issues

5db1cf9

Address comments

e71a957

Merge branch 'xwei/ir-integration-phase2-part1' into xwei/ir-integrat…

cc82c62

…ion-phase2-part2

Try out zhihao fix

3ca7361

Apply Zhihao fix

184f9de

WIP

82778cc

anlowee added 2 commits September 15, 2025 19:28

Fix the unit tests

b257938

Add string type timestamp logevent in test ir file

10e222a

anlowee marked this pull request as ready for review September 15, 2025 20:11

coderabbitai bot reviewed Sep 15, 2025

View reviewed changes

Add docs

1928748

coderabbitai bot reviewed Sep 15, 2025

View reviewed changes

velox/docs/develop/connectors.rst Outdated Show resolved Hide resolved

velox/docs/develop/connectors.rst Outdated Show resolved Hide resolved

velox/docs/develop/connectors.rst Outdated Show resolved Hide resolved

Address coderabbitai comments

1982c3b

coderabbitai bot reviewed Sep 16, 2025

View reviewed changes

Address coderabbitai comments

947fded

coderabbitai bot reviewed Sep 16, 2025

View reviewed changes

velox/connectors/clp/search_lib/BaseClpCursor.h Outdated Show resolved Hide resolved

kirkrodrigues requested a review from gibber9809 September 18, 2025 14:10

gibber9809 requested changes Sep 19, 2025

View reviewed changes

Address comments

03d251d

coderabbitai bot reviewed Sep 22, 2025

View reviewed changes

Address comment

da66e99

anlowee requested a review from gibber9809 September 25, 2025 18:19

Address comment

ae97766

gibber9809 previously approved these changes Sep 25, 2025

View reviewed changes

anlowee changed the title ~~feat: Add timestamp type support for IR stream~~ feat: Add timestamp type support for kv-ir splits. Sep 25, 2025

Merge branch 'presto-0.293-clp-connector' into xwei/ir-integration-ph…

a6a9e49

…ase2-part3

kirkrodrigues reviewed Sep 29, 2025

View reviewed changes

velox/docs/develop/connectors.rst Outdated Show resolved Hide resolved

velox/docs/develop/connectors.rst Outdated Show resolved Hide resolved

velox/docs/develop/connectors.rst Outdated Show resolved Hide resolved

Apply suggestions from code review

4c3e4e8

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

anlowee dismissed gibber9809’s stale review via 4c3e4e8 September 29, 2025 14:05

coderabbitai bot reviewed Sep 29, 2025

View reviewed changes

kirkrodrigues approved these changes Sep 29, 2025

View reviewed changes

anlowee merged commit 12521a1 into y-scope:presto-0.293-clp-connector Sep 29, 2025
7 checks passed

anlowee deleted the xwei/ir-integration-phase2-part3 branch September 29, 2025 16:07

		// TODO: IR timestamp support pending; constrain to Unknown to avoid
		// mismatched projections.

		``ClpConnectorSplit`` also includes a ``type`` property that specifies whether the split is an archive or an IR
		(Internal Representation) stream.

feat: Add timestamp type support for kv-ir splits. #30

feat: Add timestamp type support for kv-ir splits. #30

Uh oh!

Conversation

anlowee commented Sep 10, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anlowee commented Sep 15, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gibber9809 left a comment

Choose a reason for hiding this comment

Uh oh!

gibber9809 Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

anlowee Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

anlowee Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

gibber9809 Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

gibber9809 Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Sep 22, 2025

anlowee commented Sep 10, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 10, 2025 •

edited

Loading