Skip to content

Conversation

anlowee
Copy link

@anlowee anlowee commented Aug 20, 2025

Description

This PR mainly refactors the current ClpCursor to extract an interface BaseClpCursor.

The original implmentation of ClpCursor now is moved to archive/ClpArchiveCursor. Later, we will create another package ir and put the IR cursor implementation there.

There is also a term change, the archive term used by original ClpCursor now is renamed to split, to involve IR concept as well.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Passed the unit tests.
Passed the CI.

Summary by CodeRabbit

  • New Features

    • Archive-split support for CLP data sources to query archived inputs.
    • Unified path handling for local and S3 sources with automatic URL/auth selection.
  • Bug Fixes

    • Clearer error messages for unsupported or unknown storage types.
  • Refactor

    • Split-oriented cursor lifecycle and delegated vector materialization while preserving push-down semantics.
  • Chores

    • Build reorganisation and CLP dependency update.

Copy link

coderabbitai bot commented Aug 20, 2025

Walkthrough

Adds split-oriented CLP support: introduces BaseClpCursor and ClpArchiveCursor, moves vector loading to ClpArchiveVectorLoader, updates ClpDataSource to instantiate archive cursors for kArchive splits, reorganizes search_lib into an archive subtarget, and updates the CLP dependency pin.

Changes

Cohort / File(s) Summary
Data source integration
velox/connectors/clp/ClpDataSource.cpp, velox/connectors/clp/ClpDataSource.h
Replace ClpCursor with BaseClpCursor; route kArchive splits to ClpArchiveCursor; unify splitPath/inputSource handling (Filesystem vs Network/S3); remove createVector(...) from data source and delegate vector creation to cursor.
Cursor base abstraction
velox/connectors/clp/search_lib/BaseClpCursor.h, velox/connectors/clp/search_lib/BaseClpCursor.cpp
Add BaseClpCursor split-oriented abstract base: constructor takes splitPath, abstract fetchNext(uint64_t), getNumFilteredRows(), createVector(...), and loadSplit(); implement executeQuery and query preprocessing scaffolding.
Archive cursor and helpers
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h, velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp, velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h, velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp, velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp
Add ClpArchiveCursor deriving from BaseClpCursor with loadSplit(), simplified fetchNext, getNumFilteredRows, and createVector(...) using ClpArchiveVectorLoader. Migrate ClpVectorLoaderClpArchiveVectorLoader (const-correct filtered rows, bounds-checked access). Update ClpQueryRunner include path.
Search lib build changes
velox/connectors/clp/search_lib/CMakeLists.txt
Replace ClpCursor sources with BaseClpCursor sources; remove ClpQueryRunner and ClpVectorLoader from this target; add add_subdirectory(archive) and link clp-s-archive-search publicly.
Archive sublibrary
velox/connectors/clp/search_lib/archive/CMakeLists.txt
New clp-s-archive-search static target containing ClpArchiveCursor, ClpArchiveVectorLoader, ClpQueryRunner, and related sources; set public/private dependencies and C++20.
Dependency pin
CMake/resolve_dependency_modules/clp.cmake
Update CLP git tag from 19cd534e6... to 0de99a9e8....

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Exec as Exec Plan
  participant DS as ClpDataSource
  participant Cur as ClpArchiveCursor
  participant QR as ClpQueryRunner
  participant AR as ArchiveReader

  Exec->>DS: addSplit(split: kArchive, path)
  DS->>Cur: construct(inputSource, splitPath)
  Cur->>AR: loadSplit(): open_packed_streams(splitPath)
  Cur->>QR: init(query, reader, projection)

  Exec->>DS: next(numRows)
  DS->>Cur: fetchNext(numRows)
  Cur-->>DS: rowsFetched (count)
  DS->>Cur: createVector(pool, vectorType, vectorSize)
  Cur->>Cur: build vectors via ClpArchiveVectorLoader
  Cur-->>DS: RowVector
  DS-->>Exec: RowVector
Loading
sequenceDiagram
  autonumber
  actor Exec as Exec Plan
  participant DS as ClpDataSource
  participant B as BaseClpCursor

  Exec->>DS: push-down query + outputColumns
  DS->>B: executeQuery(query, outputColumns)
  B->>B: preprocessQuery(): parse KQL -> transforms (OrOfAndForm, NarrowTypes, ConvertToExists)
  alt parse/transform fails
    B-->>DS: errorCode != Success
  else
    B-->>DS: Success
  end
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@anlowee anlowee marked this pull request as ready for review August 20, 2025 16:29
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)

14-20: CMake target wiring LGTM; minor optional improvement for headers

The target_sources and include dirs look correct. Optional: expose headers via a FILE_SET to improve IDE integration and installation:

target_sources(clp-s-search
  PRIVATE ClpArchiveCursor.cpp ClpQueryRunner.cpp
  PUBLIC FILE_SET HEADERS
    FILES ClpArchiveCursor.h ClpQueryRunner.h
)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)

68-73: Update doc to be split-agnostic.

The comment still says “archive”, but BaseClpCursor now abstracts split handling (archive and future IR). Consider rewording to “split”.

Suggested rewording:
“A query execution interface that manages the lifecycle of a query on a CLP-S split …”

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)

67-73: Don’t set errorCode_ on per-schema failure; it contaminates subsequent calls

Setting errorCode_ = DictionaryNotFound on a single schema_init failure can cause future fetchNext calls to short-circuit even if later schemas succeed. The code already resets errorCode_ on successful schema table load; better to avoid setting an error for per-schema skips.

Apply this diff:

       if (EvaluatedValue::False ==
           queryRunner_->schema_init(currentSchemaId_)) {
         currentSchemaIndex_ += 1;
         currentSchemaTableLoaded_ = false;
-        errorCode_ = ErrorCode::DictionaryNotFound;
         continue;
       }

109-116: Open packed streams and set loaded-flag inside loadSplit; close on failure; update log message

Centralise initialisation and make it exception-safe. This also aligns the terminology with the split-centric model.

Apply this diff:

-  try {
-    archiveReader_->open(
-        get_path_object_for_raw_path(splitPath_), networkAuthOption);
-  } catch (std::exception& e) {
-    VLOG(2) << "Failed to open archive file: " << e.what();
-    return ErrorCode::InternalError;
-  }
+  try {
+    archiveReader_->open(
+        get_path_object_for_raw_path(splitPath_), networkAuthOption);
+    archiveReader_->open_packed_streams();
+    currentSplitLoaded_ = true;
+  } catch (std::exception& e) {
+    // Ensure we don’t leak a partially opened archive.
+    archiveReader_->close();
+    VLOG(2) << "Failed to open split: " << e.what();
+    return ErrorCode::InternalError;
+  }

198-202: Remove duplicate timestamp index evaluation

EvaluateTimestampIndex is already run earlier (Lines 121–126). Re-running it here is redundant and wastes cycles.

Apply this diff:

-  EvaluateTimestampIndex timestamp_index(timestampDict);
-  if (EvaluatedValue::False == timestamp_index.run(expr_)) {
-    VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'";
-    return ErrorCode::InvalidTimestampRange;
-  }

128-131: Improve readability by separating assignment from condition

Using an init-statement with an assignment inside the if hampers readability. Separate the steps.

Apply this diff:

-  if (expr_ = schemaMatch_->run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+  expr_ = schemaMatch_->run(expr_);
+  if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
     VLOG(2) << "No matching schemas for query '" << query_ << "'";
     return ErrorCode::SchemaNotFound;
   }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d78134d and d45baf0.

📒 Files selected for processing (11)
  • velox/connectors/clp/ClpDataSource.cpp (3 hunks)
  • velox/connectors/clp/ClpDataSource.h (2 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (3 hunks)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpVectorLoader.h (3 hunks)
  • velox/connectors/clp/search_lib/archive/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (6)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
  • clp_s (25-29)
  • clp_s (31-34)
  • clp_s (36-38)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (4)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
  • facebook (37-90)
  • vector (76-78)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
  • ClpQueryRunner (43-88)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (8)
  • ClpArchiveCursor (32-36)
  • ClpArchiveCursor (38-42)
  • fetchNext (44-93)
  • fetchNext (44-46)
  • getProjectedColumns (95-102)
  • getProjectedColumns (96-96)
  • loadSplit (104-211)
  • loadSplit (104-104)
velox/connectors/clp/search_lib/BaseClpCursor.h (4)
  • BaseClpCursor (73-140)
  • clp_s (25-29)
  • clp_s (31-34)
  • clp_s (36-38)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (61-79)
  • fetchNext (61-63)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
  • clp_s (25-29)
  • clp_s (31-34)
  • clp_s (36-38)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (3)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (61-79)
  • fetchNext (61-63)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
  • BaseClpCursor (73-140)
velox/connectors/clp/ClpDataSource.cpp (1)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
  • splitPath (67-73)
  • splitPath (67-70)
velox/connectors/clp/search_lib/BaseClpCursor.h (5)
velox/connectors/clp/ClpConnectorSplit.h (1)
  • SplitType (42-47)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
  • fetchNext (44-93)
  • fetchNext (44-46)
  • getProjectedColumns (95-102)
  • getProjectedColumns (96-96)
  • loadSplit (104-211)
  • loadSplit (104-104)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (61-79)
  • fetchNext (61-63)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
  • vector (76-78)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
  • preprocessQuery (41-76)
  • preprocessQuery (41-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (15)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)

17-17: Local include path change looks good

Switching to a local include is consistent with the new archive target exposing its include dir. No functional impact.

velox/connectors/clp/ClpDataSource.h (1)

105-105: Switch to BaseClpCursor pointer aligns with the new split-centric design

Using the base interface here is the right abstraction and keeps the header independent of archive/IR-specific implementations.

velox/connectors/clp/search_lib/CMakeLists.txt (4)

17-18: Good modularization: Base cursor added to target.

Adding BaseClpCursor.{h,cpp} to clp-s-search aligns with the new split-centric design. No issues spotted.


26-27: Include directory scope: confirm consumers see search_lib headers.

Making ${CMAKE_CURRENT_SOURCE_DIR} public is correct for exposing search_lib headers. Ensure any targets including headers as "search_lib/..." or "archive/..." have this directory (and archive/) on their include path via target linkage to clp-s-search or equivalent. Otherwise, prefer absolute includes (velox/connectors/clp/...) in consumers.

Would you like me to scan the repo for relative includes that rely on this particular include path and propose normalizations?


28-29: Subtarget integration looks correct.

Including the archive subdirectory keeps responsibilities separated and the build tidy. Good call.


30-39: Link interface looks complete; C++20 requirement is fine.

Linking clp_s::{archive_reader,clp_dependencies,io,search,search::kql} and velox_vector matches the usage in BaseClpCursor/ClpArchiveCursor. The cxx_std_20 requirement covers the if-with-initializer used in BaseClpCursor.cpp. LGTM.

velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)

41-48: Robust parse failure handling is good.

Parsing KQL via parse_kql_expression and returning InvalidQuerySyntax on nullptr is correct. VLOG(2) level is appropriate for diagnostics.


68-75: Final pass and success path look good.

ConvertToExists as the final pass and returning Success when not EmptyExpr is consistent with the described pipeline.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1)

30-52: Interface and state layout are coherent.

Overrides for fetchNext/getProjectedColumns and split loading hooks match the base contract. Member state mirrors the archive scanning lifecycle. Looks good.

velox/connectors/clp/ClpDataSource.cpp (3)

106-115: S3 URL construction path looks correct; consider guarding against null auth provider.

The FS/Network selection and S3 URL construction make sense. If ClpConfig::kS3 is set, ensure s3AuthProvider_ is not null to avoid a crash.

Optionally add:

 } else if (ClpConfig::StorageType::kS3 == storageType_) {
   inputSource = clp_s::InputSource::Network;
-  splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);
+  VELOX_CHECK_NOT_NULL(s3AuthProvider_, "S3 storage selected but S3 auth provider is null");
+  splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);

117-122: Explicit split-type gating is good.

Routing kArchive splits to ClpArchiveCursor and failing fast on unsupported types provides a clear error surface for future IR integration.


166-170: Passing splitType to ClpVectorLoader is correct.

Threading cursor_->getSplitType() into the loader matches the new split-aware loader signature. LGTM.

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

135-139: Private preprocessQuery hook is correct.

Keeping preprocessQuery private to the base and invoked via executeQuery centralizes validation. No issues.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)

19-20: Include rename aligns with split-centric refactor — LGTM

Header inclusion reflects the new archive-scoped cursor class and is consistent with the PR direction.


95-102: getProjectedColumns fallback is safe and efficient — LGTM

Returning a static empty vector avoids dangling references and extra allocations when the runner isn’t initialised.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)

245-287: Fix lifetime bug: don’t stash string_view to temporary JSON strings

to_json_string(...).value() returns a temporary std::string. Storing std::string_view to it produces a dangling reference. Use std::string to hold the values until set() copies into the vector’s string buffer.

-        std::vector<std::string_view> arrayElements;
-        for (auto arrayElement : array) {
-          arrayElements.emplace_back(
-              simdjson::to_json_string(arrayElement).value());
-        }
+        std::vector<std::string> arrayElements;
+        for (auto value : array) {
+          auto json = simdjson::to_json_string(value);
+          if (json.error()) {
+            VELOX_FAIL(
+                "Failed to serialize JSON array element at row {}: {}",
+                vectorIndex,
+                json.error_message());
+          }
+          arrayElements.emplace_back(std::move(json.value()));
+        }
@@
-        arrayVector->setOffsetAndSize(
-            vectorIndex, elementIndex, arrayElements.size());
-        for (auto& arrayElement : arrayElements) {
-          elements->set(elementIndex++, StringView(arrayElement));
-        }
+        arrayVector->setOffsetAndSize(
+            vectorIndex, elementIndex, arrayElements.size());
+        for (const auto& elem : arrayElements) {
+          elements->set(elementIndex++, StringView(elem));
+        }
velox/connectors/clp/search_lib/BaseClpCursor.h (1)

61-66: Update docstring to reflect “split”-centric design.

The class comment still speaks to “archive” explicitly. Given this is the split-agnostic base, consider updating terminology to “split” (archive or IR) to prevent confusion.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)

99-111: Close the archive on any early return after open (RAII guard).

Once open() succeeds, subsequent early returns currently skip close(), leaking the handle. Add a scope guard after a successful open so all non-success returns close the archive automatically.

Option A (preferred, RAII via folly):

 #include <glog/logging.h>
+#include <folly/ScopeGuard.h>

 ...

   try {
     archiveReader_->open(
         get_path_object_for_raw_path(splitPath_), networkAuthOption);
   } catch (std::exception& e) {
     VLOG(2) << "Failed to open archive file: " << e.what();
     return ErrorCode::InternalError;
   }

+  // Ensure we close the archive on any early return after a successful open.
+  auto closeGuard = folly::makeGuard([&]() noexcept { archiveReader_->close(); });

   auto timestampDict = archiveReader_->get_timestamp_dictionary();

Later, just before returning Success:

   queryRunner_->global_init();
-  return ErrorCode::Success;
+  closeGuard.dismiss();
+  return ErrorCode::Success;

Option B (no extra dependency): explicitly call close() before each early return after open(). For example:

   if (clp_s::EvaluatedValue::False == timestampIndex.run(expr_)) {
     VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'";
-    return ErrorCode::InvalidTimestampRange;
+    archiveReader_->close();
+    return ErrorCode::InvalidTimestampRange;
   }

Repeat for other error returns after open (SchemaNotFound, InternalError in projection loop, empty matchedSchemas, second timestamp check, etc.).


123-127: Improve readability by separating assignment from the condition.

The if-init with assignment is valid but harder to read. Splitting it clarifies intent.

Apply:

-  if (expr_ = schemaMatch_->run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+  expr_ = schemaMatch_->run(expr_);
+  if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
     VLOG(2) << "No matching schemas for query '" << query_ << "'";
     return ErrorCode::SchemaNotFound;
   }

193-197: Drop redundant second timestamp-range evaluation (minor perf/clarity).

The timestamp index is evaluated twice against the same dictionary and expr_. The first check already early-outs before heavier work; the second appears redundant.

Apply:

-  EvaluateTimestampIndex timestamp_index(timestampDict);
-  if (EvaluatedValue::False == timestamp_index.run(expr_)) {
-    VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'";
-    return ErrorCode::InvalidTimestampRange;
-  }

Alternatively, add a brief comment justifying the re-check if there’s a subtle reason to run it again.

♻️ Duplicate comments (9)
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)

128-134: Make filteredRowIndices const to encode read-only semantics

The loader only reads indices. Using std::shared_ptr<const std::vector<uint64_t>> prevents accidental mutation.

 ClpVectorLoader::ClpVectorLoader(
     clp_s::BaseColumnReader* columnReader,
     ColumnType nodeType,
-    std::shared_ptr<std::vector<uint64_t>> filteredRowIndices,
+    std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices,
     ClpConnectorSplit::SplitType splitType)
     : columnReader_(columnReader),
       nodeType_(nodeType),
       filteredRowIndices_(std::move(filteredRowIndices)),
       splitType_(splitType) {}

Note: Update the declaration and member type in ClpVectorLoader.h accordingly (see header comment).

velox/connectors/clp/search_lib/ClpVectorLoader.h (2)

23-23: Fix include path to ClpConnectorSplit.h to match project conventions

Omitting the velox/ prefix will break builds when the include root isn’t the velox folder.

-#include "connectors/clp/ClpConnectorSplit.h"
+#include "velox/connectors/clp/ClpConnectorSplit.h"

45-47: Const-qualify filteredRowIndices and mark splitType_ maybe_unused (until used)

  • filteredRowIndices is read-only; make it std::shared_ptr<const std::vector<uint64_t>>.
  • splitType_ is currently stored but unused; silence potential warnings and document intent.
   ClpVectorLoader(
       clp_s::BaseColumnReader* columnReader,
       ColumnType nodeType,
-      std::shared_ptr<std::vector<uint64_t>> filteredRowIndices,
+      std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices,
       ClpConnectorSplit::SplitType splitType);
@@
-  std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_;
-  ClpConnectorSplit::SplitType splitType_;
+  std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices_;
+  [[maybe_unused]] ClpConnectorSplit::SplitType splitType_;

Optional: add a short TODO near splitType_ describing the planned use (selecting marshaling/population strategies by split type).

Also applies to: 65-67

velox/connectors/clp/search_lib/BaseClpCursor.cpp (3)

17-24: Add missing include for std::istringstream

This TU uses std::istringstream but doesn’t include , causing a compile error on stricter compilers.

 #include "velox/connectors/clp/search_lib/BaseClpCursor.h"
 
+#include <sstream>
+
 #include "clp_s/search/ast/ConvertToExists.hpp"
 #include "clp_s/search/ast/EmptyExpr.hpp"
 #include "clp_s/search/ast/NarrowTypes.hpp"
 #include "clp_s/search/ast/OrOfAndForm.hpp"
 #include "clp_s/search/kql/kql.hpp"

Also applies to: 39-41


31-37: Consider returning ErrorCode from executeQuery to enable call-site short-circuiting

Returning errorCode_ makes it trivial to skip fetchNext() on failure and simplifies tests.

-void BaseClpCursor::executeQuery(
+ErrorCode BaseClpCursor::executeQuery(
     const std::string& query,
     const std::vector<Field>& outputColumns) {
   query_ = query;
   outputColumns_ = outputColumns;
   errorCode_ = preprocessQuery();
+  return errorCode_;
 }

Follow-up: update the declaration in BaseClpCursor.h and call sites.


52-57: If-initializer idiom is valid; consider a tiny readability tweak

The assignment-in-if pattern is fine in C++17. If you prefer, split assignment and condition or add a brief comment—purely optional.

Also applies to: 59-64, 66-71

velox/connectors/clp/search_lib/BaseClpCursor.h (2)

22-23: Add missing standard headers for used types (self-contained header hygiene).

This header uses std::string_view, std::shared_ptr, and uint64_t but doesn’t include <string_view>, , or . Headers should be self-contained to avoid brittle transitive-includes.

Apply this diff near the existing includes:

 #include <string>
 #include <vector>
+#include <cstdint>
+#include <memory>
+#include <string_view>

 #include "clp_s/InputConfig.hpp"
 #include "velox/connectors/clp/ClpConnectorSplit.h"

104-109: Allow derived classes to set split type without mutating members directly.

Expose a protected setter (or make splitType_ const and pass via base ctor). This avoids direct member mutation in derived constructors and clarifies intended usage.

Apply this diff in the protected section:

  /// @return The error code.
   virtual ErrorCode loadSplit() = 0;

+  // Allow derived classes to set the split type explicitly.
+  void setSplitType(ClpConnectorSplit::SplitType t) {
+    splitType_ = t;
+  }

Optionally, assert it’s set once if you expect immutability after construction.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)

206-211: Good: moved open_packed_streams into loadSplit and set loaded flag after success.

This centralises initialisation and reduces leak/consistency risks. With the earlier RAII guard, failure cases will be fully safe as well.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d45baf0 and 0d064d5.

📒 Files selected for processing (7)
  • velox/connectors/clp/ClpDataSource.cpp (3 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (3 hunks)
  • velox/connectors/clp/search_lib/ClpVectorLoader.cpp (2 hunks)
  • velox/connectors/clp/search_lib/ClpVectorLoader.h (4 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.

Applied to files:

  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/BaseClpCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (6)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)

32-33: LGTM: Forward declaration for ClpQueryRunner fixes inverted dependency

This local forward-declare avoids forcing BaseClpCursor to include archive headers. Good dependency hygiene.


36-39: Ensure splitPath’s lifetime is owned downstream

Taking std::string_view is ideal for call-sites; ensure the implementation copies into an owning std::string (as BaseClpCursor does per earlier learning) to avoid dangling references.

Would you like me to scan ClpArchiveCursor.cpp to confirm the splitPath_ storage pattern matches BaseClpCursor’s?

velox/connectors/clp/ClpDataSource.cpp (2)

23-26: LGTM: Normalised include paths to project-rooted form

Consistent velox/…-prefixed includes improve portability across build targets.


166-170: LGTM: Propagating split type into ClpVectorLoader

This wires the split-centric model cleanly and prepares for IR-specific marshaling.

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

68-76: Constructor correctly owns splitPath (good).

Converting std::string_view to std::string in the member initializer avoids dangling references and matches the pattern noted in prior learnings.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)

33-37: Ctor delegation to BaseClpCursor is correct; splitPath ownership is safe.

Forwarding std::string_view to the base that copies into an owning std::string is safe and aligns with the established pattern.

@anlowee anlowee force-pushed the xwei/ir-integration-phase2-part1 branch from 0d064d5 to 081b9ca Compare August 20, 2025 17:51
@anlowee anlowee force-pushed the xwei/ir-integration-phase2-part1 branch from 081b9ca to 7f40b7e Compare August 20, 2025 18:13
@anlowee anlowee requested a review from wraymo August 20, 2025 18:13
Copy link
Member

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have the full code of the IR implementation based on the current refactoring code?

@anlowee
Copy link
Author

anlowee commented Aug 20, 2025

Do we have the full code of the IR implementation based on the current refactoring code?

Not yet, that will be in the next PR. But already have part of them, still need to finish the marshal part.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
velox/connectors/clp/ClpDataSource.cpp (2)

125-131: Surface query preprocessing failures to the pipeline (after executeQuery refactor)

If you adopt returning ErrorCode from BaseClpCursor::executeQuery (see my other comment), check it here and fail early with a user error instead of silently returning no rows later.

Example (assuming executeQuery returns ErrorCode):

-  if (pushDownQuery && !pushDownQuery->empty()) {
-    cursor_->executeQuery(*pushDownQuery, fields_);
-  } else {
-    cursor_->executeQuery("*", fields_);
-  }
+  const auto ec = (pushDownQuery && !pushDownQuery->empty())
+      ? cursor_->executeQuery(*pushDownQuery, fields_)
+      : cursor_->executeQuery("*", fields_);
+  VELOX_USER_CHECK_EQ(
+      ec,
+      search_lib::ErrorCode::Success,
+      "Failed to preprocess KQL query (error code {}): {}",
+      static_cast<int>(ec),
+      pushDownQuery ? *pushDownQuery : "*");

176-177: Micro-optimisation: reserve filteredRows capacity

We know the limit upfront (size). Reserving avoids reallocations on selective workloads.

-  auto filteredRows = std::make_shared<std::vector<uint64_t>>();
+  auto filteredRows = std::make_shared<std::vector<uint64_t>>();
+  filteredRows->reserve(size);
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)

116-121: Close the archive on all early-return failure paths to avoid handle leaks

loadSplit() opens the archive, then returns early on several error branches before currentSplitLoaded_ is set, skipping close() in the dtor. Ensure archiveReader_->close() is called before each early return.

Minimal fix (repeat close() before each return shown below):

   EvaluateTimestampIndex timestampIndex(timestampDict);
   if (clp_s::EvaluatedValue::False == timestampIndex.run(expr_)) {
     VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'";
-    return ErrorCode::InvalidTimestampRange;
+    archiveReader_->close();
+    return ErrorCode::InvalidTimestampRange;
   }
 
   schemaMatch_ = std::make_shared<SchemaMatch>(schemaTree, schemaMap);
   if (expr_ = schemaMatch_->run(expr_);
       std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
     VLOG(2) << "No matching schemas for query '" << query_ << "'";
-    return ErrorCode::SchemaNotFound;
+    archiveReader_->close();
+    return ErrorCode::SchemaNotFound;
   }
@@
   } catch (TraceableException& e) {
     VLOG(2) << e.what();
-    return ErrorCode::InternalError;
+    archiveReader_->close();
+    return ErrorCode::InternalError;
   }
@@
   if (matchedSchemas_.empty()) {
-    return ErrorCode::SchemaNotFound;
+    archiveReader_->close();
+    return ErrorCode::SchemaNotFound;
   }
@@
   if (EvaluatedValue::False == timestamp_index.run(expr_)) {
     VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'";
-    return ErrorCode::InvalidTimestampRange;
+    archiveReader_->close();
+    return ErrorCode::InvalidTimestampRange;
   }

Alternatively, add an RAII scope guard (e.g., folly::ScopeGuard) right after open() and release it only after currentSplitLoaded_ is set. I can provide that patch if preferred.

Also applies to: 123-127, 173-176, 189-191, 194-197


136-139: Spellcheck log message (“Cannot” vs “Can not”)

Minor polish for operator logs.

-        VLOG(2) << "Can not tokenize invalid column: '" << column.name << "'";
+        VLOG(2) << "Cannot tokenize invalid column: '" << column.name << "'";
♻️ Duplicate comments (12)
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)

24-24: Include is necessary for ColumnType enum values.

Based on previous learnings, this include is required because the implementation uses specific ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in switch statements. While the header has a forward declaration, the full definition from BaseClpCursor.h is needed here.

velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)

14-20: Consider using velox_add_library consistently with project conventions.

The target name clp-s-archive-search uses a hyphenated format instead of the underscore convention typically used in Velox. Consider using clp_s_archive_search for consistency with other Velox library targets.

Apply this diff to align with Velox naming conventions:

-velox_add_library(
-  clp-s-archive-search
+velox_add_library(
+  clp_s_archive_search
   STATIC
   ClpArchiveCursor.cpp
   ClpArchiveCursor.h
   ClpQueryRunner.cpp
   ClpQueryRunner.h)
velox/connectors/clp/search_lib/ClpVectorLoader.h (2)

42-46: Consider const-correctness for filteredRowIndices parameter.

Since the vector is only read from and never modified, consider making the parameter const to enforce read-only semantics and prevent accidental mutations.

Apply this diff to improve const-correctness:

   ClpVectorLoader(
       clp_s::BaseColumnReader* columnReader,
       ColumnType nodeType,
-      std::shared_ptr<std::vector<uint64_t>> filteredRowIndices,
+      std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices,
       ClpConnectorSplit::SplitType splitType);

66-66: Mark unused splitType_ member appropriately.

The splitType_ member is currently unused in the implementation. Consider adding a comment or [[maybe_unused]] attribute to clarify its intended future use.

Apply one of these approaches:

-  ClpConnectorSplit::SplitType splitType_;
+  [[maybe_unused]] ClpConnectorSplit::SplitType splitType_;

Or add a comment explaining its planned use:

-  ClpConnectorSplit::SplitType splitType_;
+  ClpConnectorSplit::SplitType splitType_; // TODO: Used for marshalling strategy selection between archive and IR
velox/connectors/clp/search_lib/BaseClpCursor.cpp (3)

33-39: Return an ErrorCode from executeQuery so callers can short-circuit on failure

Right now, executeQuery sets errorCode_ but offers no programmatic signal to the caller. Returning ErrorCode lets ClpDataSource (and future IR paths) fail fast without entering fetchNext on invalid/unsatisfiable queries. This also simplifies testing.

Apply this local change:

-void BaseClpCursor::executeQuery(
+ErrorCode BaseClpCursor::executeQuery(
     const std::string& query,
     const std::vector<Field>& outputColumns) {
   query_ = query;
   outputColumns_ = outputColumns;
   errorCode_ = preprocessQuery();
+  return errorCode_;
 }

And in BaseClpCursor.h, update the declaration accordingly (see my comment there for a full header diff), and consider checking the return in ClpDataSource::addSplit before proceeding.


54-56: If-initializer idiom is valid; consider readability tweak

The “if (expr_ = pass.run(expr_); dynamic_pointer_cast(expr_))” pattern is correct (C++17), but slightly dense. Consider either a short comment explaining the combined assign+check intent or splitting into two statements for clarity.

Also applies to: 61-63, 69-71


49-52: Deduplicate the repeated “logically false” branches

The four identical “logically false” branches can be DRY’d to a small helper to keep this routine compact.

Example refactor (local to this function):

 ErrorCode BaseClpCursor::preprocessQuery() {
   auto queryStream = std::istringstream(query_);
   expr_ = kql::parse_kql_expression(queryStream);
   if (nullptr == expr_) {
     VLOG(2) << "Failed to parse query '" << query_ << "'";
     return ErrorCode::InvalidQuerySyntax;
   }
-
-  if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
-    VLOG(2) << "Query '" << query_ << "' is logically false";
-    return ErrorCode::LogicalError;
-  }
+  auto isFalse = [this]() {
+    return std::dynamic_pointer_cast<EmptyExpr>(expr_) != nullptr;
+  };
+  auto logFalseAnd = [this]() {
+    VLOG(2) << "Query '" << query_ << "' is logically false";
+    return ErrorCode::LogicalError;
+  };
+  if (isFalse()) {
+    return logFalseAnd();
+  }
 
   OrOfAndForm standardizePass;
-  if (expr_ = standardizePass.run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
-    VLOG(2) << "Query '" << query_ << "' is logically false";
-    return ErrorCode::LogicalError;
-  }
+  if (expr_ = standardizePass.run(expr_); isFalse()) {
+    return logFalseAnd();
+  }
 
   NarrowTypes narrowPass;
-  if (expr_ = narrowPass.run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
-    VLOG(2) << "Query '" << query_ << "' is logically false";
-    return ErrorCode::LogicalError;
-  }
+  if (expr_ = narrowPass.run(expr_); isFalse()) {
+    return logFalseAnd();
+  }
 
   ConvertToExists convertPass;
-  if (expr_ = convertPass.run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
-    VLOG(2) << "Query '" << query_ << "' is logically false";
-    return ErrorCode::LogicalError;
-  }
+  if (expr_ = convertPass.run(expr_); isFalse()) {
+    return logFalseAnd();
+  }
 
   return ErrorCode::Success;
 }

Also applies to: 56-59, 63-66, 70-73

velox/connectors/clp/ClpDataSource.cpp (2)

103-107: Defensive: null-check the split cast before dereference

clpSplit is dereferenced unconditionally; if a non-CLP split slips through, this will crash.

 void ClpDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) {
   auto clpSplit = std::dynamic_pointer_cast<ClpConnectorSplit>(split);
+  VELOX_CHECK_NOT_NULL(clpSplit, "Expected ClpConnectorSplit");
 
   std::string splitPath = clpSplit->path_;

110-113: Guard S3 path construction behind a non-null auth provider

Avoid a null deref when storageType_ is kS3 and the provider isn’t configured.

   } else if (ClpConfig::StorageType::kS3 == storageType_) {
     inputSource = clp_s::InputSource::Network;
+    VELOX_CHECK_NOT_NULL(
+        s3AuthProvider_, "S3 storage requires a configured S3 auth provider");
     splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);
velox/connectors/clp/search_lib/BaseClpCursor.h (3)

85-87: Return an ErrorCode from executeQuery (and mark nodiscard)

Enable call sites to branch immediately on parse/validation errors; nodiscard nudges correct handling.

-  void executeQuery(
+  [[nodiscard]] ErrorCode executeQuery(
       const std::string& query,
       const std::vector<Field>& outputColumns);

95-98: Prefer a non-owning reference for the output collection

Passing filteredRowIndices as shared_ptr suggests shared ownership where a simple out-parameter suffices. Using std::vector<uint64_t>& reduces heap traffic and clarifies ownership. This change ripples to ClpArchiveCursor and ClpQueryRunner.

-  virtual uint64_t fetchNext(
-      uint64_t numRows,
-      const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices) = 0;
+  virtual uint64_t fetchNext(
+      uint64_t numRows,
+      std::vector<uint64_t>& filteredRowIndices) = 0;

Follow-ups: update ClpArchiveCursor::{fetchNext,getProjectedColumns} signatures and their definitions, ClpQueryRunner::fetchNext, and ClpDataSource::next call site.


106-111: Split type plumbing: add a protected setter or ctor param

Derived cursors (e.g., IR) will need to set splitType_. Add a protected setter or overload the base ctor to take SplitType so derived classes don’t have to mutate internals directly. Also, mark the accessor nodiscard.

-  ClpConnectorSplit::SplitType getSplitType() const {
+  [[nodiscard]] ClpConnectorSplit::SplitType getSplitType() const {
     return splitType_;
   }
+
+ protected:
+  void setSplitType(ClpConnectorSplit::SplitType t) { splitType_ = t; }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0d064d5 and c0cac3d.

📒 Files selected for processing (10)
  • velox/connectors/clp/ClpDataSource.cpp (3 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (3 hunks)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ClpVectorLoader.cpp (2 hunks)
  • velox/connectors/clp/search_lib/ClpVectorLoader.h (4 hunks)
  • velox/connectors/clp/search_lib/archive/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/search_lib/ClpVectorLoader.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
  • velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/search_lib/ClpVectorLoader.cpp
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/ClpVectorLoader.h
  • velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
  • velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-03-10T05:55:18.722Z
Learnt from: anlowee
PR: y-scope/velox#1
File: CMakeLists.txt:458-483
Timestamp: 2025-03-10T05:55:18.722Z
Learning: When contributing to the Velox project, prefer adding new code on top of existing scripts rather than restructuring the original code organization, especially for dependency management.

Applied to files:

  • velox/connectors/clp/search_lib/archive/CMakeLists.txt
📚 Learning: 2025-07-11T15:21:09.658Z
Learnt from: gibber9809
PR: y-scope/velox#15
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:22-22
Timestamp: 2025-07-11T15:21:09.658Z
Learning: When analyzing include dependencies, pay special attention to types used in template parameters and enum values like `clp_s::NodeType`, as these may not be immediately obvious in a surface-level review but are essential for compilation.

Applied to files:

  • velox/connectors/clp/search_lib/ClpVectorLoader.cpp
🧬 Code Graph Analysis (6)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (5)
velox/connectors/clp/search_lib/ClpVectorLoader.h (2)
  • clp_s (29-31)
  • facebook (33-72)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
  • facebook (37-90)
  • ClpQueryRunner (43-88)
  • vector (76-78)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
  • clp_s (27-29)
  • clp_s (31-33)
  • BaseClpCursor (68-135)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (8)
  • ClpArchiveCursor (33-37)
  • ClpArchiveCursor (39-43)
  • fetchNext (45-88)
  • fetchNext (45-47)
  • getProjectedColumns (90-97)
  • getProjectedColumns (91-91)
  • loadSplit (99-212)
  • loadSplit (99-99)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (61-79)
  • fetchNext (61-63)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
  • clp_s (27-29)
  • clp_s (31-33)
velox/connectors/clp/ClpDataSource.cpp (1)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
  • splitPath (67-73)
  • splitPath (67-70)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (5)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
  • clp_s (21-23)
  • clp_s (25-28)
velox/connectors/clp/search_lib/ClpVectorLoader.h (1)
  • clp_s (29-31)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (3)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
  • clp_s (27-29)
  • clp_s (31-33)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (61-79)
  • fetchNext (61-63)
velox/connectors/clp/search_lib/ClpVectorLoader.h (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
  • ColumnType (48-137)
velox/connectors/clp/search_lib/BaseClpCursor.h (5)
velox/connectors/clp/ClpConnectorSplit.h (1)
  • SplitType (42-47)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (61-79)
  • fetchNext (61-63)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
  • fetchNext (45-88)
  • fetchNext (45-47)
  • getProjectedColumns (90-97)
  • getProjectedColumns (91-91)
  • loadSplit (99-212)
  • loadSplit (99-99)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
  • vector (76-78)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
  • preprocessQuery (41-76)
  • preprocessQuery (41-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (11)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)

17-17: LGTM! Header path correctly updated for the archive reorganization.

The include path change from the root search_lib to archive-specific directory aligns with the PR's refactoring to move archive-related functionality into a dedicated archive subdirectory.

velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)

125-133: Constructor correctly wired with splitType parameter.

The new splitType parameter is properly integrated into the constructor signature and member initialization. The implementation correctly stores the split type for future use in marshalling decisions between archive and IR formats.

velox/connectors/clp/search_lib/ClpVectorLoader.h (3)

19-19: Include added for simdjson support.

The simdjson.h include is correctly added to support the thread_local arrayParser_ member used for JSON array processing in the implementation.


23-23: Include path correctly uses velox prefix.

The ClpConnectorSplit.h include path correctly uses the full velox/ prefix, maintaining consistency with the project's include conventions.


35-35: Forward declaration properly added for ColumnType.

The forward declaration enables the header to use ColumnType as a parameter type without including the full definition, reducing compilation dependencies.

velox/connectors/clp/search_lib/CMakeLists.txt (3)

17-18: Source file transition correctly implemented.

The replacement of ClpCursor.* with BaseClpCursor.* files correctly reflects the interface extraction described in the PR objectives.


26-26: Archive subdirectory properly integrated.

The add_subdirectory(archive) correctly incorporates the new archive-specific functionality into the build system.


28-28: Linking simplified with archive target.

The link command correctly simplifies to use the new clp-s-archive-search target instead of individual library dependencies, which improves maintainability.

velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)

42-47: Confirm parse_kql_expression never throws; otherwise wrap with try/catch

If kql::parse_kql_expression can throw (e.g., on malformed input), preprocessQuery would allow exceptions to propagate. If it’s documented to return nullptr on failure and not throw, we’re fine. Otherwise, add a try/catch here mirroring the logging style below and return InvalidQuerySyntax.

Would you verify the exception behaviour of parse_kql_expression in clp_s? If it may throw, I can propose a concrete try/catch patch.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)

193-197: Is the second timestamp-index check necessary?

We run EvaluateTimestampIndex before and after schema matching. If the transform cannot introduce new timestamp contradictions, the second pass may be redundant work on large dictionaries.

Would you confirm if schemaMatch_->run(expr_) can change timestamp predicates? If not, we can drop the second check to save cycles.


33-37: Constructor path lifetime: safe as implemented (ack)

Per the retrieved learning, BaseClpCursor copies std::string_view into an owning std::string member. No dangling risk here.

Using std::string_view in the ctor and copying into splitPath_ strikes the right balance between API ergonomics and safety.

@anlowee anlowee marked this pull request as draft August 21, 2025 14:14
@anlowee anlowee marked this pull request as ready for review September 4, 2025 18:59
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)

60-61: Avoid shadowing Velox’s VectorPtr alias in template parameter.

Rename the template parameter to reduce confusion with facebook::velox::VectorPtr.

Apply:

-template <typename T, typename VectorPtr>
-void populateData(RowSet rows, VectorPtr vector);
+template <typename T, typename VecPtr>
+void populateData(RowSet rows, VecPtr vector);
♻️ Duplicate comments (10)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)

17-17: Include path correction LGTM; aligns with “use velox path”

Matches prior guidance and the new archive layout.

velox/connectors/clp/ClpDataSource.h (1)

23-23: Keep BaseClpCursor.h included here

Required because this header stores std::vector<search_lib::Field>; forward declaration is insufficient. Agree with prior resolution.

velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)

34-35: Ensure .cpp includes BaseClpCursor.h for ColumnType.

ColumnType is only forward-declared here. The .cpp must include BaseClpCursor.h to use its enum values in switches.

velox/connectors/clp/ClpDataSource.cpp (1)

106-115: Null-check the cast and guard S3 auth provider.

Avoid potential nullptr deref on non-CLP splits and on missing S3 auth provider.

 void ClpDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) {
   auto clpSplit = std::dynamic_pointer_cast<ClpConnectorSplit>(split);
+  VELOX_CHECK_NOT_NULL(clpSplit, "Expected ClpConnectorSplit");

   std::string splitPath = clpSplit->path_;
   clp_s::InputSource inputSource;
   if (ClpConfig::StorageType::kFs == storageType_) {
     inputSource = clp_s::InputSource::Filesystem;
   } else if (ClpConfig::StorageType::kS3 == storageType_) {
     inputSource = clp_s::InputSource::Network;
+    VELOX_CHECK_NOT_NULL(
+        s3AuthProvider_, "S3 storage requires a configured S3 auth provider");
     splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);
   } else {
     VELOX_UNREACHABLE();
   }
velox/connectors/clp/search_lib/BaseClpCursor.h (1)

70-76: Make base cursor non-copyable/non-movable and dtor noexcept.

Prevents accidental aliasing of IO state; clarifies exception guarantees.

   explicit BaseClpCursor(
       clp_s::InputSource inputSource,
       std::string_view splitPath)
       : errorCode_(ErrorCode::QueryNotInitialized),
         inputSource_(inputSource),
         splitPath_(std::string(splitPath)) {}
-  virtual ~BaseClpCursor() = default;
+  virtual ~BaseClpCursor() noexcept = default;
+
+  BaseClpCursor(const BaseClpCursor&) = delete;
+  BaseClpCursor& operator=(const BaseClpCursor&) = delete;
+  BaseClpCursor(BaseClpCursor&&) = delete;
+  BaseClpCursor& operator=(BaseClpCursor&&) = delete;
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)

38-41: Disable copy/move; cursor manages shared IO state.

Prevents accidental aliasing/double-closing of ArchiveReader.

   explicit ClpArchiveCursor(
       clp_s::InputSource inputSource,
       std::string_view splitPath);
   ~ClpArchiveCursor() override;
+
+  ClpArchiveCursor(const ClpArchiveCursor&) = delete;
+  ClpArchiveCursor& operator=(const ClpArchiveCursor&) = delete;
+  ClpArchiveCursor(ClpArchiveCursor&&) = delete;
+  ClpArchiveCursor& operator=(ClpArchiveCursor&&) = delete;

68-69: Hint to callers: mark accessor nodiscard.

The projected-columns ref should not be ignored.

-  const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;
+  [[nodiscard]] const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)

34-38: Constructor properly handles string lifetime.

Based on the previous learning about BaseClpCursor, the base class constructor safely converts the std::string_view to an owned std::string member, so there are no lifetime concerns here.


120-122: Fix C++20 designated initializers for C++17 compatibility.

The project uses C++17 standard, but this code uses C++20 designated initializers which will fail to compile.

Apply this fix for C++17 compatibility:

-  auto networkAuthOption = inputSource_ == InputSource::Filesystem
-      ? NetworkAuthOption{.method = AuthMethod::None}
-      : NetworkAuthOption{.method = AuthMethod::S3PresignedUrlV4};
+  NetworkAuthOption networkAuthOption{};
+  networkAuthOption.method = (inputSource_ == InputSource::Filesystem)
+      ? AuthMethod::None
+      : AuthMethod::S3PresignedUrlV4;

40-44: Resource management issue: archive may leak on error paths.

The destructor only closes the archive if currentSplitLoaded_ is true, but in loadSplit(), the archive can be opened (Line 125-126) and then return early on error before setting currentSplitLoaded_ to true (Line 227). This leaves the archive open without proper cleanup.

Consider either:

  1. Set currentSplitLoaded_ = true immediately after successful open() call
  2. Add RAII wrapper or ensure cleanup on all error paths
  3. Have ArchiveReader provide an isOpen() API as suggested in the previous review

Apply this fix to ensure proper cleanup:

   try {
     archiveReader_->open(
         get_path_object_for_raw_path(splitPath_), networkAuthOption);
+    currentSplitLoaded_ = true;  // Mark as loaded immediately after successful open
   } catch (std::exception& e) {
     VLOG(2) << "Failed to open archive file: " << e.what();
     return ErrorCode::InternalError;
   }

Then move the stream opening earlier:

   archiveReader_->read_array_dictionary();
+  archiveReader_->open_packed_streams();
 
   currentSchemaIndex_ = 0;
   currentSchemaTableLoaded_ = false;
-
-  archiveReader_->open_packed_streams();
-  currentSplitLoaded_ = true;

Also applies to: 226-227

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c0cac3d and 1c091c7.

📒 Files selected for processing (11)
  • CMake/resolve_dependency_modules/clp.cmake (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (3 hunks)
  • velox/connectors/clp/ClpDataSource.h (2 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (3 hunks)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/archive/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (8 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2 hunks)
  • velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
📚 Learning: 2025-03-10T05:55:18.722Z
Learnt from: anlowee
PR: y-scope/velox#1
File: CMakeLists.txt:458-483
Timestamp: 2025-03-10T05:55:18.722Z
Learning: When contributing to the Velox project, prefer adding new code on top of existing scripts rather than restructuring the original code organization, especially for dependency management.

Applied to files:

  • velox/connectors/clp/search_lib/archive/CMakeLists.txt
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
  • velox/connectors/clp/ClpDataSource.h
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
  • velox/connectors/clp/ClpDataSource.h
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
  • velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
  • velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T17:50:58.584Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/ClpDataSource.h:23-23
Timestamp: 2025-08-20T17:50:58.584Z
Learning: In C++ headers, you cannot forward-declare types that are used as value members (like std::vector<Field> fields_) because the compiler needs the complete type definition to calculate the containing class size. Forward declarations only work for pointers and references to types.

Applied to files:

  • velox/connectors/clp/ClpDataSource.h
🧬 Code graph analysis (6)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)
  • ClpArchiveVectorLoader (125-131)
velox/vector/LazyVector.h (1)
  • VectorLoader (177-210)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (5)
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
  • clp_s (27-29)
  • clp_s (31-33)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)
  • clp_s (28-30)
  • facebook (32-69)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
  • facebook (37-90)
  • ClpQueryRunner (43-88)
  • vector (76-78)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (14)
  • ClpArchiveCursor (34-38)
  • ClpArchiveCursor (40-44)
  • fetchNext (46-90)
  • fetchNext (46-46)
  • getNumFilteredRows (92-94)
  • getNumFilteredRows (92-92)
  • createVector (96-108)
  • createVector (96-99)
  • loadSplit (119-232)
  • loadSplit (119-119)
  • getProjectedColumns (110-117)
  • getProjectedColumns (111-111)
  • createVectorHelper (234-264)
  • createVectorHelper (234-238)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (60-78)
  • fetchNext (60-62)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)
  • ClpArchiveVectorLoader (39-67)
velox/connectors/clp/ClpDataSource.cpp (2)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
  • splitPath (67-73)
  • splitPath (67-70)
velox/connectors/clp/ClpDataSource.h (1)
  • completedRows_ (81-81)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
  • clp_s (27-29)
  • clp_s (31-33)
  • currentSplitLoaded_ (122-137)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (4)
  • clp_s (21-23)
  • clp_s (25-28)
  • readerIndex_ (65-65)
  • getNumFilteredRows (45-58)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
  • vector (76-78)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (60-78)
  • fetchNext (60-62)
velox/connectors/clp/search_lib/BaseClpCursor.h (4)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
  • fetchNext (46-90)
  • fetchNext (46-46)
  • getNumFilteredRows (92-94)
  • getNumFilteredRows (92-92)
  • loadSplit (119-232)
  • loadSplit (119-119)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (60-78)
  • fetchNext (60-62)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)
  • getNumFilteredRows (45-58)
  • clp_s (21-23)
  • clp_s (25-28)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
  • preprocessQuery (41-76)
  • preprocessQuery (41-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (19)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)

17-17: No stale includes remain – verified there are no remaining velox/connectors/clp/search_lib/ClpQueryRunner.h includes.

velox/connectors/clp/ClpDataSource.h (1)

86-86: Verified cursor construction and no createVector dependencies

ClpDataSource.cpp correctly instantiates ClpArchiveCursor for archive splits (line 119), and no external calls to the removed createVector remain.

velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)

14-22: Target definition looks good.

New archive subtarget and sources are wired up cleanly.

velox/connectors/clp/search_lib/CMakeLists.txt (3)

17-18: Switch to BaseClpCursor sources looks correct.

Good replacement of ClpCursor.* with BaseClpCursor.*.


24-24: Subdirectory integration is fine.

archive subproject is added before linking and scopes dependencies properly.


26-26: Linkage validated; no stale ClpVectorLoader/ClpQueryRunner references remain

velox/connectors/clp/ClpDataSource.cpp (3)

24-25: LGTM: include path normalization.

Using project-absolute headers is consistent with the rest of the file.


117-123: Archive routing looks correct; IR rejection is explicit.

VELOCX_UNSUPPORTED for non-archive split types is appropriate pending IR work.


136-144: Clarify completedRows_ accounting and use static_pointer_cast

  • Confirm and document whether completedRows_ is meant to track fetched rows (rowsScanned) or output rows (rowsFiltered) and update accordingly:
    - completedRows_ += rowsScanned;
    + completedRows_ += rowsFiltered;  // or retain rowsScanned if tracking fetched rows
  • Replace the RTTI‐based cast with a compile-time cast:
    - return std::dynamic_pointer_cast<RowVector>(
    + return std::static_pointer_cast<RowVector>(
          cursor_->createVector(pool_, outputType_, rowsFiltered));
velox/connectors/clp/search_lib/BaseClpCursor.h (3)

19-22: LGTM: added standard headers.

Adding and <string_view> resolves transitive-dependency brittleness.


63-68: Docstring update reads well.

The split-centric phrasing (archive or IR) matches the new design.


24-26: Remove unused dependency on ClpConnectorSplit.h.

This header doesn’t reference ClpConnectorSplit; drop the include to decouple the interface.

-#include "velox/connectors/clp/ClpConnectorSplit.h"

Likely an incorrect or invalid review comment.

velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (4)

143-143: Good improvement using bounds-checked access.

The change from operator[] to at() for accessing filteredRowIndices_ provides runtime bounds checking, which helps catch index-out-of-bounds errors early and improves safety.

Also applies to: 180-180


128-128: LGTM! Constructor now properly accepts const reference.

The change to const std::shared_ptr<std::vector<uint64_t>> is a good improvement that prevents unintended modifications to the shared pointer itself while still allowing the underlying vector to be modified through the shared pointer.


125-131: Class migration from ClpVectorLoader to ClpArchiveVectorLoader completed successfully.

All method implementations have been properly updated to use the new ClpArchiveVectorLoader namespace, including the constructor, populateData, populateTimestampData, and loadInternal methods. The explicit template instantiations at the end of the file have also been correctly updated.

Also applies to: 134-156, 159-201, 203-308, 311-338


244-244: Add micro-benchmarks for bounds-checked access
No existing benchmarks cover ClpArchiveVectorLoader’s use of at() in the hot loop. Please add or run performance tests to confirm the bounds-check overhead is acceptable.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)

46-90: Clear and correct implementation of fetchNext.

The method properly initializes state, handles error conditions, and iterates through matched schemas with appropriate loading and filtering logic.


92-108: Well-designed vector creation with proper validation.

The createVector method properly validates that projected columns match output columns before delegating to the helper function. Good defensive programming.


234-264: Clever recursive ROW vector construction with LazyVector support.

The createVectorHelper implementation elegantly handles nested ROW types through recursion and properly wraps leaf vectors with LazyVector for deferred loading. The use of readerIndex_ to track position across recursive calls is well-designed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)

64-69: Don’t poison the cursor with an error for a single schema-init failure.

Setting errorCode_ here causes subsequent fetchNext() calls to short‑circuit even if other schemas are fine. Just skip the schema and continue; set an error only if all schemas fail.

Apply:

-        errorCode_ = ErrorCode::DictionaryNotFound;

124-131: Close the archive on all failure paths (RAII guard).

loadSplit() opens the archive, then returns early on several errors without closing. Since the dtor closes only when currentSplitLoaded_ is true, handles can leak.

Apply:

@@
-  try {
-    archiveReader_->open(
-        get_path_object_for_raw_path(splitPath_), networkAuthOption);
+  try {
+    archiveReader_->open(
+        get_path_object_for_raw_path(splitPath_), networkAuthOption);
+    auto closeOnError = true;
+    auto cleanup = folly::makeGuard([&] {
+      if (closeOnError) {
+        archiveReader_->close();
+      }
+    });
@@
-    return ErrorCode::InvalidTimestampRange;
+    return ErrorCode::InvalidTimestampRange;
   }
@@
-    return ErrorCode::SchemaNotFound;
+    return ErrorCode::SchemaNotFound;
   }
@@
-        return ErrorCode::InternalError;
+        return ErrorCode::InternalError;
       }
@@
-    return ErrorCode::SchemaNotFound;
+    return ErrorCode::SchemaNotFound;
   }
@@
-    return ErrorCode::InvalidTimestampRange;
+    return ErrorCode::InvalidTimestampRange;
   }
@@
-  archiveReader_->open_packed_streams();
-  currentSplitLoaded_ = true;
+  archiveReader_->open_packed_streams();
+  currentSplitLoaded_ = true;
+  cleanup.dismiss(); // success: keep archive open

And add the include:

 #include <glog/logging.h>
+#include <folly/ScopeGuard.h>

Also applies to: 139-147, 159-160, 194-196, 209-211, 215-217, 219-231


143-146: If-initializer assignment is valid; consider readability.

Optional: split assignment and condition for clarity.

Apply:

-  if (expr_ = schemaMatch_->run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+  expr_ = schemaMatch_->run(expr_);
+  if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {

213-217: Naming consistency.

Use camelCase consistently: rename timestamp_index → timestampIndex.

-  EvaluateTimestampIndex timestamp_index(timestampDict);
-  if (EvaluatedValue::False == timestamp_index.run(expr_)) {
+  EvaluateTimestampIndex timestampIndex(timestampDict);
+  if (EvaluatedValue::False == timestampIndex.run(expr_)) {
velox/connectors/clp/ClpDataSource.cpp (1)

124-129: Propagate query preprocessing failure immediately.

BaseClpCursor tracks an ErrorCode but there’s no check here; callers won’t see failures until later as “no rows”. Add a status accessor and check it.

   auto pushDownQuery = clpSplit->kqlQuery_;
   if (pushDownQuery && !pushDownQuery->empty()) {
     cursor_->executeQuery(*pushDownQuery, fields_);
   } else {
     cursor_->executeQuery("*", fields_);
   }
+
+  // Fail early if query preprocessing failed.
+  VELOX_CHECK_EQ(
+      search_lib::ErrorCode::Success,
+      cursor_->getErrorCode(),
+      "CLP query initialization failed (error code {}).",
+      static_cast<int>(cursor_->getErrorCode()));

Add the accessor to BaseClpCursor (see separate comment).

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

84-87: Expose execution status to callers.

Add a trivial accessor so addSplit() can fail early on invalid queries.

   void executeQuery(
       const std::string& query,
       const std::vector<Field>& outputColumns);
+
+  [[nodiscard]] ErrorCode getErrorCode() const noexcept {
+    return errorCode_;
+  }
♻️ Duplicate comments (9)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)

119-127: Avoid C++20-only designated initializers unless the whole build targets C++20.

Either keep the target on C++17 and use a compatible init, or make C++20 a PUBLIC feature for all dependants. Safer is to stick to C++17 here.

Apply:

-  auto networkAuthOption = inputSource_ == InputSource::Filesystem
-      ? NetworkAuthOption{.method = AuthMethod::None}
-      : NetworkAuthOption{.method = AuthMethod::S3PresignedUrlV4};
+  NetworkAuthOption networkAuthOption{};
+  networkAuthOption.method = (inputSource_ == InputSource::Filesystem)
+      ? AuthMethod::None
+      : AuthMethod::S3PresignedUrlV4;
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)

34-40: Return the status from executeQuery for easier short‑circuiting.

Returning ErrorCode improves ergonomics at call sites.

-void BaseClpCursor::executeQuery(
+ErrorCode BaseClpCursor::executeQuery(
     const std::string& query,
     const std::vector<Field>& outputColumns) {
   query_ = query;
   outputColumns_ = outputColumns;
   errorCode_ = preprocessQuery();
+  return errorCode_;
 }

Note: update the declaration and call sites accordingly.

velox/connectors/clp/ClpDataSource.cpp (2)

102-104: Null-check the split cast before dereference.

Avoid a crash if a non-CLP split slips through.

 void ClpDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) {
   auto clpSplit = std::dynamic_pointer_cast<ClpConnectorSplit>(split);
+  VELOX_CHECK_NOT_NULL(clpSplit, "Expected ClpConnectorSplit");

109-112: Guard against missing S3 auth provider in S3 mode.

Fail fast with a clear message instead of NPE.

   } else if (ClpConfig::StorageType::kS3 == storageType_) {
     inputSource = clp_s::InputSource::Network;
+    VELOX_CHECK_NOT_NULL(
+        s3AuthProvider_, "S3 storage requires a configured S3 auth provider");
     splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);
velox/connectors/clp/search_lib/BaseClpCursor.h (2)

70-76: Make the base cursor non-copyable/non-movable and dtor noexcept.

Prevents accidental aliasing of IO state; clarifies exception guarantees.

   explicit BaseClpCursor(
       clp_s::InputSource inputSource,
       std::string_view splitPath)
       : errorCode_(ErrorCode::QueryNotInitialized),
         inputSource_(inputSource),
         splitPath_(std::string(splitPath)) {}
-  virtual ~BaseClpCursor() = default;
+  virtual ~BaseClpCursor() noexcept = default;
+
+  BaseClpCursor(const BaseClpCursor&) = delete;
+  BaseClpCursor& operator=(const BaseClpCursor&) = delete;
+  BaseClpCursor(BaseClpCursor&&) = delete;
+  BaseClpCursor& operator=(BaseClpCursor&&) = delete;

95-100: Mark row-count accessor nodiscard.

Helps catch ignored results at call sites.

-  virtual size_t getNumFilteredRows() const = 0;
+  [[nodiscard]] virtual size_t getNumFilteredRows() const = 0;
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)

37-41: Disable copy/move; optionally make dtor noexcept.

Avoid shared ArchiveReader/query state aliasing.

 class ClpArchiveCursor final : public BaseClpCursor {
  public:
   explicit ClpArchiveCursor(
       clp_s::InputSource inputSource,
       std::string_view splitPath);
-  ~ClpArchiveCursor() override;
+  ~ClpArchiveCursor() noexcept override;
+
+  ClpArchiveCursor(const ClpArchiveCursor&) = delete;
+  ClpArchiveCursor& operator=(const ClpArchiveCursor&) = delete;
+  ClpArchiveCursor(ClpArchiveCursor&&) = delete;
+  ClpArchiveCursor& operator=(ClpArchiveCursor&&) = delete;

68-74: Mark projected-columns accessor nodiscard.

It’s almost always consumed; this catches accidental ignores.

-  const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;
+  [[nodiscard]] const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;

60-61: Avoid shared_ptr for an internal scratch vector.

Ownership is local; pass a reference to reduce heap/refcount overhead. This change ripples to ClpQueryRunner.

-  std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_ =
-      std::make_shared<std::vector<uint64_t>>();
+  std::vector<uint64_t> filteredRowIndices_;

Update signatures:

  • ClpQueryRunner::fetchNext(uint64_t, std::vector<uint64_t>&)
  • ClpArchiveVectorLoader ctor to take std::vector<uint64_t>& or const&
  • All call sites to pass the vector by reference
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1c091c7 and 7f25210.

📒 Files selected for processing (6)
  • velox/connectors/clp/ClpDataSource.cpp (3 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/BaseClpCursor.h (3 hunks)
  • velox/connectors/clp/search_lib/archive/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
📚 Learning: 2025-03-10T05:55:18.722Z
Learnt from: anlowee
PR: y-scope/velox#1
File: CMakeLists.txt:458-483
Timestamp: 2025-03-10T05:55:18.722Z
Learning: When contributing to the Velox project, prefer adding new code on top of existing scripts rather than restructuring the original code organization, especially for dependency management.

Applied to files:

  • velox/connectors/clp/search_lib/archive/CMakeLists.txt
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.

Applied to files:

  • velox/connectors/clp/search_lib/BaseClpCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/ClpDataSource.cpp
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/BaseClpCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/ClpDataSource.cpp
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/BaseClpCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/ClpDataSource.cpp
🧬 Code graph analysis (5)
velox/connectors/clp/search_lib/BaseClpCursor.h (4)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
  • fetchNext (46-90)
  • fetchNext (46-46)
  • getNumFilteredRows (92-94)
  • getNumFilteredRows (92-92)
  • loadSplit (119-232)
  • loadSplit (119-119)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (60-78)
  • fetchNext (60-62)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)
  • getNumFilteredRows (45-58)
  • clp_s (21-23)
  • clp_s (25-28)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
  • preprocessQuery (42-77)
  • preprocessQuery (42-42)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
  • clp_s (27-29)
  • clp_s (31-33)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
  • clp_s (21-23)
  • clp_s (25-28)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
  • clp_s (27-29)
  • clp_s (31-33)
  • currentSplitLoaded_ (122-135)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (4)
  • clp_s (21-23)
  • clp_s (25-28)
  • readerIndex_ (65-65)
  • getNumFilteredRows (45-58)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
  • vector (76-78)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (60-78)
  • fetchNext (60-62)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (5)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
  • clp_s (27-29)
  • clp_s (31-33)
  • BaseClpCursor (68-137)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)
  • clp_s (28-30)
  • facebook (32-69)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
  • clp_s (21-26)
  • clp_s (28-31)
  • clp_s (33-35)
  • facebook (37-90)
  • ClpQueryRunner (43-88)
  • vector (76-78)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (14)
  • ClpArchiveCursor (34-38)
  • ClpArchiveCursor (40-44)
  • fetchNext (46-90)
  • fetchNext (46-46)
  • getNumFilteredRows (92-94)
  • getNumFilteredRows (92-92)
  • createVector (96-108)
  • createVector (96-99)
  • loadSplit (119-232)
  • loadSplit (119-119)
  • getProjectedColumns (110-117)
  • getProjectedColumns (111-111)
  • createVectorHelper (234-264)
  • createVectorHelper (234-238)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
  • fetchNext (60-78)
  • fetchNext (60-62)
velox/connectors/clp/ClpDataSource.cpp (2)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
  • splitPath (67-73)
  • splitPath (67-70)
velox/connectors/clp/ClpDataSource.h (1)
  • completedRows_ (81-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (8)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)

19-26: Includes look correct and canonical.

Using fully-qualified include paths and required headers is consistent with neighbouring files.


34-38: Constructor string_view is safe via BaseClpCursor’s owning copy.

Forwarding std::string_view is fine since BaseClpCursor stores an owning std::string.


80-83: Confirm fetchNext()’s contract (rows scanned vs. rows produced).

You return rowsScanned but the caller consumes getNumFilteredRows() for materialization. Verify upstream (ClpDataSource) expects this; otherwise return the produced count to avoid confusion.

velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)

24-27: Link interface is correct.

Making velox_vector PUBLIC ensures headers propagate to dependants.

velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)

17-26: Includes are tight and correct.

added; unused headers removed earlier. Looks good.

velox/connectors/clp/ClpDataSource.cpp (2)

23-24: Includes normalization looks good.

Paths are now consistent with the project’s absolute include style.


71-99: ARRAY handling: confirm only arrays-of-strings are possible.

Loader supports array-of-strings; if other element types can appear, we need element-aware mapping and loaders.

Would you like a follow-up patch to enforce/validate element types at planning time?

velox/connectors/clp/search_lib/BaseClpCursor.h (1)

19-26: Includes and transitive dependency usage look correct.

<retrieved_learnings> about InputSource via InputConfig.hpp and string_view→string copy pattern were applied appropriately.

@anlowee anlowee requested a review from wraymo September 4, 2025 22:01

#include <glog/logging.h>

#include "velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we group the headers?

int32_t currentSchemaId_{-1};
size_t currentSchemaIndex_{0};
bool currentSchemaTableLoaded_{false};
std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_ =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use = here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably place it in the constructor

/// column reader. It supports various column types including integers, floats,
/// booleans, strings, and arrays of strings.
class ClpVectorLoader : public VectorLoader {
class ClpArchiveVectorLoader : public VectorLoader {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add archive to docstring?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (2)

264-268: Use-after-free risk: to_json_string() temporary backing store.

simdjson::to_json_string(...) returns a temporary whose storage may not outlive the statement; storing string_view to it is unsafe. Use raw_json() (points into the padded input) or materialise to std::string before set().

-        std::vector<std::string_view> arrayElements;
-        for (auto arrayElement : array) {
-          arrayElements.emplace_back(
-              simdjson::to_json_string(arrayElement).value());
-        }
+        std::vector<std::string_view> arrayElements;
+        arrayElements.reserve(8);
+        for (auto value : array) {
+          // raw_json() views the original padded input; valid until 'padded' goes out of scope.
+          simdjson::raw_json_string r = value.raw_json();
+          arrayElements.emplace_back(std::string_view(r));
+        }

If you prefer fully normalised JSON strings irrespective of input layout, copy:

-          arrayElements.emplace_back(std::string_view(r));
+          std::string s = std::string(std::string_view(r));
+          arrayElements.emplace_back(*elements->stringAllocator()->copy(s));

202-207: Handle ValueHook in loadInternal
The hook parameter is currently unused in ClpArchiveVectorLoader::loadInternal, violating the VectorLoader contract that non-null hooks must be called for each loaded value. Either implement hook support or add a VELOX_CHECK_NULL(hook); to enforce only null hooks. (velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp:205–210)

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)

80-86: Potential issue with schema advancement logic.

The current logic advances to the next schema whenever filteredRowIndices_ is empty, but this could skip remaining data in a schema if a batch returns no matches. Schema advancement should only occur when rowsScanned == 0 (indicating end-of-schema).

Consider this fix to properly handle schema advancement:

     auto rowsScanned = queryRunner_->fetchNext(numRows, filteredRowIndices_);
     if (false == filteredRowIndices_->empty()) {
       return rowsScanned;
     }
 
-    currentSchemaIndex_ += 1;
-    currentSchemaTableLoaded_ = false;
+    // Only advance schema if we've reached its end (no more rows to scan)
+    if (rowsScanned == 0) {
+      currentSchemaIndex_ += 1;
+      currentSchemaTableLoaded_ = false;
+    }
♻️ Duplicate comments (11)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (3)

19-19: Hide simdjson from the public header; move TLS parser to the .cpp.

Avoid leaking a heavy dependency and inline TLS initialisation in a public header. Forward-declare the parser type here and define/initialise the TLS in the .cpp.

-#include <simdjson.h>
+// Avoid exposing simdjson in the public header.
+namespace simdjson { namespace ondemand { class parser; } }

@@
-  inline static thread_local std::unique_ptr<simdjson::ondemand::parser>
-      arrayParser_ = std::make_unique<simdjson::ondemand::parser>();
+  // Defined and initialised in ClpArchiveVectorLoader.cpp
+  static thread_local std::unique_ptr<simdjson::ondemand::parser> arrayParser_;

And in ClpArchiveVectorLoader.cpp add:

+#include <simdjson.h>
@@
+thread_local std::unique_ptr<simdjson::ondemand::parser>
+    facebook::velox::connector::clp::search_lib::ClpArchiveVectorLoader::arrayParser_ =
+        std::make_unique<simdjson::ondemand::parser>();

Also applies to: 46-48


40-44: Fix filteredRowIndices move semantics; make pointee const.

The ctor takes a const shared_ptr and the member is non-const, so std::move in the .cpp won’t move. Pass by value (non-const) and store as shared_ptr<const vector<uint64_t>> to document immutability and enable efficient moves.

-  ClpArchiveVectorLoader(
-      clp_s::BaseColumnReader* columnReader,
-      ColumnType nodeType,
-      const std::shared_ptr<std::vector<uint64_t>> filteredRowIndices);
+  ClpArchiveVectorLoader(
+      clp_s::BaseColumnReader* columnReader,
+      ColumnType nodeType,
+      std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices);
@@
-  std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_;
+  std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices_;

Also applies to: 49-52


49-50: Clarify ownership/lifetime of columnReader_.

Raw pointer suggests non-ownership. Add a brief comment or switch to gsl::not_null<BaseColumnReader*>.

-  clp_s::BaseColumnReader* columnReader_;
+  // Non-owning. Must remain valid for this object's lifetime.
+  clp_s::BaseColumnReader* columnReader_;
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (2)

248-255: Propagate simdjson include and TLS definition into the .cpp.

Once the header stops including simdjson and the TLS is only declared there, this TU must include simdjson.h and define arrayParser_.

+#include <simdjson.h>
@@
+thread_local std::unique_ptr<simdjson::ondemand::parser>
+    facebook::velox::connector::clp::search_lib::ClpArchiveVectorLoader::arrayParser_ =
+        std::make_unique<simdjson::ondemand::parser>();

124-131: Enable actual move of filteredRowIndices and enforce const pointee.

With a const shared_ptr parameter, std::move performs a copy. Align the signature with the header change and move into the member.

-ClpArchiveVectorLoader::ClpArchiveVectorLoader(
-    clp_s::BaseColumnReader* columnReader,
-    ColumnType nodeType,
-    const std::shared_ptr<std::vector<uint64_t>> filteredRowIndices)
+ClpArchiveVectorLoader::ClpArchiveVectorLoader(
+    clp_s::BaseColumnReader* columnReader,
+    ColumnType nodeType,
+    std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices)
     : columnReader_(columnReader),
       nodeType_(nodeType),
       filteredRowIndices_(std::move(filteredRowIndices)) {}
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)

33-39: Return ErrorCode from executeQuery for better ergonomics.

Callers can short-circuit on errors instead of probing state.

-void BaseClpCursor::executeQuery(
+ErrorCode BaseClpCursor::executeQuery(
     const std::string& query,
     const std::vector<Field>& outputColumns) {
   query_ = query;
   outputColumns_ = outputColumns;
   errorCode_ = preprocessQuery();
+  return errorCode_;
 }

54-56: DRY the repeated “standardise + logically false” checks.

Factor a tiny helper/lambda to apply a pass and test for EmptyExpr.

-  OrOfAndForm standardizePass;
-  if (expr_ = standardizePass.run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+  auto applyPass = [&](auto& pass) -> bool {
+    expr_ = pass.run(expr_);
+    return static_cast<bool>(std::dynamic_pointer_cast<EmptyExpr>(expr_));
+  };
+  OrOfAndForm standardizePass;
+  if (applyPass(standardizePass)) {
     VLOG(2) << "Query '" << query_ << "' is logically false";
     return ErrorCode::LogicalError;
   }
 
-  NarrowTypes narrowPass;
-  if (expr_ = narrowPass.run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+  NarrowTypes narrowPass;
+  if (applyPass(narrowPass)) {
     VLOG(2) << "Query '" << query_ << "' is logically false";
     return ErrorCode::LogicalError;
   }
 
-  ConvertToExists convertPass;
-  if (expr_ = convertPass.run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+  ConvertToExists convertPass;
+  if (applyPass(convertPass)) {
     VLOG(2) << "Query '" << query_ << "' is logically false";
     return ErrorCode::LogicalError;
   }

Also applies to: 61-63, 68-70

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)

67-67: Consider marking getProjectedColumns() with [[nodiscard]].

Since this accessor returns a reference that callers typically consume immediately, adding [[nodiscard]] would help catch accidental ignores at compile time.

-  const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;
+  [[nodiscard]] const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;

36-41: Make the cursor non-copyable and non-movable to prevent resource management issues.

Since the cursor manages shared resources like archiveReader_ and filteredRowIndices_, allowing copy or move operations could lead to aliasing issues and potential double-close hazards.

Add these declarations after the destructor:

   ~ClpArchiveCursor() override;
 
+  ClpArchiveCursor(const ClpArchiveCursor&) = delete;
+  ClpArchiveCursor& operator=(const ClpArchiveCursor&) = delete;
+  ClpArchiveCursor(ClpArchiveCursor&&) = delete;
+  ClpArchiveCursor& operator=(ClpArchiveCursor&&) = delete;
+
   uint64_t fetchNext(uint64_t numRows) override;
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)

120-122: Fix C++17 compatibility issue with designated initializers.

Designated initializers require C++20, but Velox is configured for C++17 (CMAKE_CXX_STANDARD 17). This will cause a compilation error.

Apply this fix to make it C++17-compatible:

-  auto networkAuthOption = inputSource_ == InputSource::Filesystem
-      ? NetworkAuthOption{.method = AuthMethod::None}
-      : NetworkAuthOption{.method = AuthMethod::S3PresignedUrlV4};
+  NetworkAuthOption networkAuthOption{};
+  networkAuthOption.method = (inputSource_ == InputSource::Filesystem)
+      ? AuthMethod::None
+      : AuthMethod::S3PresignedUrlV4;

249-250: Remove unnecessary null allocation for LazyVector.

Pre-allocating null bits for the base vector is wasteful since LazyVector will handle materialization. This unnecessarily allocates memory that may not be used.

   auto vector = BaseVector::create(vectorType, vectorSize, pool);
-  vector->setNulls(allocateNulls(vectorSize, pool, bits::kNull));
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7f25210 and e71a957.

📒 Files selected for processing (5)
  • velox/connectors/clp/search_lib/BaseClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (8 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.

Applied to files:

  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
  • velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
🧬 Code graph analysis (1)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)
velox/vector/LazyVector.h (1)
  • VectorLoader (177-210)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (7)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)

35-38: Docstring accurately scopes to “archives”.

Good clarification aligning with the new split types.


33-34: Forward-declaring ColumnType in the header is appropriate.

Definition is only needed in the .cpp where enum values are used.

velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)

289-301: Type dispatch for Timestamp looks correct.

Covers Integer, DateString, Float, and falls back to Unknown.

velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)

41-52: Parse and early-empty checks look good.

Clear logging and appropriate ErrorCode mapping.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1)

45-45: Mark getNumFilteredRows() as already const.

Good work on making this method const-qualified. This properly indicates that it doesn't modify the object state.

velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)

226-230: Good placement of stream opening and query runner initialization.

The code now properly opens packed streams and initializes the query runner within loadSplit(), setting currentSplitLoaded_ only after successful initialization. This ensures proper resource management and prevents partial initialization issues.


40-44: Proper resource cleanup in destructor.

The destructor correctly checks currentSplitLoaded_ before closing the archive reader, and with the stream opening now in loadSplit(), this prevents resource leaks.

Comment on lines +56 to +65
std::shared_ptr<clp_s::ArchiveReader> archiveReader_;
int32_t currentSchemaId_{-1};
size_t currentSchemaIndex_{0};
bool currentSchemaTableLoaded_{false};
std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_;
std::vector<int32_t> matchedSchemas_;
std::shared_ptr<clp_s::search::Projection> projection_;
std::shared_ptr<ClpQueryRunner> queryRunner_;
size_t readerIndex_{0};
std::shared_ptr<clp_s::search::SchemaMatch> schemaMatch_;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider using a plain vector instead of shared_ptr for filteredRowIndices_.

The filteredRowIndices_ appears to have local ownership within this class. Using std::vector<uint64_t> directly would eliminate unnecessary heap indirection and reference counting overhead.

If ClpQueryRunner::fetchNext doesn't require shared ownership, consider refactoring to use a plain vector and pass by reference. This would require updating the ClpQueryRunner API.

🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h around lines
56–65, replace the field declared as std::shared_ptr<std::vector<uint64_t>>
filteredRowIndices_ with a plain std::vector<uint64_t> filteredRowIndices_
because ownership is local; update all places that assigned or accessed
filteredRowIndices_ to use the vector directly (initialize empty, use .clear(),
.empty(), size(), operator[]/at(), iterators), change any nullptr checks to
emptiness checks, and update calls into ClpQueryRunner::fetchNext (and its
signature) to accept a reference or const-reference to std::vector<uint64_t>
instead of a shared_ptr so you avoid unnecessary heap indirection and
refcounting; finally remove any now-unused shared_ptr includes and ensure
compilation by adjusting includes and call sites.

@anlowee anlowee requested a review from wraymo September 5, 2025 15:04
Copy link
Member

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@anlowee anlowee changed the title refactor: Extract ClpCursor interface and modify ClpVectorLoader to prepare support IR integration. refactor: Extract BaseClpCursor interface to support upcoming IR integration. Sep 5, 2025
@anlowee anlowee merged commit b8269e4 into y-scope:presto-0.293-clp-connector Sep 5, 2025
8 of 9 checks passed
@anlowee anlowee deleted the xwei/ir-integration-phase2-part1 branch September 5, 2025 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants