-
Notifications
You must be signed in to change notification settings - Fork 0
refactor: Extract BaseClpCursor
interface to support upcoming IR integration.
#25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Extract BaseClpCursor
interface to support upcoming IR integration.
#25
Conversation
…ementation to ClpArchiveCursor
WalkthroughAdds split-oriented CLP support: introduces BaseClpCursor and ClpArchiveCursor, moves vector loading to ClpArchiveVectorLoader, updates ClpDataSource to instantiate archive cursors for kArchive splits, reorganizes search_lib into an archive subtarget, and updates the CLP dependency pin. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Exec as Exec Plan
participant DS as ClpDataSource
participant Cur as ClpArchiveCursor
participant QR as ClpQueryRunner
participant AR as ArchiveReader
Exec->>DS: addSplit(split: kArchive, path)
DS->>Cur: construct(inputSource, splitPath)
Cur->>AR: loadSplit(): open_packed_streams(splitPath)
Cur->>QR: init(query, reader, projection)
Exec->>DS: next(numRows)
DS->>Cur: fetchNext(numRows)
Cur-->>DS: rowsFetched (count)
DS->>Cur: createVector(pool, vectorType, vectorSize)
Cur->>Cur: build vectors via ClpArchiveVectorLoader
Cur-->>DS: RowVector
DS-->>Exec: RowVector
sequenceDiagram
autonumber
actor Exec as Exec Plan
participant DS as ClpDataSource
participant B as BaseClpCursor
Exec->>DS: push-down query + outputColumns
DS->>B: executeQuery(query, outputColumns)
B->>B: preprocessQuery(): parse KQL -> transforms (OrOfAndForm, NarrowTypes, ConvertToExists)
alt parse/transform fails
B-->>DS: errorCode != Success
else
B-->>DS: Success
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)
14-20
: CMake target wiring LGTM; minor optional improvement for headersThe target_sources and include dirs look correct. Optional: expose headers via a FILE_SET to improve IDE integration and installation:
target_sources(clp-s-search PRIVATE ClpArchiveCursor.cpp ClpQueryRunner.cpp PUBLIC FILE_SET HEADERS FILES ClpArchiveCursor.h ClpQueryRunner.h )velox/connectors/clp/search_lib/BaseClpCursor.h (1)
68-73
: Update doc to be split-agnostic.The comment still says “archive”, but BaseClpCursor now abstracts split handling (archive and future IR). Consider rewording to “split”.
Suggested rewording:
“A query execution interface that manages the lifecycle of a query on a CLP-S split …”velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)
67-73
: Don’t set errorCode_ on per-schema failure; it contaminates subsequent callsSetting errorCode_ = DictionaryNotFound on a single schema_init failure can cause future fetchNext calls to short-circuit even if later schemas succeed. The code already resets errorCode_ on successful schema table load; better to avoid setting an error for per-schema skips.
Apply this diff:
if (EvaluatedValue::False == queryRunner_->schema_init(currentSchemaId_)) { currentSchemaIndex_ += 1; currentSchemaTableLoaded_ = false; - errorCode_ = ErrorCode::DictionaryNotFound; continue; }
109-116
: Open packed streams and set loaded-flag inside loadSplit; close on failure; update log messageCentralise initialisation and make it exception-safe. This also aligns the terminology with the split-centric model.
Apply this diff:
- try { - archiveReader_->open( - get_path_object_for_raw_path(splitPath_), networkAuthOption); - } catch (std::exception& e) { - VLOG(2) << "Failed to open archive file: " << e.what(); - return ErrorCode::InternalError; - } + try { + archiveReader_->open( + get_path_object_for_raw_path(splitPath_), networkAuthOption); + archiveReader_->open_packed_streams(); + currentSplitLoaded_ = true; + } catch (std::exception& e) { + // Ensure we don’t leak a partially opened archive. + archiveReader_->close(); + VLOG(2) << "Failed to open split: " << e.what(); + return ErrorCode::InternalError; + }
198-202
: Remove duplicate timestamp index evaluationEvaluateTimestampIndex is already run earlier (Lines 121–126). Re-running it here is redundant and wastes cycles.
Apply this diff:
- EvaluateTimestampIndex timestamp_index(timestampDict); - if (EvaluatedValue::False == timestamp_index.run(expr_)) { - VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; - return ErrorCode::InvalidTimestampRange; - }
128-131
: Improve readability by separating assignment from conditionUsing an init-statement with an assignment inside the if hampers readability. Separate the steps.
Apply this diff:
- if (expr_ = schemaMatch_->run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + expr_ = schemaMatch_->run(expr_); + if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { VLOG(2) << "No matching schemas for query '" << query_ << "'"; return ErrorCode::SchemaNotFound; }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (11)
velox/connectors/clp/ClpDataSource.cpp
(3 hunks)velox/connectors/clp/ClpDataSource.h
(2 hunks)velox/connectors/clp/search_lib/BaseClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/BaseClpCursor.h
(3 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.h
(3 hunks)velox/connectors/clp/search_lib/archive/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
(2 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
(1 hunks)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp
(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (6)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
clp_s
(25-29)clp_s
(31-34)clp_s
(36-38)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (4)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
vector
(76-78)clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)ClpQueryRunner
(43-88)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (8)
ClpArchiveCursor
(32-36)ClpArchiveCursor
(38-42)fetchNext
(44-93)fetchNext
(44-46)getProjectedColumns
(95-102)getProjectedColumns
(96-96)loadSplit
(104-211)loadSplit
(104-104)velox/connectors/clp/search_lib/BaseClpCursor.h (4)
BaseClpCursor
(73-140)clp_s
(25-29)clp_s
(31-34)clp_s
(36-38)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(61-79)fetchNext
(61-63)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
clp_s
(25-29)clp_s
(31-34)clp_s
(36-38)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (3)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(61-79)fetchNext
(61-63)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
BaseClpCursor
(73-140)
velox/connectors/clp/ClpDataSource.cpp (1)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
splitPath
(67-73)splitPath
(67-70)
velox/connectors/clp/search_lib/BaseClpCursor.h (5)
velox/connectors/clp/ClpConnectorSplit.h (1)
SplitType
(42-47)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
fetchNext
(44-93)fetchNext
(44-46)getProjectedColumns
(95-102)getProjectedColumns
(96-96)loadSplit
(104-211)loadSplit
(104-104)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(61-79)fetchNext
(61-63)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
vector
(76-78)clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
preprocessQuery
(41-76)preprocessQuery
(41-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (15)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)
17-17
: Local include path change looks goodSwitching to a local include is consistent with the new archive target exposing its include dir. No functional impact.
velox/connectors/clp/ClpDataSource.h (1)
105-105
: Switch to BaseClpCursor pointer aligns with the new split-centric designUsing the base interface here is the right abstraction and keeps the header independent of archive/IR-specific implementations.
velox/connectors/clp/search_lib/CMakeLists.txt (4)
17-18
: Good modularization: Base cursor added to target.Adding BaseClpCursor.{h,cpp} to clp-s-search aligns with the new split-centric design. No issues spotted.
26-27
: Include directory scope: confirm consumers see search_lib headers.Making ${CMAKE_CURRENT_SOURCE_DIR} public is correct for exposing search_lib headers. Ensure any targets including headers as "search_lib/..." or "archive/..." have this directory (and archive/) on their include path via target linkage to clp-s-search or equivalent. Otherwise, prefer absolute includes (velox/connectors/clp/...) in consumers.
Would you like me to scan the repo for relative includes that rely on this particular include path and propose normalizations?
28-29
: Subtarget integration looks correct.Including the archive subdirectory keeps responsibilities separated and the build tidy. Good call.
30-39
: Link interface looks complete; C++20 requirement is fine.Linking clp_s::{archive_reader,clp_dependencies,io,search,search::kql} and velox_vector matches the usage in BaseClpCursor/ClpArchiveCursor. The cxx_std_20 requirement covers the if-with-initializer used in BaseClpCursor.cpp. LGTM.
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
41-48
: Robust parse failure handling is good.Parsing KQL via parse_kql_expression and returning InvalidQuerySyntax on nullptr is correct. VLOG(2) level is appropriate for diagnostics.
68-75
: Final pass and success path look good.ConvertToExists as the final pass and returning Success when not EmptyExpr is consistent with the described pipeline.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1)
30-52
: Interface and state layout are coherent.Overrides for fetchNext/getProjectedColumns and split loading hooks match the base contract. Member state mirrors the archive scanning lifecycle. Looks good.
velox/connectors/clp/ClpDataSource.cpp (3)
106-115
: S3 URL construction path looks correct; consider guarding against null auth provider.The FS/Network selection and S3 URL construction make sense. If ClpConfig::kS3 is set, ensure s3AuthProvider_ is not null to avoid a crash.
Optionally add:
} else if (ClpConfig::StorageType::kS3 == storageType_) { inputSource = clp_s::InputSource::Network; - splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_); + VELOX_CHECK_NOT_NULL(s3AuthProvider_, "S3 storage selected but S3 auth provider is null"); + splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);
117-122
: Explicit split-type gating is good.Routing kArchive splits to ClpArchiveCursor and failing fast on unsupported types provides a clear error surface for future IR integration.
166-170
: Passing splitType to ClpVectorLoader is correct.Threading cursor_->getSplitType() into the loader matches the new split-aware loader signature. LGTM.
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
135-139
: Private preprocessQuery hook is correct.Keeping preprocessQuery private to the base and invoked via executeQuery centralizes validation. No issues.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)
19-20
: Include rename aligns with split-centric refactor — LGTMHeader inclusion reflects the new archive-scoped cursor class and is consistent with the PR direction.
95-102
: getProjectedColumns fallback is safe and efficient — LGTMReturning a static empty vector avoids dangling references and extra allocations when the runner isn’t initialised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)
245-287
: Fix lifetime bug: don’t stash string_view to temporary JSON stringsto_json_string(...).value() returns a temporary std::string. Storing std::string_view to it produces a dangling reference. Use std::string to hold the values until set() copies into the vector’s string buffer.
- std::vector<std::string_view> arrayElements; - for (auto arrayElement : array) { - arrayElements.emplace_back( - simdjson::to_json_string(arrayElement).value()); - } + std::vector<std::string> arrayElements; + for (auto value : array) { + auto json = simdjson::to_json_string(value); + if (json.error()) { + VELOX_FAIL( + "Failed to serialize JSON array element at row {}: {}", + vectorIndex, + json.error_message()); + } + arrayElements.emplace_back(std::move(json.value())); + } @@ - arrayVector->setOffsetAndSize( - vectorIndex, elementIndex, arrayElements.size()); - for (auto& arrayElement : arrayElements) { - elements->set(elementIndex++, StringView(arrayElement)); - } + arrayVector->setOffsetAndSize( + vectorIndex, elementIndex, arrayElements.size()); + for (const auto& elem : arrayElements) { + elements->set(elementIndex++, StringView(elem)); + }velox/connectors/clp/search_lib/BaseClpCursor.h (1)
61-66
: Update docstring to reflect “split”-centric design.The class comment still speaks to “archive” explicitly. Given this is the split-agnostic base, consider updating terminology to “split” (archive or IR) to prevent confusion.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)
99-111
: Close the archive on any early return after open (RAII guard).Once open() succeeds, subsequent early returns currently skip close(), leaking the handle. Add a scope guard after a successful open so all non-success returns close the archive automatically.
Option A (preferred, RAII via folly):
#include <glog/logging.h> +#include <folly/ScopeGuard.h> ... try { archiveReader_->open( get_path_object_for_raw_path(splitPath_), networkAuthOption); } catch (std::exception& e) { VLOG(2) << "Failed to open archive file: " << e.what(); return ErrorCode::InternalError; } + // Ensure we close the archive on any early return after a successful open. + auto closeGuard = folly::makeGuard([&]() noexcept { archiveReader_->close(); }); auto timestampDict = archiveReader_->get_timestamp_dictionary();Later, just before returning Success:
queryRunner_->global_init(); - return ErrorCode::Success; + closeGuard.dismiss(); + return ErrorCode::Success;Option B (no extra dependency): explicitly call close() before each early return after open(). For example:
if (clp_s::EvaluatedValue::False == timestampIndex.run(expr_)) { VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; - return ErrorCode::InvalidTimestampRange; + archiveReader_->close(); + return ErrorCode::InvalidTimestampRange; }Repeat for other error returns after open (SchemaNotFound, InternalError in projection loop, empty matchedSchemas, second timestamp check, etc.).
123-127
: Improve readability by separating assignment from the condition.The if-init with assignment is valid but harder to read. Splitting it clarifies intent.
Apply:
- if (expr_ = schemaMatch_->run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + expr_ = schemaMatch_->run(expr_); + if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { VLOG(2) << "No matching schemas for query '" << query_ << "'"; return ErrorCode::SchemaNotFound; }
193-197
: Drop redundant second timestamp-range evaluation (minor perf/clarity).The timestamp index is evaluated twice against the same dictionary and expr_. The first check already early-outs before heavier work; the second appears redundant.
Apply:
- EvaluateTimestampIndex timestamp_index(timestampDict); - if (EvaluatedValue::False == timestamp_index.run(expr_)) { - VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; - return ErrorCode::InvalidTimestampRange; - }Alternatively, add a brief comment justifying the re-check if there’s a subtle reason to run it again.
♻️ Duplicate comments (9)
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)
128-134
: Make filteredRowIndices const to encode read-only semanticsThe loader only reads indices. Using std::shared_ptr<const std::vector<uint64_t>> prevents accidental mutation.
ClpVectorLoader::ClpVectorLoader( clp_s::BaseColumnReader* columnReader, ColumnType nodeType, - std::shared_ptr<std::vector<uint64_t>> filteredRowIndices, + std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices, ClpConnectorSplit::SplitType splitType) : columnReader_(columnReader), nodeType_(nodeType), filteredRowIndices_(std::move(filteredRowIndices)), splitType_(splitType) {}Note: Update the declaration and member type in ClpVectorLoader.h accordingly (see header comment).
velox/connectors/clp/search_lib/ClpVectorLoader.h (2)
23-23
: Fix include path to ClpConnectorSplit.h to match project conventionsOmitting the velox/ prefix will break builds when the include root isn’t the velox folder.
-#include "connectors/clp/ClpConnectorSplit.h" +#include "velox/connectors/clp/ClpConnectorSplit.h"
45-47
: Const-qualify filteredRowIndices and mark splitType_ maybe_unused (until used)
- filteredRowIndices is read-only; make it std::shared_ptr<const std::vector<uint64_t>>.
- splitType_ is currently stored but unused; silence potential warnings and document intent.
ClpVectorLoader( clp_s::BaseColumnReader* columnReader, ColumnType nodeType, - std::shared_ptr<std::vector<uint64_t>> filteredRowIndices, + std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices, ClpConnectorSplit::SplitType splitType); @@ - std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_; - ClpConnectorSplit::SplitType splitType_; + std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices_; + [[maybe_unused]] ClpConnectorSplit::SplitType splitType_;Optional: add a short TODO near splitType_ describing the planned use (selecting marshaling/population strategies by split type).
Also applies to: 65-67
velox/connectors/clp/search_lib/BaseClpCursor.cpp (3)
17-24
: Add missing include for std::istringstreamThis TU uses std::istringstream but doesn’t include , causing a compile error on stricter compilers.
#include "velox/connectors/clp/search_lib/BaseClpCursor.h" +#include <sstream> + #include "clp_s/search/ast/ConvertToExists.hpp" #include "clp_s/search/ast/EmptyExpr.hpp" #include "clp_s/search/ast/NarrowTypes.hpp" #include "clp_s/search/ast/OrOfAndForm.hpp" #include "clp_s/search/kql/kql.hpp"Also applies to: 39-41
31-37
: Consider returning ErrorCode from executeQuery to enable call-site short-circuitingReturning errorCode_ makes it trivial to skip fetchNext() on failure and simplifies tests.
-void BaseClpCursor::executeQuery( +ErrorCode BaseClpCursor::executeQuery( const std::string& query, const std::vector<Field>& outputColumns) { query_ = query; outputColumns_ = outputColumns; errorCode_ = preprocessQuery(); + return errorCode_; }Follow-up: update the declaration in BaseClpCursor.h and call sites.
52-57
: If-initializer idiom is valid; consider a tiny readability tweakThe assignment-in-if pattern is fine in C++17. If you prefer, split assignment and condition or add a brief comment—purely optional.
Also applies to: 59-64, 66-71
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
22-23
: Add missing standard headers for used types (self-contained header hygiene).This header uses std::string_view, std::shared_ptr, and uint64_t but doesn’t include <string_view>, , or . Headers should be self-contained to avoid brittle transitive-includes.
Apply this diff near the existing includes:
#include <string> #include <vector> +#include <cstdint> +#include <memory> +#include <string_view> #include "clp_s/InputConfig.hpp" #include "velox/connectors/clp/ClpConnectorSplit.h"
104-109
: Allow derived classes to set split type without mutating members directly.Expose a protected setter (or make splitType_ const and pass via base ctor). This avoids direct member mutation in derived constructors and clarifies intended usage.
Apply this diff in the protected section:
/// @return The error code. virtual ErrorCode loadSplit() = 0; + // Allow derived classes to set the split type explicitly. + void setSplitType(ClpConnectorSplit::SplitType t) { + splitType_ = t; + }Optionally, assert it’s set once if you expect immutability after construction.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)
206-211
: Good: moved open_packed_streams into loadSplit and set loaded flag after success.This centralises initialisation and reduces leak/consistency risks. With the earlier RAII guard, failure cases will be fully safe as well.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
velox/connectors/clp/ClpDataSource.cpp
(3 hunks)velox/connectors/clp/search_lib/BaseClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/BaseClpCursor.h
(3 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.cpp
(2 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.h
(4 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
(3 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
Applied to files:
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/search_lib/BaseClpCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (6)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
32-33
: LGTM: Forward declaration for ClpQueryRunner fixes inverted dependencyThis local forward-declare avoids forcing BaseClpCursor to include archive headers. Good dependency hygiene.
36-39
: Ensure splitPath’s lifetime is owned downstreamTaking std::string_view is ideal for call-sites; ensure the implementation copies into an owning std::string (as BaseClpCursor does per earlier learning) to avoid dangling references.
Would you like me to scan ClpArchiveCursor.cpp to confirm the splitPath_ storage pattern matches BaseClpCursor’s?
velox/connectors/clp/ClpDataSource.cpp (2)
23-26
: LGTM: Normalised include paths to project-rooted formConsistent velox/…-prefixed includes improve portability across build targets.
166-170
: LGTM: Propagating split type into ClpVectorLoaderThis wires the split-centric model cleanly and prepares for IR-specific marshaling.
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
68-76
: Constructor correctly owns splitPath (good).Converting std::string_view to std::string in the member initializer avoids dangling references and matches the pattern noted in prior learnings.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)
33-37
: Ctor delegation to BaseClpCursor is correct; splitPath ownership is safe.Forwarding std::string_view to the base that copies into an owning std::string is safe and aligns with the established pattern.
0d064d5
to
081b9ca
Compare
081b9ca
to
7f40b7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have the full code of the IR implementation based on the current refactoring code?
Not yet, that will be in the next PR. But already have part of them, still need to finish the marshal part. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
velox/connectors/clp/ClpDataSource.cpp (2)
125-131
: Surface query preprocessing failures to the pipeline (after executeQuery refactor)If you adopt returning ErrorCode from BaseClpCursor::executeQuery (see my other comment), check it here and fail early with a user error instead of silently returning no rows later.
Example (assuming executeQuery returns ErrorCode):
- if (pushDownQuery && !pushDownQuery->empty()) { - cursor_->executeQuery(*pushDownQuery, fields_); - } else { - cursor_->executeQuery("*", fields_); - } + const auto ec = (pushDownQuery && !pushDownQuery->empty()) + ? cursor_->executeQuery(*pushDownQuery, fields_) + : cursor_->executeQuery("*", fields_); + VELOX_USER_CHECK_EQ( + ec, + search_lib::ErrorCode::Success, + "Failed to preprocess KQL query (error code {}): {}", + static_cast<int>(ec), + pushDownQuery ? *pushDownQuery : "*");
176-177
: Micro-optimisation: reserve filteredRows capacityWe know the limit upfront (size). Reserving avoids reallocations on selective workloads.
- auto filteredRows = std::make_shared<std::vector<uint64_t>>(); + auto filteredRows = std::make_shared<std::vector<uint64_t>>(); + filteredRows->reserve(size);velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)
116-121
: Close the archive on all early-return failure paths to avoid handle leaksloadSplit() opens the archive, then returns early on several error branches before currentSplitLoaded_ is set, skipping close() in the dtor. Ensure archiveReader_->close() is called before each early return.
Minimal fix (repeat close() before each return shown below):
EvaluateTimestampIndex timestampIndex(timestampDict); if (clp_s::EvaluatedValue::False == timestampIndex.run(expr_)) { VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; - return ErrorCode::InvalidTimestampRange; + archiveReader_->close(); + return ErrorCode::InvalidTimestampRange; } schemaMatch_ = std::make_shared<SchemaMatch>(schemaTree, schemaMap); if (expr_ = schemaMatch_->run(expr_); std::dynamic_pointer_cast<EmptyExpr>(expr_)) { VLOG(2) << "No matching schemas for query '" << query_ << "'"; - return ErrorCode::SchemaNotFound; + archiveReader_->close(); + return ErrorCode::SchemaNotFound; } @@ } catch (TraceableException& e) { VLOG(2) << e.what(); - return ErrorCode::InternalError; + archiveReader_->close(); + return ErrorCode::InternalError; } @@ if (matchedSchemas_.empty()) { - return ErrorCode::SchemaNotFound; + archiveReader_->close(); + return ErrorCode::SchemaNotFound; } @@ if (EvaluatedValue::False == timestamp_index.run(expr_)) { VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; - return ErrorCode::InvalidTimestampRange; + archiveReader_->close(); + return ErrorCode::InvalidTimestampRange; }Alternatively, add an RAII scope guard (e.g., folly::ScopeGuard) right after open() and release it only after currentSplitLoaded_ is set. I can provide that patch if preferred.
Also applies to: 123-127, 173-176, 189-191, 194-197
136-139
: Spellcheck log message (“Cannot” vs “Can not”)Minor polish for operator logs.
- VLOG(2) << "Can not tokenize invalid column: '" << column.name << "'"; + VLOG(2) << "Cannot tokenize invalid column: '" << column.name << "'";
♻️ Duplicate comments (12)
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)
24-24
: Include is necessary for ColumnType enum values.Based on previous learnings, this include is required because the implementation uses specific ColumnType enum values (like
ColumnType::Integer
,ColumnType::Float
) in switch statements. While the header has a forward declaration, the full definition from BaseClpCursor.h is needed here.velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)
14-20
: Consider using velox_add_library consistently with project conventions.The target name
clp-s-archive-search
uses a hyphenated format instead of the underscore convention typically used in Velox. Consider usingclp_s_archive_search
for consistency with other Velox library targets.Apply this diff to align with Velox naming conventions:
-velox_add_library( - clp-s-archive-search +velox_add_library( + clp_s_archive_search STATIC ClpArchiveCursor.cpp ClpArchiveCursor.h ClpQueryRunner.cpp ClpQueryRunner.h)velox/connectors/clp/search_lib/ClpVectorLoader.h (2)
42-46
: Consider const-correctness for filteredRowIndices parameter.Since the vector is only read from and never modified, consider making the parameter const to enforce read-only semantics and prevent accidental mutations.
Apply this diff to improve const-correctness:
ClpVectorLoader( clp_s::BaseColumnReader* columnReader, ColumnType nodeType, - std::shared_ptr<std::vector<uint64_t>> filteredRowIndices, + std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices, ClpConnectorSplit::SplitType splitType);
66-66
: Mark unused splitType_ member appropriately.The splitType_ member is currently unused in the implementation. Consider adding a comment or [[maybe_unused]] attribute to clarify its intended future use.
Apply one of these approaches:
- ClpConnectorSplit::SplitType splitType_; + [[maybe_unused]] ClpConnectorSplit::SplitType splitType_;Or add a comment explaining its planned use:
- ClpConnectorSplit::SplitType splitType_; + ClpConnectorSplit::SplitType splitType_; // TODO: Used for marshalling strategy selection between archive and IRvelox/connectors/clp/search_lib/BaseClpCursor.cpp (3)
33-39
: Return an ErrorCode from executeQuery so callers can short-circuit on failureRight now, executeQuery sets errorCode_ but offers no programmatic signal to the caller. Returning ErrorCode lets ClpDataSource (and future IR paths) fail fast without entering fetchNext on invalid/unsatisfiable queries. This also simplifies testing.
Apply this local change:
-void BaseClpCursor::executeQuery( +ErrorCode BaseClpCursor::executeQuery( const std::string& query, const std::vector<Field>& outputColumns) { query_ = query; outputColumns_ = outputColumns; errorCode_ = preprocessQuery(); + return errorCode_; }And in BaseClpCursor.h, update the declaration accordingly (see my comment there for a full header diff), and consider checking the return in ClpDataSource::addSplit before proceeding.
54-56
: If-initializer idiom is valid; consider readability tweakThe “if (expr_ = pass.run(expr_); dynamic_pointer_cast(expr_))” pattern is correct (C++17), but slightly dense. Consider either a short comment explaining the combined assign+check intent or splitting into two statements for clarity.
Also applies to: 61-63, 69-71
49-52
: Deduplicate the repeated “logically false” branchesThe four identical “logically false” branches can be DRY’d to a small helper to keep this routine compact.
Example refactor (local to this function):
ErrorCode BaseClpCursor::preprocessQuery() { auto queryStream = std::istringstream(query_); expr_ = kql::parse_kql_expression(queryStream); if (nullptr == expr_) { VLOG(2) << "Failed to parse query '" << query_ << "'"; return ErrorCode::InvalidQuerySyntax; } - - if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { - VLOG(2) << "Query '" << query_ << "' is logically false"; - return ErrorCode::LogicalError; - } + auto isFalse = [this]() { + return std::dynamic_pointer_cast<EmptyExpr>(expr_) != nullptr; + }; + auto logFalseAnd = [this]() { + VLOG(2) << "Query '" << query_ << "' is logically false"; + return ErrorCode::LogicalError; + }; + if (isFalse()) { + return logFalseAnd(); + } OrOfAndForm standardizePass; - if (expr_ = standardizePass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { - VLOG(2) << "Query '" << query_ << "' is logically false"; - return ErrorCode::LogicalError; - } + if (expr_ = standardizePass.run(expr_); isFalse()) { + return logFalseAnd(); + } NarrowTypes narrowPass; - if (expr_ = narrowPass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { - VLOG(2) << "Query '" << query_ << "' is logically false"; - return ErrorCode::LogicalError; - } + if (expr_ = narrowPass.run(expr_); isFalse()) { + return logFalseAnd(); + } ConvertToExists convertPass; - if (expr_ = convertPass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { - VLOG(2) << "Query '" << query_ << "' is logically false"; - return ErrorCode::LogicalError; - } + if (expr_ = convertPass.run(expr_); isFalse()) { + return logFalseAnd(); + } return ErrorCode::Success; }Also applies to: 56-59, 63-66, 70-73
velox/connectors/clp/ClpDataSource.cpp (2)
103-107
: Defensive: null-check the split cast before dereferenceclpSplit is dereferenced unconditionally; if a non-CLP split slips through, this will crash.
void ClpDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) { auto clpSplit = std::dynamic_pointer_cast<ClpConnectorSplit>(split); + VELOX_CHECK_NOT_NULL(clpSplit, "Expected ClpConnectorSplit"); std::string splitPath = clpSplit->path_;
110-113
: Guard S3 path construction behind a non-null auth providerAvoid a null deref when storageType_ is kS3 and the provider isn’t configured.
} else if (ClpConfig::StorageType::kS3 == storageType_) { inputSource = clp_s::InputSource::Network; + VELOX_CHECK_NOT_NULL( + s3AuthProvider_, "S3 storage requires a configured S3 auth provider"); splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);velox/connectors/clp/search_lib/BaseClpCursor.h (3)
85-87
: Return an ErrorCode from executeQuery (and mark nodiscard)Enable call sites to branch immediately on parse/validation errors; nodiscard nudges correct handling.
- void executeQuery( + [[nodiscard]] ErrorCode executeQuery( const std::string& query, const std::vector<Field>& outputColumns);
95-98
: Prefer a non-owning reference for the output collectionPassing filteredRowIndices as shared_ptr suggests shared ownership where a simple out-parameter suffices. Using std::vector<uint64_t>& reduces heap traffic and clarifies ownership. This change ripples to ClpArchiveCursor and ClpQueryRunner.
- virtual uint64_t fetchNext( - uint64_t numRows, - const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices) = 0; + virtual uint64_t fetchNext( + uint64_t numRows, + std::vector<uint64_t>& filteredRowIndices) = 0;Follow-ups: update ClpArchiveCursor::{fetchNext,getProjectedColumns} signatures and their definitions, ClpQueryRunner::fetchNext, and ClpDataSource::next call site.
106-111
: Split type plumbing: add a protected setter or ctor paramDerived cursors (e.g., IR) will need to set splitType_. Add a protected setter or overload the base ctor to take SplitType so derived classes don’t have to mutate internals directly. Also, mark the accessor nodiscard.
- ClpConnectorSplit::SplitType getSplitType() const { + [[nodiscard]] ClpConnectorSplit::SplitType getSplitType() const { return splitType_; } + + protected: + void setSplitType(ClpConnectorSplit::SplitType t) { splitType_ = t; }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
velox/connectors/clp/ClpDataSource.cpp
(3 hunks)velox/connectors/clp/search_lib/BaseClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/BaseClpCursor.h
(3 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.cpp
(2 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.h
(4 hunks)velox/connectors/clp/search_lib/archive/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
(3 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
(1 hunks)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp
(1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/search_lib/ClpVectorLoader.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/search_lib/ClpVectorLoader.cpp
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/search_lib/ClpVectorLoader.h
velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-03-10T05:55:18.722Z
Learnt from: anlowee
PR: y-scope/velox#1
File: CMakeLists.txt:458-483
Timestamp: 2025-03-10T05:55:18.722Z
Learning: When contributing to the Velox project, prefer adding new code on top of existing scripts rather than restructuring the original code organization, especially for dependency management.
Applied to files:
velox/connectors/clp/search_lib/archive/CMakeLists.txt
📚 Learning: 2025-07-11T15:21:09.658Z
Learnt from: gibber9809
PR: y-scope/velox#15
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:22-22
Timestamp: 2025-07-11T15:21:09.658Z
Learning: When analyzing include dependencies, pay special attention to types used in template parameters and enum values like `clp_s::NodeType`, as these may not be immediately obvious in a surface-level review but are essential for compilation.
Applied to files:
velox/connectors/clp/search_lib/ClpVectorLoader.cpp
🧬 Code Graph Analysis (6)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (5)
velox/connectors/clp/search_lib/ClpVectorLoader.h (2)
clp_s
(29-31)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)ClpQueryRunner
(43-88)vector
(76-78)velox/connectors/clp/search_lib/BaseClpCursor.h (3)
clp_s
(27-29)clp_s
(31-33)BaseClpCursor
(68-135)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (8)
ClpArchiveCursor
(33-37)ClpArchiveCursor
(39-43)fetchNext
(45-88)fetchNext
(45-47)getProjectedColumns
(90-97)getProjectedColumns
(91-91)loadSplit
(99-212)loadSplit
(99-99)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(61-79)fetchNext
(61-63)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
clp_s
(27-29)clp_s
(31-33)
velox/connectors/clp/ClpDataSource.cpp (1)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
splitPath
(67-73)splitPath
(67-70)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (5)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
clp_s
(21-23)clp_s
(25-28)velox/connectors/clp/search_lib/ClpVectorLoader.h (1)
clp_s
(29-31)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (3)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)velox/connectors/clp/search_lib/BaseClpCursor.h (2)
clp_s
(27-29)clp_s
(31-33)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(61-79)fetchNext
(61-63)
velox/connectors/clp/search_lib/ClpVectorLoader.h (1)
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
ColumnType
(48-137)
velox/connectors/clp/search_lib/BaseClpCursor.h (5)
velox/connectors/clp/ClpConnectorSplit.h (1)
SplitType
(42-47)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(61-79)fetchNext
(61-63)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
fetchNext
(45-88)fetchNext
(45-47)getProjectedColumns
(90-97)getProjectedColumns
(91-91)loadSplit
(99-212)loadSplit
(99-99)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
vector
(76-78)clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
preprocessQuery
(41-76)preprocessQuery
(41-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (11)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)
17-17
: LGTM! Header path correctly updated for the archive reorganization.The include path change from the root search_lib to archive-specific directory aligns with the PR's refactoring to move archive-related functionality into a dedicated archive subdirectory.
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)
125-133
: Constructor correctly wired with splitType parameter.The new splitType parameter is properly integrated into the constructor signature and member initialization. The implementation correctly stores the split type for future use in marshalling decisions between archive and IR formats.
velox/connectors/clp/search_lib/ClpVectorLoader.h (3)
19-19
: Include added for simdjson support.The simdjson.h include is correctly added to support the thread_local arrayParser_ member used for JSON array processing in the implementation.
23-23
: Include path correctly uses velox prefix.The ClpConnectorSplit.h include path correctly uses the full velox/ prefix, maintaining consistency with the project's include conventions.
35-35
: Forward declaration properly added for ColumnType.The forward declaration enables the header to use ColumnType as a parameter type without including the full definition, reducing compilation dependencies.
velox/connectors/clp/search_lib/CMakeLists.txt (3)
17-18
: Source file transition correctly implemented.The replacement of ClpCursor.* with BaseClpCursor.* files correctly reflects the interface extraction described in the PR objectives.
26-26
: Archive subdirectory properly integrated.The add_subdirectory(archive) correctly incorporates the new archive-specific functionality into the build system.
28-28
: Linking simplified with archive target.The link command correctly simplifies to use the new clp-s-archive-search target instead of individual library dependencies, which improves maintainability.
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
42-47
: Confirm parse_kql_expression never throws; otherwise wrap with try/catchIf kql::parse_kql_expression can throw (e.g., on malformed input), preprocessQuery would allow exceptions to propagate. If it’s documented to return nullptr on failure and not throw, we’re fine. Otherwise, add a try/catch here mirroring the logging style below and return InvalidQuerySyntax.
Would you verify the exception behaviour of parse_kql_expression in clp_s? If it may throw, I can propose a concrete try/catch patch.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)
193-197
: Is the second timestamp-index check necessary?We run EvaluateTimestampIndex before and after schema matching. If the transform cannot introduce new timestamp contradictions, the second pass may be redundant work on large dictionaries.
Would you confirm if schemaMatch_->run(expr_) can change timestamp predicates? If not, we can drop the second check to save cycles.
33-37
: Constructor path lifetime: safe as implemented (ack)Per the retrieved learning, BaseClpCursor copies std::string_view into an owning std::string member. No dangling risk here.
Using std::string_view in the ctor and copying into splitPath_ strikes the right balance between API ergonomics and safety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)
60-61
: Avoid shadowing Velox’s VectorPtr alias in template parameter.Rename the template parameter to reduce confusion with facebook::velox::VectorPtr.
Apply:
-template <typename T, typename VectorPtr> -void populateData(RowSet rows, VectorPtr vector); +template <typename T, typename VecPtr> +void populateData(RowSet rows, VecPtr vector);
♻️ Duplicate comments (10)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)
17-17
: Include path correction LGTM; aligns with “use velox path”Matches prior guidance and the new archive layout.
velox/connectors/clp/ClpDataSource.h (1)
23-23
: Keep BaseClpCursor.h included hereRequired because this header stores
std::vector<search_lib::Field>
; forward declaration is insufficient. Agree with prior resolution.velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)
34-35
: Ensure .cpp includes BaseClpCursor.h for ColumnType.ColumnType is only forward-declared here. The .cpp must include BaseClpCursor.h to use its enum values in switches.
velox/connectors/clp/ClpDataSource.cpp (1)
106-115
: Null-check the cast and guard S3 auth provider.Avoid potential nullptr deref on non-CLP splits and on missing S3 auth provider.
void ClpDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) { auto clpSplit = std::dynamic_pointer_cast<ClpConnectorSplit>(split); + VELOX_CHECK_NOT_NULL(clpSplit, "Expected ClpConnectorSplit"); std::string splitPath = clpSplit->path_; clp_s::InputSource inputSource; if (ClpConfig::StorageType::kFs == storageType_) { inputSource = clp_s::InputSource::Filesystem; } else if (ClpConfig::StorageType::kS3 == storageType_) { inputSource = clp_s::InputSource::Network; + VELOX_CHECK_NOT_NULL( + s3AuthProvider_, "S3 storage requires a configured S3 auth provider"); splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_); } else { VELOX_UNREACHABLE(); }velox/connectors/clp/search_lib/BaseClpCursor.h (1)
70-76
: Make base cursor non-copyable/non-movable and dtor noexcept.Prevents accidental aliasing of IO state; clarifies exception guarantees.
explicit BaseClpCursor( clp_s::InputSource inputSource, std::string_view splitPath) : errorCode_(ErrorCode::QueryNotInitialized), inputSource_(inputSource), splitPath_(std::string(splitPath)) {} - virtual ~BaseClpCursor() = default; + virtual ~BaseClpCursor() noexcept = default; + + BaseClpCursor(const BaseClpCursor&) = delete; + BaseClpCursor& operator=(const BaseClpCursor&) = delete; + BaseClpCursor(BaseClpCursor&&) = delete; + BaseClpCursor& operator=(BaseClpCursor&&) = delete;velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
38-41
: Disable copy/move; cursor manages shared IO state.Prevents accidental aliasing/double-closing of
ArchiveReader
.explicit ClpArchiveCursor( clp_s::InputSource inputSource, std::string_view splitPath); ~ClpArchiveCursor() override; + + ClpArchiveCursor(const ClpArchiveCursor&) = delete; + ClpArchiveCursor& operator=(const ClpArchiveCursor&) = delete; + ClpArchiveCursor(ClpArchiveCursor&&) = delete; + ClpArchiveCursor& operator=(ClpArchiveCursor&&) = delete;
68-69
: Hint to callers: mark accessor nodiscard.The projected-columns ref should not be ignored.
- const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const; + [[nodiscard]] const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)
34-38
: Constructor properly handles string lifetime.Based on the previous learning about
BaseClpCursor
, the base class constructor safely converts thestd::string_view
to an ownedstd::string
member, so there are no lifetime concerns here.
120-122
: Fix C++20 designated initializers for C++17 compatibility.The project uses C++17 standard, but this code uses C++20 designated initializers which will fail to compile.
Apply this fix for C++17 compatibility:
- auto networkAuthOption = inputSource_ == InputSource::Filesystem - ? NetworkAuthOption{.method = AuthMethod::None} - : NetworkAuthOption{.method = AuthMethod::S3PresignedUrlV4}; + NetworkAuthOption networkAuthOption{}; + networkAuthOption.method = (inputSource_ == InputSource::Filesystem) + ? AuthMethod::None + : AuthMethod::S3PresignedUrlV4;
40-44
: Resource management issue: archive may leak on error paths.The destructor only closes the archive if
currentSplitLoaded_
is true, but inloadSplit()
, the archive can be opened (Line 125-126) and then return early on error before settingcurrentSplitLoaded_
to true (Line 227). This leaves the archive open without proper cleanup.Consider either:
- Set
currentSplitLoaded_ = true
immediately after successfulopen()
call- Add RAII wrapper or ensure cleanup on all error paths
- Have
ArchiveReader
provide anisOpen()
API as suggested in the previous reviewApply this fix to ensure proper cleanup:
try { archiveReader_->open( get_path_object_for_raw_path(splitPath_), networkAuthOption); + currentSplitLoaded_ = true; // Mark as loaded immediately after successful open } catch (std::exception& e) { VLOG(2) << "Failed to open archive file: " << e.what(); return ErrorCode::InternalError; }
Then move the stream opening earlier:
archiveReader_->read_array_dictionary(); + archiveReader_->open_packed_streams(); currentSchemaIndex_ = 0; currentSchemaTableLoaded_ = false; - - archiveReader_->open_packed_streams(); - currentSplitLoaded_ = true;Also applies to: 226-227
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (11)
CMake/resolve_dependency_modules/clp.cmake
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(3 hunks)velox/connectors/clp/ClpDataSource.h
(2 hunks)velox/connectors/clp/search_lib/BaseClpCursor.h
(3 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/archive/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
(4 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
(8 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h
(2 hunks)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp
(1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
📚 Learning: 2025-03-10T05:55:18.722Z
Learnt from: anlowee
PR: y-scope/velox#1
File: CMakeLists.txt:458-483
Timestamp: 2025-03-10T05:55:18.722Z
Learning: When contributing to the Velox project, prefer adding new code on top of existing scripts rather than restructuring the original code organization, especially for dependency management.
Applied to files:
velox/connectors/clp/search_lib/archive/CMakeLists.txt
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
velox/connectors/clp/ClpDataSource.h
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
velox/connectors/clp/ClpDataSource.h
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
velox/connectors/clp/search_lib/BaseClpCursor.h
📚 Learning: 2025-08-20T17:50:58.584Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/ClpDataSource.h:23-23
Timestamp: 2025-08-20T17:50:58.584Z
Learning: In C++ headers, you cannot forward-declare types that are used as value members (like std::vector<Field> fields_) because the compiler needs the complete type definition to calculate the containing class size. Forward declarations only work for pointers and references to types.
Applied to files:
velox/connectors/clp/ClpDataSource.h
🧬 Code graph analysis (6)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)
ClpArchiveVectorLoader
(125-131)velox/vector/LazyVector.h (1)
VectorLoader
(177-210)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (5)
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
clp_s
(27-29)clp_s
(31-33)velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)
clp_s
(28-30)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)ClpQueryRunner
(43-88)vector
(76-78)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (14)
ClpArchiveCursor
(34-38)ClpArchiveCursor
(40-44)fetchNext
(46-90)fetchNext
(46-46)getNumFilteredRows
(92-94)getNumFilteredRows
(92-92)createVector
(96-108)createVector
(96-99)loadSplit
(119-232)loadSplit
(119-119)getProjectedColumns
(110-117)getProjectedColumns
(111-111)createVectorHelper
(234-264)createVectorHelper
(234-238)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(60-78)fetchNext
(60-62)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)
ClpArchiveVectorLoader
(39-67)
velox/connectors/clp/ClpDataSource.cpp (2)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
splitPath
(67-73)splitPath
(67-70)velox/connectors/clp/ClpDataSource.h (1)
completedRows_
(81-81)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
clp_s
(27-29)clp_s
(31-33)currentSplitLoaded_
(122-137)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (4)
clp_s
(21-23)clp_s
(25-28)readerIndex_
(65-65)getNumFilteredRows
(45-58)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)vector
(76-78)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(60-78)fetchNext
(60-62)
velox/connectors/clp/search_lib/BaseClpCursor.h (4)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
fetchNext
(46-90)fetchNext
(46-46)getNumFilteredRows
(92-94)getNumFilteredRows
(92-92)loadSplit
(119-232)loadSplit
(119-119)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(60-78)fetchNext
(60-62)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)
getNumFilteredRows
(45-58)clp_s
(21-23)clp_s
(25-28)velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
preprocessQuery
(41-76)preprocessQuery
(41-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (19)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (1)
17-17
: No stale includes remain – verified there are no remainingvelox/connectors/clp/search_lib/ClpQueryRunner.h
includes.velox/connectors/clp/ClpDataSource.h (1)
86-86
: Verified cursor construction and no createVector dependenciesClpDataSource.cpp correctly instantiates ClpArchiveCursor for archive splits (line 119), and no external calls to the removed
createVector
remain.velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)
14-22
: Target definition looks good.New archive subtarget and sources are wired up cleanly.
velox/connectors/clp/search_lib/CMakeLists.txt (3)
17-18
: Switch to BaseClpCursor sources looks correct.Good replacement of ClpCursor.* with BaseClpCursor.*.
24-24
: Subdirectory integration is fine.archive subproject is added before linking and scopes dependencies properly.
26-26
: Linkage validated; no stale ClpVectorLoader/ClpQueryRunner references remainvelox/connectors/clp/ClpDataSource.cpp (3)
24-25
: LGTM: include path normalization.Using project-absolute headers is consistent with the rest of the file.
117-123
: Archive routing looks correct; IR rejection is explicit.
VELOCX_UNSUPPORTED
for non-archive split types is appropriate pending IR work.
136-144
: Clarify completedRows_ accounting and use static_pointer_cast
- Confirm and document whether
completedRows_
is meant to track fetched rows (rowsScanned
) or output rows (rowsFiltered
) and update accordingly:- completedRows_ += rowsScanned; + completedRows_ += rowsFiltered; // or retain rowsScanned if tracking fetched rows- Replace the RTTI‐based cast with a compile-time cast:
- return std::dynamic_pointer_cast<RowVector>( + return std::static_pointer_cast<RowVector>( cursor_->createVector(pool_, outputType_, rowsFiltered));velox/connectors/clp/search_lib/BaseClpCursor.h (3)
19-22
: LGTM: added standard headers.Adding and <string_view> resolves transitive-dependency brittleness.
63-68
: Docstring update reads well.The split-centric phrasing (archive or IR) matches the new design.
24-26
: Remove unused dependency on ClpConnectorSplit.h.This header doesn’t reference
ClpConnectorSplit
; drop the include to decouple the interface.-#include "velox/connectors/clp/ClpConnectorSplit.h"
Likely an incorrect or invalid review comment.
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (4)
143-143
: Good improvement using bounds-checked access.The change from
operator[]
toat()
for accessingfilteredRowIndices_
provides runtime bounds checking, which helps catch index-out-of-bounds errors early and improves safety.Also applies to: 180-180
128-128
: LGTM! Constructor now properly accepts const reference.The change to
const std::shared_ptr<std::vector<uint64_t>>
is a good improvement that prevents unintended modifications to the shared pointer itself while still allowing the underlying vector to be modified through the shared pointer.
125-131
: Class migration from ClpVectorLoader to ClpArchiveVectorLoader completed successfully.All method implementations have been properly updated to use the new
ClpArchiveVectorLoader
namespace, including the constructor,populateData
,populateTimestampData
, andloadInternal
methods. The explicit template instantiations at the end of the file have also been correctly updated.Also applies to: 134-156, 159-201, 203-308, 311-338
244-244
: Add micro-benchmarks for bounds-checked access
No existing benchmarks coverClpArchiveVectorLoader
’s use ofat()
in the hot loop. Please add or run performance tests to confirm the bounds-check overhead is acceptable.velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)
46-90
: Clear and correct implementation of fetchNext.The method properly initializes state, handles error conditions, and iterates through matched schemas with appropriate loading and filtering logic.
92-108
: Well-designed vector creation with proper validation.The
createVector
method properly validates that projected columns match output columns before delegating to the helper function. Good defensive programming.
234-264
: Clever recursive ROW vector construction with LazyVector support.The
createVectorHelper
implementation elegantly handles nested ROW types through recursion and properly wraps leaf vectors withLazyVector
for deferred loading. The use ofreaderIndex_
to track position across recursive calls is well-designed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)
64-69
: Don’t poison the cursor with an error for a single schema-init failure.Setting errorCode_ here causes subsequent fetchNext() calls to short‑circuit even if other schemas are fine. Just skip the schema and continue; set an error only if all schemas fail.
Apply:
- errorCode_ = ErrorCode::DictionaryNotFound;
124-131
: Close the archive on all failure paths (RAII guard).loadSplit() opens the archive, then returns early on several errors without closing. Since the dtor closes only when currentSplitLoaded_ is true, handles can leak.
Apply:
@@ - try { - archiveReader_->open( - get_path_object_for_raw_path(splitPath_), networkAuthOption); + try { + archiveReader_->open( + get_path_object_for_raw_path(splitPath_), networkAuthOption); + auto closeOnError = true; + auto cleanup = folly::makeGuard([&] { + if (closeOnError) { + archiveReader_->close(); + } + }); @@ - return ErrorCode::InvalidTimestampRange; + return ErrorCode::InvalidTimestampRange; } @@ - return ErrorCode::SchemaNotFound; + return ErrorCode::SchemaNotFound; } @@ - return ErrorCode::InternalError; + return ErrorCode::InternalError; } @@ - return ErrorCode::SchemaNotFound; + return ErrorCode::SchemaNotFound; } @@ - return ErrorCode::InvalidTimestampRange; + return ErrorCode::InvalidTimestampRange; } @@ - archiveReader_->open_packed_streams(); - currentSplitLoaded_ = true; + archiveReader_->open_packed_streams(); + currentSplitLoaded_ = true; + cleanup.dismiss(); // success: keep archive openAnd add the include:
#include <glog/logging.h> +#include <folly/ScopeGuard.h>
Also applies to: 139-147, 159-160, 194-196, 209-211, 215-217, 219-231
143-146
: If-initializer assignment is valid; consider readability.Optional: split assignment and condition for clarity.
Apply:
- if (expr_ = schemaMatch_->run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + expr_ = schemaMatch_->run(expr_); + if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
213-217
: Naming consistency.Use camelCase consistently: rename timestamp_index → timestampIndex.
- EvaluateTimestampIndex timestamp_index(timestampDict); - if (EvaluatedValue::False == timestamp_index.run(expr_)) { + EvaluateTimestampIndex timestampIndex(timestampDict); + if (EvaluatedValue::False == timestampIndex.run(expr_)) {velox/connectors/clp/ClpDataSource.cpp (1)
124-129
: Propagate query preprocessing failure immediately.BaseClpCursor tracks an ErrorCode but there’s no check here; callers won’t see failures until later as “no rows”. Add a status accessor and check it.
auto pushDownQuery = clpSplit->kqlQuery_; if (pushDownQuery && !pushDownQuery->empty()) { cursor_->executeQuery(*pushDownQuery, fields_); } else { cursor_->executeQuery("*", fields_); } + + // Fail early if query preprocessing failed. + VELOX_CHECK_EQ( + search_lib::ErrorCode::Success, + cursor_->getErrorCode(), + "CLP query initialization failed (error code {}).", + static_cast<int>(cursor_->getErrorCode()));Add the accessor to BaseClpCursor (see separate comment).
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
84-87
: Expose execution status to callers.Add a trivial accessor so addSplit() can fail early on invalid queries.
void executeQuery( const std::string& query, const std::vector<Field>& outputColumns); + + [[nodiscard]] ErrorCode getErrorCode() const noexcept { + return errorCode_; + }
♻️ Duplicate comments (9)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)
119-127
: Avoid C++20-only designated initializers unless the whole build targets C++20.Either keep the target on C++17 and use a compatible init, or make C++20 a PUBLIC feature for all dependants. Safer is to stick to C++17 here.
Apply:
- auto networkAuthOption = inputSource_ == InputSource::Filesystem - ? NetworkAuthOption{.method = AuthMethod::None} - : NetworkAuthOption{.method = AuthMethod::S3PresignedUrlV4}; + NetworkAuthOption networkAuthOption{}; + networkAuthOption.method = (inputSource_ == InputSource::Filesystem) + ? AuthMethod::None + : AuthMethod::S3PresignedUrlV4;velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
34-40
: Return the status from executeQuery for easier short‑circuiting.Returning ErrorCode improves ergonomics at call sites.
-void BaseClpCursor::executeQuery( +ErrorCode BaseClpCursor::executeQuery( const std::string& query, const std::vector<Field>& outputColumns) { query_ = query; outputColumns_ = outputColumns; errorCode_ = preprocessQuery(); + return errorCode_; }Note: update the declaration and call sites accordingly.
velox/connectors/clp/ClpDataSource.cpp (2)
102-104
: Null-check the split cast before dereference.Avoid a crash if a non-CLP split slips through.
void ClpDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) { auto clpSplit = std::dynamic_pointer_cast<ClpConnectorSplit>(split); + VELOX_CHECK_NOT_NULL(clpSplit, "Expected ClpConnectorSplit");
109-112
: Guard against missing S3 auth provider in S3 mode.Fail fast with a clear message instead of NPE.
} else if (ClpConfig::StorageType::kS3 == storageType_) { inputSource = clp_s::InputSource::Network; + VELOX_CHECK_NOT_NULL( + s3AuthProvider_, "S3 storage requires a configured S3 auth provider"); splitPath = s3AuthProvider_->constructS3Url(clpSplit->path_);velox/connectors/clp/search_lib/BaseClpCursor.h (2)
70-76
: Make the base cursor non-copyable/non-movable and dtor noexcept.Prevents accidental aliasing of IO state; clarifies exception guarantees.
explicit BaseClpCursor( clp_s::InputSource inputSource, std::string_view splitPath) : errorCode_(ErrorCode::QueryNotInitialized), inputSource_(inputSource), splitPath_(std::string(splitPath)) {} - virtual ~BaseClpCursor() = default; + virtual ~BaseClpCursor() noexcept = default; + + BaseClpCursor(const BaseClpCursor&) = delete; + BaseClpCursor& operator=(const BaseClpCursor&) = delete; + BaseClpCursor(BaseClpCursor&&) = delete; + BaseClpCursor& operator=(BaseClpCursor&&) = delete;
95-100
: Mark row-count accessor nodiscard.Helps catch ignored results at call sites.
- virtual size_t getNumFilteredRows() const = 0; + [[nodiscard]] virtual size_t getNumFilteredRows() const = 0;velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)
37-41
: Disable copy/move; optionally make dtor noexcept.Avoid shared ArchiveReader/query state aliasing.
class ClpArchiveCursor final : public BaseClpCursor { public: explicit ClpArchiveCursor( clp_s::InputSource inputSource, std::string_view splitPath); - ~ClpArchiveCursor() override; + ~ClpArchiveCursor() noexcept override; + + ClpArchiveCursor(const ClpArchiveCursor&) = delete; + ClpArchiveCursor& operator=(const ClpArchiveCursor&) = delete; + ClpArchiveCursor(ClpArchiveCursor&&) = delete; + ClpArchiveCursor& operator=(ClpArchiveCursor&&) = delete;
68-74
: Mark projected-columns accessor nodiscard.It’s almost always consumed; this catches accidental ignores.
- const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const; + [[nodiscard]] const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;
60-61
: Avoid shared_ptr for an internal scratch vector.Ownership is local; pass a reference to reduce heap/refcount overhead. This change ripples to ClpQueryRunner.
- std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_ = - std::make_shared<std::vector<uint64_t>>(); + std::vector<uint64_t> filteredRowIndices_;Update signatures:
- ClpQueryRunner::fetchNext(uint64_t, std::vector<uint64_t>&)
- ClpArchiveVectorLoader ctor to take std::vector<uint64_t>& or const&
- All call sites to pass the vector by reference
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
velox/connectors/clp/ClpDataSource.cpp
(3 hunks)velox/connectors/clp/search_lib/BaseClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/BaseClpCursor.h
(3 hunks)velox/connectors/clp/search_lib/archive/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
(4 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
(1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
📚 Learning: 2025-03-10T05:55:18.722Z
Learnt from: anlowee
PR: y-scope/velox#1
File: CMakeLists.txt:458-483
Timestamp: 2025-03-10T05:55:18.722Z
Learning: When contributing to the Velox project, prefer adding new code on top of existing scripts rather than restructuring the original code organization, especially for dependency management.
Applied to files:
velox/connectors/clp/search_lib/archive/CMakeLists.txt
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
Applied to files:
velox/connectors/clp/search_lib/BaseClpCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/ClpDataSource.cpp
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
Applied to files:
velox/connectors/clp/search_lib/BaseClpCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/ClpDataSource.cpp
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Applied to files:
velox/connectors/clp/search_lib/BaseClpCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/ClpDataSource.cpp
🧬 Code graph analysis (5)
velox/connectors/clp/search_lib/BaseClpCursor.h (4)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (6)
fetchNext
(46-90)fetchNext
(46-46)getNumFilteredRows
(92-94)getNumFilteredRows
(92-92)loadSplit
(119-232)loadSplit
(119-119)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(60-78)fetchNext
(60-62)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (3)
getNumFilteredRows
(45-58)clp_s
(21-23)clp_s
(25-28)velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
preprocessQuery
(42-77)preprocessQuery
(42-42)
velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
velox/connectors/clp/search_lib/BaseClpCursor.h (2)
clp_s
(27-29)clp_s
(31-33)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
clp_s
(21-23)clp_s
(25-28)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (4)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
clp_s
(27-29)clp_s
(31-33)currentSplitLoaded_
(122-135)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (4)
clp_s
(21-23)clp_s
(25-28)readerIndex_
(65-65)getNumFilteredRows
(45-58)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (4)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)vector
(76-78)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(60-78)fetchNext
(60-62)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (5)
velox/connectors/clp/search_lib/BaseClpCursor.h (3)
clp_s
(27-29)clp_s
(31-33)BaseClpCursor
(68-137)velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)
clp_s
(28-30)velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (6)
clp_s
(21-26)clp_s
(28-31)clp_s
(33-35)ClpQueryRunner
(43-88)vector
(76-78)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (14)
ClpArchiveCursor
(34-38)ClpArchiveCursor
(40-44)fetchNext
(46-90)fetchNext
(46-46)getNumFilteredRows
(92-94)getNumFilteredRows
(92-92)createVector
(96-108)createVector
(96-99)loadSplit
(119-232)loadSplit
(119-119)getProjectedColumns
(110-117)getProjectedColumns
(111-111)createVectorHelper
(234-264)createVectorHelper
(234-238)velox/connectors/clp/search_lib/archive/ClpQueryRunner.cpp (2)
fetchNext
(60-78)fetchNext
(60-62)
velox/connectors/clp/ClpDataSource.cpp (2)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
splitPath
(67-73)splitPath
(67-70)velox/connectors/clp/ClpDataSource.h (1)
completedRows_
(81-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (8)
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (3)
19-26
: Includes look correct and canonical.Using fully-qualified include paths and required headers is consistent with neighbouring files.
34-38
: Constructor string_view is safe via BaseClpCursor’s owning copy.Forwarding std::string_view is fine since BaseClpCursor stores an owning std::string.
80-83
: Confirm fetchNext()’s contract (rows scanned vs. rows produced).You return rowsScanned but the caller consumes getNumFilteredRows() for materialization. Verify upstream (ClpDataSource) expects this; otherwise return the produced count to avoid confusion.
velox/connectors/clp/search_lib/archive/CMakeLists.txt (1)
24-27
: Link interface is correct.Making velox_vector PUBLIC ensures headers propagate to dependants.
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
17-26
: Includes are tight and correct.added; unused headers removed earlier. Looks good.
velox/connectors/clp/ClpDataSource.cpp (2)
23-24
: Includes normalization looks good.Paths are now consistent with the project’s absolute include style.
71-99
: ARRAY handling: confirm only arrays-of-strings are possible.Loader supports array-of-strings; if other element types can appear, we need element-aware mapping and loaders.
Would you like a follow-up patch to enforce/validate element types at planning time?
velox/connectors/clp/search_lib/BaseClpCursor.h (1)
19-26
: Includes and transitive dependency usage look correct.<retrieved_learnings> about InputSource via InputConfig.hpp and string_view→string copy pattern were applied appropriately.
|
||
#include <glog/logging.h> | ||
|
||
#include "velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we group the headers?
int32_t currentSchemaId_{-1}; | ||
size_t currentSchemaIndex_{0}; | ||
bool currentSchemaTableLoaded_{false}; | ||
std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_ = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not use =
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably place it in the constructor
/// column reader. It supports various column types including integers, floats, | ||
/// booleans, strings, and arrays of strings. | ||
class ClpVectorLoader : public VectorLoader { | ||
class ClpArchiveVectorLoader : public VectorLoader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add archive to docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (2)
264-268
: Use-after-free risk: to_json_string() temporary backing store.simdjson::to_json_string(...) returns a temporary whose storage may not outlive the statement; storing string_view to it is unsafe. Use raw_json() (points into the padded input) or materialise to std::string before set().
- std::vector<std::string_view> arrayElements; - for (auto arrayElement : array) { - arrayElements.emplace_back( - simdjson::to_json_string(arrayElement).value()); - } + std::vector<std::string_view> arrayElements; + arrayElements.reserve(8); + for (auto value : array) { + // raw_json() views the original padded input; valid until 'padded' goes out of scope. + simdjson::raw_json_string r = value.raw_json(); + arrayElements.emplace_back(std::string_view(r)); + }If you prefer fully normalised JSON strings irrespective of input layout, copy:
- arrayElements.emplace_back(std::string_view(r)); + std::string s = std::string(std::string_view(r)); + arrayElements.emplace_back(*elements->stringAllocator()->copy(s));
202-207
: Handle ValueHook in loadInternal
Thehook
parameter is currently unused in ClpArchiveVectorLoader::loadInternal, violating the VectorLoader contract that non-null hooks must be called for each loaded value. Either implement hook support or add aVELOX_CHECK_NULL(hook);
to enforce only null hooks. (velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp:205–210)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (1)
80-86
: Potential issue with schema advancement logic.The current logic advances to the next schema whenever
filteredRowIndices_
is empty, but this could skip remaining data in a schema if a batch returns no matches. Schema advancement should only occur whenrowsScanned == 0
(indicating end-of-schema).Consider this fix to properly handle schema advancement:
auto rowsScanned = queryRunner_->fetchNext(numRows, filteredRowIndices_); if (false == filteredRowIndices_->empty()) { return rowsScanned; } - currentSchemaIndex_ += 1; - currentSchemaTableLoaded_ = false; + // Only advance schema if we've reached its end (no more rows to scan) + if (rowsScanned == 0) { + currentSchemaIndex_ += 1; + currentSchemaTableLoaded_ = false; + }
♻️ Duplicate comments (11)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (3)
19-19
: Hide simdjson from the public header; move TLS parser to the .cpp.Avoid leaking a heavy dependency and inline TLS initialisation in a public header. Forward-declare the parser type here and define/initialise the TLS in the .cpp.
-#include <simdjson.h> +// Avoid exposing simdjson in the public header. +namespace simdjson { namespace ondemand { class parser; } } @@ - inline static thread_local std::unique_ptr<simdjson::ondemand::parser> - arrayParser_ = std::make_unique<simdjson::ondemand::parser>(); + // Defined and initialised in ClpArchiveVectorLoader.cpp + static thread_local std::unique_ptr<simdjson::ondemand::parser> arrayParser_;And in ClpArchiveVectorLoader.cpp add:
+#include <simdjson.h> @@ +thread_local std::unique_ptr<simdjson::ondemand::parser> + facebook::velox::connector::clp::search_lib::ClpArchiveVectorLoader::arrayParser_ = + std::make_unique<simdjson::ondemand::parser>();Also applies to: 46-48
40-44
: Fix filteredRowIndices move semantics; make pointee const.The ctor takes a const shared_ptr and the member is non-const, so std::move in the .cpp won’t move. Pass by value (non-const) and store as shared_ptr<const vector<uint64_t>> to document immutability and enable efficient moves.
- ClpArchiveVectorLoader( - clp_s::BaseColumnReader* columnReader, - ColumnType nodeType, - const std::shared_ptr<std::vector<uint64_t>> filteredRowIndices); + ClpArchiveVectorLoader( + clp_s::BaseColumnReader* columnReader, + ColumnType nodeType, + std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices); @@ - std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_; + std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices_;Also applies to: 49-52
49-50
: Clarify ownership/lifetime of columnReader_.Raw pointer suggests non-ownership. Add a brief comment or switch to gsl::not_null<BaseColumnReader*>.
- clp_s::BaseColumnReader* columnReader_; + // Non-owning. Must remain valid for this object's lifetime. + clp_s::BaseColumnReader* columnReader_;velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (2)
248-255
: Propagate simdjson include and TLS definition into the .cpp.Once the header stops including simdjson and the TLS is only declared there, this TU must include simdjson.h and define arrayParser_.
+#include <simdjson.h> @@ +thread_local std::unique_ptr<simdjson::ondemand::parser> + facebook::velox::connector::clp::search_lib::ClpArchiveVectorLoader::arrayParser_ = + std::make_unique<simdjson::ondemand::parser>();
124-131
: Enable actual move of filteredRowIndices and enforce const pointee.With a const shared_ptr parameter, std::move performs a copy. Align the signature with the header change and move into the member.
-ClpArchiveVectorLoader::ClpArchiveVectorLoader( - clp_s::BaseColumnReader* columnReader, - ColumnType nodeType, - const std::shared_ptr<std::vector<uint64_t>> filteredRowIndices) +ClpArchiveVectorLoader::ClpArchiveVectorLoader( + clp_s::BaseColumnReader* columnReader, + ColumnType nodeType, + std::shared_ptr<const std::vector<uint64_t>> filteredRowIndices) : columnReader_(columnReader), nodeType_(nodeType), filteredRowIndices_(std::move(filteredRowIndices)) {}velox/connectors/clp/search_lib/BaseClpCursor.cpp (2)
33-39
: Return ErrorCode from executeQuery for better ergonomics.Callers can short-circuit on errors instead of probing state.
-void BaseClpCursor::executeQuery( +ErrorCode BaseClpCursor::executeQuery( const std::string& query, const std::vector<Field>& outputColumns) { query_ = query; outputColumns_ = outputColumns; errorCode_ = preprocessQuery(); + return errorCode_; }
54-56
: DRY the repeated “standardise + logically false” checks.Factor a tiny helper/lambda to apply a pass and test for EmptyExpr.
- OrOfAndForm standardizePass; - if (expr_ = standardizePass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + auto applyPass = [&](auto& pass) -> bool { + expr_ = pass.run(expr_); + return static_cast<bool>(std::dynamic_pointer_cast<EmptyExpr>(expr_)); + }; + OrOfAndForm standardizePass; + if (applyPass(standardizePass)) { VLOG(2) << "Query '" << query_ << "' is logically false"; return ErrorCode::LogicalError; } - NarrowTypes narrowPass; - if (expr_ = narrowPass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + NarrowTypes narrowPass; + if (applyPass(narrowPass)) { VLOG(2) << "Query '" << query_ << "' is logically false"; return ErrorCode::LogicalError; } - ConvertToExists convertPass; - if (expr_ = convertPass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + ConvertToExists convertPass; + if (applyPass(convertPass)) { VLOG(2) << "Query '" << query_ << "' is logically false"; return ErrorCode::LogicalError; }Also applies to: 61-63, 68-70
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (2)
67-67
: Consider markinggetProjectedColumns()
with[[nodiscard]]
.Since this accessor returns a reference that callers typically consume immediately, adding
[[nodiscard]]
would help catch accidental ignores at compile time.- const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const; + [[nodiscard]] const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const;
36-41
: Make the cursor non-copyable and non-movable to prevent resource management issues.Since the cursor manages shared resources like
archiveReader_
andfilteredRowIndices_
, allowing copy or move operations could lead to aliasing issues and potential double-close hazards.Add these declarations after the destructor:
~ClpArchiveCursor() override; + ClpArchiveCursor(const ClpArchiveCursor&) = delete; + ClpArchiveCursor& operator=(const ClpArchiveCursor&) = delete; + ClpArchiveCursor(ClpArchiveCursor&&) = delete; + ClpArchiveCursor& operator=(ClpArchiveCursor&&) = delete; + uint64_t fetchNext(uint64_t numRows) override;velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)
120-122
: Fix C++17 compatibility issue with designated initializers.Designated initializers require C++20, but Velox is configured for C++17 (
CMAKE_CXX_STANDARD 17
). This will cause a compilation error.Apply this fix to make it C++17-compatible:
- auto networkAuthOption = inputSource_ == InputSource::Filesystem - ? NetworkAuthOption{.method = AuthMethod::None} - : NetworkAuthOption{.method = AuthMethod::S3PresignedUrlV4}; + NetworkAuthOption networkAuthOption{}; + networkAuthOption.method = (inputSource_ == InputSource::Filesystem) + ? AuthMethod::None + : AuthMethod::S3PresignedUrlV4;
249-250
: Remove unnecessary null allocation for LazyVector.Pre-allocating null bits for the base vector is wasteful since LazyVector will handle materialization. This unnecessarily allocates memory that may not be used.
auto vector = BaseVector::create(vectorType, vectorSize, pool); - vector->setNulls(allocateNulls(vectorSize, pool, bits::kNull));
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
velox/connectors/clp/search_lib/BaseClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
(4 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
(8 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h
(2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
📚 Learning: 2025-08-20T18:00:51.241Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h:36-39
Timestamp: 2025-08-20T18:00:51.241Z
Learning: In ClpArchiveCursor.h, the clp_s::InputSource type is available through transitive inclusion via BaseClpCursor.h, which includes "clp_s/InputConfig.hpp" where InputSource is defined. The constructor parameter type is properly accessible without needing direct inclusion.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
📚 Learning: 2025-08-20T18:13:32.442Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:24-24
Timestamp: 2025-08-20T18:13:32.442Z
Learning: In ClpVectorLoader.cpp, the include of BaseClpCursor.h is necessary because the implementation uses ColumnType enum values (like ColumnType::Integer, ColumnType::Float) in a switch statement. While ClpVectorLoader.h has a forward declaration of ColumnType, the full definition from BaseClpCursor.h is required in the implementation file to access the enum values.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h
📚 Learning: 2025-08-20T16:58:56.256Z
Learnt from: anlowee
PR: y-scope/velox#25
File: velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp:32-35
Timestamp: 2025-08-20T16:58:56.256Z
Learning: In the BaseClpCursor class, the constructor takes std::string_view splitPath but stores it as an owning std::string member (splitPath_) using explicit conversion: splitPath_(std::string(splitPath)). This pattern safely avoids dangling reference issues when accepting string_view parameters.
Applied to files:
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp
🧬 Code graph analysis (1)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (1)
velox/vector/LazyVector.h (1)
VectorLoader
(177-210)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (7)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.h (2)
35-38
: Docstring accurately scopes to “archives”.Good clarification aligning with the new split types.
33-34
: Forward-declaring ColumnType in the header is appropriate.Definition is only needed in the .cpp where enum values are used.
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)
289-301
: Type dispatch for Timestamp looks correct.Covers Integer, DateString, Float, and falls back to Unknown.
velox/connectors/clp/search_lib/BaseClpCursor.cpp (1)
41-52
: Parse and early-empty checks look good.Clear logging and appropriate ErrorCode mapping.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h (1)
45-45
: MarkgetNumFilteredRows()
as already const.Good work on making this method const-qualified. This properly indicates that it doesn't modify the object state.
velox/connectors/clp/search_lib/archive/ClpArchiveCursor.cpp (2)
226-230
: Good placement of stream opening and query runner initialization.The code now properly opens packed streams and initializes the query runner within
loadSplit()
, settingcurrentSplitLoaded_
only after successful initialization. This ensures proper resource management and prevents partial initialization issues.
40-44
: Proper resource cleanup in destructor.The destructor correctly checks
currentSplitLoaded_
before closing the archive reader, and with the stream opening now inloadSplit()
, this prevents resource leaks.
std::shared_ptr<clp_s::ArchiveReader> archiveReader_; | ||
int32_t currentSchemaId_{-1}; | ||
size_t currentSchemaIndex_{0}; | ||
bool currentSchemaTableLoaded_{false}; | ||
std::shared_ptr<std::vector<uint64_t>> filteredRowIndices_; | ||
std::vector<int32_t> matchedSchemas_; | ||
std::shared_ptr<clp_s::search::Projection> projection_; | ||
std::shared_ptr<ClpQueryRunner> queryRunner_; | ||
size_t readerIndex_{0}; | ||
std::shared_ptr<clp_s::search::SchemaMatch> schemaMatch_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Consider using a plain vector instead of shared_ptr
for filteredRowIndices_
.
The filteredRowIndices_
appears to have local ownership within this class. Using std::vector<uint64_t>
directly would eliminate unnecessary heap indirection and reference counting overhead.
If ClpQueryRunner::fetchNext
doesn't require shared ownership, consider refactoring to use a plain vector and pass by reference. This would require updating the ClpQueryRunner
API.
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/archive/ClpArchiveCursor.h around lines
56–65, replace the field declared as std::shared_ptr<std::vector<uint64_t>>
filteredRowIndices_ with a plain std::vector<uint64_t> filteredRowIndices_
because ownership is local; update all places that assigned or accessed
filteredRowIndices_ to use the vector directly (initialize empty, use .clear(),
.empty(), size(), operator[]/at(), iterators), change any nullptr checks to
emptiness checks, and update calls into ClpQueryRunner::fetchNext (and its
signature) to accept a reference or const-reference to std::vector<uint64_t>
instead of a shared_ptr so you avoid unnecessary heap indirection and
refcounting; finally remove any now-unused shared_ptr includes and ensure
compilation by adjusting includes and call sites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
ClpCursor
interface and modify ClpVectorLoader
to prepare support IR integration.BaseClpCursor
interface to support upcoming IR integration.
b8269e4
into
y-scope:presto-0.293-clp-connector
Description
This PR mainly refactors the current
ClpCursor
to extract an interfaceBaseClpCursor
.The original implmentation of
ClpCursor
now is moved toarchive/ClpArchiveCursor
. Later, we will create another packageir
and put the IR cursor implementation there.There is also a term change, the
archive
term used by originalClpCursor
now is renamed to split, to involve IR concept as well.Checklist
breaking change.
Validation performed
Passed the unit tests.
Passed the CI.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores