-
Notifications
You must be signed in to change notification settings - Fork 0
feat: Add CLP connector. #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add CLP connector. #14
Conversation
## Walkthrough
A new CLP connector is introduced into the Velox codebase, including its source, build, and test files. The build system is extended to support this connector and its dependencies, with new CMake modules for external libraries. Documentation is updated to describe the CLP connector and its configuration.
## Changes
| File(s) | Change Summary |
|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| CMakeLists.txt, velox/connectors/CMakeLists.txt, velox/connectors/clp/CMakeLists.txt | Added build logic and conditional inclusion for the CLP connector and its dependencies. |
| CMake/resolve_dependency_modules/*.cmake | New CMake modules added for dependency resolution: antlr4-runtime, clp, date, msgpack-cxx, nlohmann_json, spdlog, ystdlib_cpp. |
| CMake/resolve_dependency_modules/simdjson.cmake | Modified to use OVERRIDE_FIND_PACKAGE for simdjson dependency. |
| velox/connectors/clp/ClpColumnHandle.h | Added new ClpColumnHandle class for column metadata. |
| velox/connectors/clp/ClpConfig.h, velox/connectors/clp/ClpConfig.cpp | Added ClpConfig class for connector configuration and storage type handling. |
| velox/connectors/clp/ClpConnector.h, velox/connectors/clp/ClpConnector.cpp | Implemented ClpConnector and ClpConnectorFactory classes for connector creation and configuration. |
| velox/connectors/clp/ClpConnectorSplit.h | Introduced ClpConnectorSplit struct for representing data splits. |
| velox/connectors/clp/ClpDataSource.h, velox/connectors/clp/ClpDataSource.cpp | Implemented ClpDataSource for data reading, batch processing, and integration with CLP archives. |
| velox/connectors/clp/ClpTableHandle.h, velox/connectors/clp/ClpTableHandle.cpp | Added ClpTableHandle class for table and query metadata. |
| velox/connectors/clp/search_lib/CMakeLists.txt | Added static library target clp-s-search for CLP search functionality. |
| velox/connectors/clp/search_lib/ClpCursor.h, velox/connectors/clp/search_lib/ClpCursor.cpp | Implemented ClpCursor for query execution and row retrieval from CLP archives. |
| velox/connectors/clp/search_lib/ClpQueryRunner.h, velox/connectors/clp/search_lib/ClpQueryRunner.cpp | Implemented ClpQueryRunner for schema-based filtering and projected column reading. |
| velox/connectors/clp/search_lib/ClpVectorLoader.h, velox/connectors/clp/search_lib/ClpVectorLoader.cpp | Added ClpVectorLoader for loading Velox vectors from CLP column readers, including array and primitive types. |
| velox/connectors/clp/tests/CMakeLists.txt | Added test build configuration for CLP connector tests. |
| velox/connectors/clp/tests/ClpConnectorTest.cpp | Added unit tests for CLP connector, covering pushdown, hybrid, and nested filtering scenarios. |
| velox/connectors/clp/tests/examples/test_1.ndjson, velox/connectors/clp/tests/examples/test_2.ndjson | Added example NDJSON data files for test cases. |
| velox/docs/develop/connectors.rst | Updated documentation to describe the CLP connector, its architecture, and usage. |
| velox/docs/configs.rst | Added documentation for the CLP connector configuration property `clp.storage-type`. |
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant QueryEngine
participant ClpConnectorFactory
participant ClpConnector
participant ClpDataSource
participant ClpCursor
participant ClpVectorLoader
QueryEngine->>ClpConnectorFactory: newConnector(id, config)
ClpConnectorFactory->>ClpConnector: Construct with id, config
QueryEngine->>ClpConnector: createDataSource(outputType, tableHandle, columnHandles, ctx)
ClpConnector->>ClpDataSource: Construct with schema, handles, pool, config
QueryEngine->>ClpDataSource: addSplit(split)
ClpDataSource->>ClpCursor: Construct with inputSource, archivePath
ClpDataSource->>ClpCursor: executeQuery(query, outputColumns)
loop For each batch
QueryEngine->>ClpDataSource: next(batchSize, future)
ClpDataSource->>ClpCursor: fetchNext(batchSize, filteredRowIndices)
alt rows found
ClpDataSource->>ClpVectorLoader: createVector for each projected column
ClpVectorLoader-->>ClpDataSource: Loaded Velox vector
ClpDataSource-->>QueryEngine: RowVectorPtr
else no rows
ClpDataSource-->>QueryEngine: std::nullopt
end
end sequenceDiagram
participant ClpDataSource
participant ClpCursor
participant ClpQueryRunner
ClpDataSource->>ClpCursor: executeQuery(query, outputColumns)
ClpCursor->>ClpQueryRunner: init(schemaReader, columnMap)
loop For each fetchNext call
ClpDataSource->>ClpCursor: fetchNext(numRows, filteredRowIndices)
ClpCursor->>ClpQueryRunner: fetchNext(numRows, filteredRowIndices)
ClpQueryRunner-->>ClpCursor: filteredRowIndices
ClpCursor-->>ClpDataSource: projectedColumns
end
Suggested reviewers
|
c5bba79
to
b0e5211
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 31
🔭 Outside diff range comments (3)
velox/connectors/clp/CMakeLists.txt (1)
32-32
: Add missing newline at end of file.The pipeline failure indicates a missing newline at the end of the file.
add_subdirectory(tests) endif() +
CMake/resolve_dependency_modules/clp.cmake (1)
48-48
: Add missing newline at end of file.Based on the pipeline failure pattern, this file is likely missing a newline at the end.
add_subdirectory(${clp_SOURCE_DIR}/components/core ${clp_BINARY_DIR}/components/core) +
CMake/resolve_dependency_modules/ystdlib_cpp.cmake (1)
25-25
: Add missing newline at end of file.Based on the pipeline failure pattern, this file is likely missing a newline at the end.
set(CLP_YSTDLIB_SOURCE_DIRECTORY "${ystdlib_cpp_SOURCE_DIR}") +
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (33)
CMake/resolve_dependency_modules/antlr4-runtime.cmake
(1 hunks)CMake/resolve_dependency_modules/clp.cmake
(1 hunks)CMake/resolve_dependency_modules/date.cmake
(1 hunks)CMake/resolve_dependency_modules/msgpack-cxx.cmake
(1 hunks)CMake/resolve_dependency_modules/nlohmann_json.cmake
(1 hunks)CMake/resolve_dependency_modules/simdjson.cmake
(1 hunks)CMake/resolve_dependency_modules/spdlog.cmake
(1 hunks)CMake/resolve_dependency_modules/ystdlib_cpp.cmake
(1 hunks)CMakeLists.txt
(4 hunks)velox/connectors/CMakeLists.txt
(1 hunks)velox/connectors/clp/CMakeLists.txt
(1 hunks)velox/connectors/clp/ClpColumnHandle.h
(1 hunks)velox/connectors/clp/ClpConfig.cpp
(1 hunks)velox/connectors/clp/ClpConfig.h
(1 hunks)velox/connectors/clp/ClpConnector.cpp
(1 hunks)velox/connectors/clp/ClpConnector.h
(1 hunks)velox/connectors/clp/ClpConnectorSplit.h
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(1 hunks)velox/connectors/clp/ClpDataSource.h
(1 hunks)velox/connectors/clp/ClpTableHandle.cpp
(1 hunks)velox/connectors/clp/ClpTableHandle.h
(1 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.h
(1 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.h
(1 hunks)velox/connectors/clp/tests/CMakeLists.txt
(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp
(1 hunks)velox/connectors/clp/tests/examples/test_1.ndjson
(1 hunks)velox/connectors/clp/tests/examples/test_2.ndjson
(1 hunks)velox/docs/develop/connectors.rst
(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
velox/connectors/clp/ClpConnectorSplit.h (3)
velox/connectors/clp/ClpConfig.h (1)
nodiscard
(42-45)velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)velox/connectors/clp/ClpTableHandle.h (2)
nodiscard
(32-34)nodiscard
(36-38)
velox/connectors/clp/ClpTableHandle.h (4)
velox/connectors/clp/ClpConfig.h (1)
nodiscard
(42-45)velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(26-28)velox/connectors/hive/TableHandle.h (1)
tableName_
(146-148)
velox/connectors/clp/ClpConfig.h (3)
velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)velox/connectors/clp/ClpTableHandle.h (2)
nodiscard
(32-34)nodiscard
(36-38)velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(26-28)
velox/connectors/clp/ClpConnector.h (3)
velox/connectors/clp/ClpConfig.h (1)
nodiscard
(42-45)velox/connectors/clp/ClpTableHandle.h (2)
nodiscard
(32-34)nodiscard
(36-38)velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(26-28)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/ClpConnector.h (1)
outputType
(38-44)
🪛 GitHub Actions: Linux Build using GCC
CMake/resolve_dependency_modules/nlohmann_json.cmake
[error] 34-34: CMake Error: URL_HASH is set to a raw hash string but must be in the format ALGO=value where ALGO is one of MD5, SHA1, SHA224, SHA256, SHA384, SHA512, SHA3_224, SHA3_256, SHA3_384, SHA3_512. This caused the nlohmann_json dependency fetch to fail.
🪛 GitHub Actions: Run Checks
CMakeLists.txt
[error] 1-1: end-of-file-fixer hook failed and modified this file to fix missing newline at end of file.
velox/connectors/clp/search_lib/CMakeLists.txt
[error] 1-1: end-of-file-fixer hook failed and modified this file to fix missing newline at end of file.
velox/docs/develop/connectors.rst
[error] 147-147: end-of-file-fixer hook failed and fixed missing newline at end of file.
🔇 Additional comments (35)
CMake/resolve_dependency_modules/date.cmake (1)
32-38
: LGTM - FetchContent configuration is properly structured.The FetchContent declaration correctly uses the resolved URL, includes SHA256 verification, and sets OVERRIDE_FIND_PACKAGE for consistent dependency resolution.
CMake/resolve_dependency_modules/simdjson.cmake (1)
30-31
: LGTM - Proper dependency override configuration.Adding
OVERRIDE_FIND_PACKAGE
ensures consistent usage of the fetched simdjson version over any system-installed packages, which is the correct approach for reproducible builds.velox/connectors/CMakeLists.txt (1)
20-23
: LGTM - Consistent conditional build pattern.The CLP connector integration follows the established pattern used for other connectors (Hive, TPC-H), ensuring consistent build behaviour.
velox/connectors/clp/tests/CMakeLists.txt (2)
14-25
: LGTM - Well-structured test configuration.The test setup correctly creates the executable, links all necessary dependencies (connector, test libraries, GTest), and registers with CTest for automated testing.
26-27
: LGTM - Proper test data setup.Copying example data files to the build directory ensures test data is available at runtime for the CLP connector tests.
velox/connectors/clp/tests/examples/test_1.ndjson (1)
1-10
: LGTM - Comprehensive test data coverage.The NDJSON test data provides excellent coverage with diverse HTTP request scenarios, including various methods, authentication states, and realistic response characteristics. The consistent schema and mix of null/non-null values will effectively test the CLP connector's data handling capabilities.
CMakeLists.txt (2)
130-130
: LGTM! CLP connector option added correctly.The new build option follows the established pattern and is properly integrated with the minimal build configuration.
525-552
: LGTM! Dependency resolution follows established patterns.The conditional dependency resolution for the CLP connector is well-structured and follows the existing patterns in the codebase. The dependencies appear comprehensive for a log processing connector.
velox/connectors/clp/tests/examples/test_2.ndjson (1)
1-10
: LGTM! Comprehensive test data for CLP connector.The test data provides good coverage of different event types and JSON structures, which will be valuable for testing the CLP connector functionality.
CMake/resolve_dependency_modules/clp.cmake (1)
16-47
: LGTM! Well-structured dependency resolution.The CLP dependency is properly configured with specific commit hash for reproducibility and appropriate build options disabled for the connector use case.
CMake/resolve_dependency_modules/ystdlib_cpp.cmake (1)
16-24
: LGTM! Proper dependency resolution structure.The ystdlib_cpp dependency is correctly configured with specific commit hash and appropriate variable setting for the CLP connector.
velox/connectors/clp/search_lib/CMakeLists.txt (1)
17-17
: Verify that theCLP_SRC_FILES
variable is defined.The variable
${CLP_SRC_FILES}
is referenced but not defined in this file. Ensure this variable is properly set in a parent CMake file or consider removing it if unused.#!/bin/bash # Description: Search for CLP_SRC_FILES definition in CMake files # Expected: Find where CLP_SRC_FILES is defined rg -A 3 -B 3 "CLP_SRC_FILES" --type cmakevelox/connectors/clp/ClpConnectorSplit.h (1)
22-31
: Well-designed split implementation.The struct follows good design principles with immutable data members, proper inheritance, and clear string representation.
CMake/resolve_dependency_modules/msgpack-cxx.cmake (1)
14-33
: Well-structured dependency module.This CMake module follows excellent practices with version pinning, SHA256 checksum verification, proper FetchContent usage, and include guards. The dependency resolution is secure and reproducible.
CMake/resolve_dependency_modules/nlohmann_json.cmake (1)
30-32
: Good practice disabling tests for external dependency.Properly disabling JSON tests to avoid building unnecessary components for the external dependency.
CMake/resolve_dependency_modules/spdlog.cmake (2)
33-34
: LGTM: External fmt configuration is appropriate.Setting
SPDLOG_FMT_EXTERNAL ON
before making the content available is a good practice to use an external fmt library rather than spdlog's bundled version, which helps avoid potential conflicts.
16-21
: Verify spdlog v1.12.0 checksum using PythonSince the sandbox lacks both
sha256sum
andopenssl
, let’s compute the SHA-256 via Python:#!/bin/bash URL="https://github.com/gabime/spdlog/archive/refs/tags/v1.12.0.tar.gz" EXPECTED="4dccf2d10f410c1e2feaff89966bfc49a1abb29ef6f08246335b110e001e09a9" echo "Downloading and verifying spdlog v1.12.0 checksum via Python..." ACTUAL=$(curl -sL "$URL" | python3 - << 'EOF' import sys, hashlib data = sys.stdin.buffer.read() print(hashlib.sha256(data).hexdigest()) EOF ) echo "Expected: $EXPECTED" echo "Actual: $ACTUAL" if [ "$ACTUAL" = "$EXPECTED" ]; then echo "✓ Checksum matches" else echo "✗ Checksum mismatch" fiCMake/resolve_dependency_modules/antlr4-runtime.cmake (1)
16-21
: I noticedsha256sum
isn’t available in the sandbox, so let’s re-run the checksum verification using Python’s hashlib:#!/bin/bash # Description: Verify the SHA256 checksum for ANTLR4 v4.13.2 using Python URL="https://github.com/antlr/antlr4/archive/refs/tags/4.13.2.tar.gz" EXPECTED_CHECKSUM="9f18272a9b32b622835a3365f850dd1063d60f5045fb1e12ce475ae6e18a35bb" echo "Downloading ANTLR4 v4.13.2 and computing SHA256..." ACTUAL_CHECKSUM=$(curl -sL "$URL" | python3 - << 'EOF' import sys, hashlib data = sys.stdin.buffer.read() print(hashlib.sha256(data).hexdigest()) EOF ) echo "Expected: $EXPECTED_CHECKSUM" echo "Actual: $ACTUAL_CHECKSUM" if [ "$ACTUAL_CHECKSUM" = "$EXPECTED_CHECKSUM" ]; then echo "✓ Checksum matches" else echo "✗ Checksum mismatch" fivelox/connectors/clp/ClpColumnHandle.h (1)
34-48
: LGTM: Well-designed getter methods.The getter methods follow best practices:
- Return const references for strings and TypePtr to avoid unnecessary copies
- Return by value for the boolean (which is efficient)
- Marked as const methods
- Use
[[nodiscard]]
attributes would be beneficial but not strictly necessaryvelox/connectors/clp/ClpConfig.cpp (3)
23-32
: LGTM: Robust string-to-enum conversion.The
stringToStorageType
function correctly handles:
- Case-insensitive comparison using
boost::algorithm::to_upper_copy
- Clear error messaging with the original input value
- Support for both "FS" and "S3" storage types
36-38
: LGTM: Appropriate default value handling.The
storageType()
method properly uses "FS" as the default storage type when the configuration key is not present, which is a sensible default for filesystem-based storage.
17-17
: Verify boost algorithm header inclusion.The code uses
boost::algorithm::to_upper_copy
but doesn't explicitly include the required header. Ensure thatClpConfig.h
or a transitive dependency includes<boost/algorithm/string.hpp>
.#!/bin/bash # Description: Check if boost algorithm header is included in ClpConfig.h or its dependencies # Expected: The header should be included somewhere in the dependency chain echo "Checking for boost algorithm includes in ClpConfig.h..." if [ -f "velox/connectors/clp/ClpConfig.h" ]; then grep -n "boost.*algorithm" velox/connectors/clp/ClpConfig.h || echo "Not found in ClpConfig.h" fi echo "Checking for boost algorithm includes in related connector headers..." fd -e h . velox/connectors/clp/ --exec grep -l "boost.*algorithm" {} \; 2>/dev/null || echo "No boost algorithm includes found"velox/connectors/clp/ClpTableHandle.h (2)
24-30
: LGTM: Well-designed constructor with move semantics.The constructor properly:
- Initialises the base class with connectorId
- Uses move semantics for the shared_ptr parameter to avoid unnecessary copying
- Initialises all members in the member initialiser list
32-38
: LGTM: Consistent getter design.Both getter methods follow good practices:
- Marked with
[[nodiscard]]
to prevent accidental ignoring of return values- Return const references to avoid unnecessary copying
- Marked as const methods
velox/connectors/clp/ClpConfig.h (1)
47-47
: ```shell
#!/bin/bashVerify storageType() definition in ClpConfig.cpp
grep -R "StorageType\s+ClpConfig::storageType" -n velox/connectors/clp/ClpConfig.cpp
grep -R "storageType() *{" -n velox/connectors/clp/ClpConfig.cpp</details> <details> <summary>velox/docs/develop/connectors.rst (1)</summary> `125-150`: **LGTM - Comprehensive CLP connector documentation.** The documentation provides excellent coverage of the CLP connector architecture, including detailed explanations of ClpConnectorSplit, ClpDataSource initialization, query processing, and the lazy vector loading mechanism. The content is well-structured and informative. </details> <details> <summary>velox/connectors/clp/search_lib/ClpQueryRunner.cpp (2)</summary> `26-60`: **LGTM - Well-structured initialization logic.** The `init` method correctly handles projection mapping by iterating through ordered matching nodes and finding corresponding column readers. The fallback to nullptr for missing readers is appropriate and allows for robust handling of incomplete projections. --- `62-80`: **LGTM - Efficient message filtering implementation.** The `fetchNext` method implements an efficient filtering loop that respects the requested row count while scanning messages. The logic correctly accumulates filtered row indices and tracks the number of rows scanned for performance monitoring. </details> <details> <summary>velox/connectors/clp/ClpConnector.cpp (3)</summary> `23-26`: **LGTM - Clean connector initialization.** The constructor correctly initializes the base class and wraps the provided config in a ClpConfig instance for type-safe configuration access. --- `28-41`: **LGTM - Proper data source creation.** The `createDataSource` method correctly instantiates ClpDataSource with all necessary parameters including output type, table handle, column handles, memory pool, and configuration. --- `43-49`: **LGTM - Appropriate handling of unsupported functionality.** The `createDataSink` method correctly throws a NYI exception since the CLP connector is designed for read-only operations. This is consistent with the documentation stating that CLP connector doesn't implement the DataSink interface. </details> <details> <summary>velox/connectors/clp/search_lib/ClpVectorLoader.h (3)</summary> `24-34`: **LGTM - Well-designed VectorLoader specialization.** The class provides a clean interface for loading CLP data into Velox vectors. The constructor parameters appropriately capture the column reader, type information, and filtered indices needed for selective data loading. --- `50-52`: **LGTM - Efficient thread-local JSON parser.** The thread-local static parser is a good optimization for JSON parsing, avoiding the overhead of creating parser instances for each array operation while maintaining thread safety. --- `43-44`: **LGTM - Flexible templated data population.** The templated `populateData` method provides a flexible approach for handling different vector types while maintaining type safety. This design supports the various data types mentioned in the class documentation. </details> <details> <summary>velox/connectors/clp/search_lib/ClpVectorLoader.cpp (1)</summary> `31-54`: **Fix potential memory safety issue with StringView** The code creates a `StringView` from a temporary `std::string` returned by `extract_value`. This is unsafe because `StringView` doesn't own the data and the temporary string will be destroyed after line 46, leaving the `StringView` pointing to deallocated memory. ```diff if constexpr (std::is_same_v<T, std::string>) { - auto string_value = - std::get<std::string>(columnReader_->extract_value(messageIndex)); - vector->set(vectorIndex, StringView(string_value)); + // StringView requires the underlying string to remain valid. + // The FlatVector should handle string storage internally. + vector->set(vectorIndex, std::get<std::string>(columnReader_->extract_value(messageIndex))); } else {
Likely an incorrect or invalid review comment.
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
#set(CLP_SRC_DIR ${clp_SOURCE_DIR}/components/core/src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Remove commented code.
This commented line appears to be leftover code and should be removed to keep the build file clean.
-#set(CLP_SRC_DIR ${clp_SOURCE_DIR}/components/core/src)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
#set(CLP_SRC_DIR ${clp_SOURCE_DIR}/components/core/src) |
🤖 Prompt for AI Agents
In velox/connectors/clp/CMakeLists.txt at line 14, remove the commented out line
setting CLP_SRC_DIR as it is leftover code and not needed, to keep the build
file clean and maintainable.
clp_s::search | ||
clp_s::search::kql | ||
velox_vector) | ||
target_compile_features(clp-s-search PRIVATE cxx_std_20) No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the missing newline at end of file.
The pipeline failure indicates a missing newline at the end of the file.
-target_compile_features(clp-s-search PRIVATE cxx_std_20)
+target_compile_features(clp-s-search PRIVATE cxx_std_20)
+
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
target_compile_features(clp-s-search PRIVATE cxx_std_20) | |
target_compile_features(clp-s-search PRIVATE cxx_std_20) | |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/CMakeLists.txt at line 34, the file is
missing a newline at the end, causing pipeline failure. Add a newline character
after the last line to ensure the file ends properly.
: connector::ConnectorSplit(connectorId), path_(path) {} | ||
|
||
[[nodiscard]] std::string toString() const override { | ||
return fmt::format("CLP Split: {}", path_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add missing include for fmt::format.
The fmt::format
function is used but the required header is not included.
#pragma once
#include "velox/connectors/Connector.h"
+#include <fmt/format.h>
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
return fmt::format("CLP Split: {}", path_); | |
#pragma once | |
#include "velox/connectors/Connector.h" | |
#include <fmt/format.h> | |
// ... rest of ClpConnectorSplit.h remains unchanged ... |
🤖 Prompt for AI Agents
In velox/connectors/clp/ClpConnectorSplit.h at line 27, the code uses
fmt::format but does not include the necessary header. Add the appropriate
include directive for the fmt library, typically #include <fmt/format.h>, at the
top of the file to ensure fmt::format is recognized and compiles correctly.
static std::string getExampleFilePath(const std::string& filePath) { | ||
std::string current_path = fs::current_path().string(); | ||
return current_path + "/examples/" + filePath; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Verification inconclusive
Avoid using fs::current_path()
in tests
Using fs::current_path()
makes tests dependent on the working directory, which can lead to flaky tests when run from different locations. Consider using a more robust approach to locate test files.
Consider these alternatives:
- Use an environment variable or CMake-defined path
- Use
__FILE__
to determine the test file location and construct relative paths from there - Embed test data as string literals in the test file
Here's an example using __FILE__
:
static std::string getExampleFilePath(const std::string& filePath) {
// Get the directory containing this test file
std::string testFilePath = __FILE__;
auto lastSlash = testFilePath.find_last_of("/\\");
std::string testDir = testFilePath.substr(0, lastSlash);
// Construct path relative to test directory
return testDir + "/examples/" + filePath;
}
Avoid using fs::current_path()
in tests
Using fs::current_path()
makes tests dependent on the working directory, which can lead to flaky tests when run from different locations. Consider using a more robust approach to locate test files.
Consider these alternatives:
- Use an environment variable or CMake-defined path
- Use
__FILE__
to determine the test file location and construct relative paths from there - Embed test data as string literals in the test file
Here's an example using __FILE__
:
static std::string getExampleFilePath(const std::string& filePath) {
// Get the directory containing this test file
std::string testFilePath = __FILE__;
auto lastSlash = testFilePath.find_last_of("/\\");
std::string testDir = testFilePath.substr(0, lastSlash);
// Construct path relative to test directory
return testDir + "/examples/" + filePath;
}
🤖 Prompt for AI Agents
In velox/connectors/clp/tests/ClpConnectorTest.cpp around lines 76 to 79,
replace the use of fs::current_path() with a method that derives the path
relative to the test file location using __FILE__. Extract the directory path
from __FILE__ by finding the last slash, then construct the example file path
relative to that directory to avoid dependency on the working directory and make
the tests more robust.
ErrorCode ClpCursor::loadArchive() { | ||
auto networkAuthOption = inputSource_ == InputSource::Filesystem | ||
? NetworkAuthOption{.method = AuthMethod::None} | ||
: NetworkAuthOption{.method = AuthMethod::S3PresignedUrlV4}; | ||
|
||
try { | ||
archiveReader_->open( | ||
get_path_object_for_raw_path(archivePath_), networkAuthOption); | ||
} catch (std::exception& e) { | ||
VLOG(2) << "Failed to open archive file: " << e.what(); | ||
return ErrorCode::InternalError; | ||
} | ||
|
||
auto timestampDict = archiveReader_->get_timestamp_dictionary(); | ||
auto schemaTree = archiveReader_->get_schema_tree(); | ||
auto schemaMap = archiveReader_->get_schema_map(); | ||
|
||
EvaluateTimestampIndex timestampIndex(timestampDict); | ||
if (clp_s::EvaluatedValue::False == timestampIndex.run(expr_)) { | ||
VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; | ||
return ErrorCode::InvalidTimestampRange; | ||
} | ||
|
||
schemaMatch_ = std::make_shared<SchemaMatch>(schemaTree, schemaMap); | ||
if (expr_ = schemaMatch_->run(expr_); | ||
std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | ||
VLOG(2) << "No matching schemas for query '" << query_ << "'"; | ||
return ErrorCode::SchemaNotFound; | ||
} | ||
|
||
projection_ = std::make_shared<Projection>( | ||
outputColumns_.empty() ? ReturnAllColumns : ReturnSelectedColumns); | ||
try { | ||
for (auto const& column : outputColumns_) { | ||
std::vector<std::string> descriptorTokens; | ||
std::string descriptorNamespace; | ||
if (false == | ||
tokenize_column_descriptor( | ||
column.name, descriptorTokens, descriptorNamespace)) { | ||
VLOG(2) << "Can not tokenize invalid column: '" << column.name << "'"; | ||
return ErrorCode::InternalError; | ||
} | ||
|
||
auto columnDescriptor = ColumnDescriptor::create_from_escaped_tokens( | ||
descriptorTokens, descriptorNamespace); | ||
switch (column.type) { | ||
case ColumnType::String: | ||
columnDescriptor->set_matching_types( | ||
LiteralType::ClpStringT | LiteralType::VarStringT | | ||
LiteralType::EpochDateT); | ||
break; | ||
case ColumnType::Integer: | ||
columnDescriptor->set_matching_types(LiteralType::IntegerT); | ||
break; | ||
case ColumnType::Float: | ||
columnDescriptor->set_matching_types(LiteralType::FloatT); | ||
break; | ||
case ColumnType::Boolean: | ||
columnDescriptor->set_matching_types(LiteralType::BooleanT); | ||
break; | ||
case ColumnType::Array: | ||
columnDescriptor->set_matching_types(LiteralType::ArrayT); | ||
break; | ||
default: | ||
break; | ||
} | ||
|
||
projection_->add_column(columnDescriptor); | ||
} | ||
} catch (TraceableException& e) { | ||
VLOG(2) << e.what(); | ||
return ErrorCode::InternalError; | ||
} | ||
projection_->resolve_columns(schemaTree); | ||
archiveReader_->set_projection(projection_); | ||
|
||
archiveReader_->read_metadata(); | ||
|
||
matchedSchemas_.clear(); | ||
for (auto schemaId : archiveReader_->get_schema_ids()) { | ||
if (schemaMatch_->schema_matched(schemaId)) { | ||
matchedSchemas_.push_back(schemaId); | ||
} | ||
} | ||
|
||
if (matchedSchemas_.empty()) { | ||
return ErrorCode::SchemaNotFound; | ||
} | ||
|
||
EvaluateTimestampIndex timestamp_index(timestampDict); | ||
if (EvaluatedValue::False == timestamp_index.run(expr_)) { | ||
VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; | ||
return ErrorCode::InvalidTimestampRange; | ||
} | ||
|
||
archiveReader_->read_variable_dictionary(); | ||
archiveReader_->read_log_type_dictionary(); | ||
archiveReader_->read_array_dictionary(); | ||
|
||
currentSchemaIndex_ = 0; | ||
currentSchemaTableLoaded_ = false; | ||
return ErrorCode::Success; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Consider refactoring loadArchive
into smaller methods
This method is over 100 lines long and handles multiple responsibilities. Consider breaking it down into smaller, focused methods for better maintainability and testability:
openArchiveWithAuth()
evaluateTimestampConstraints()
matchSchemas()
setupProjection()
loadDictionaries()
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpCursor.cpp around lines 150 to 252, the
loadArchive method is too long and handles multiple responsibilities. Refactor
this method by extracting logical sections into smaller private methods: create
openArchiveWithAuth() to handle archive opening with authentication,
evaluateTimestampConstraints() to run timestamp index evaluation, matchSchemas()
to perform schema matching and filtering, setupProjection() to configure the
projection based on output columns, and loadDictionaries() to read variable, log
type, and array dictionaries. Then update loadArchive to call these new methods
sequentially, improving readability and maintainability.
EvaluateTimestampIndex timestamp_index(timestampDict); | ||
if (EvaluatedValue::False == timestamp_index.run(expr_)) { | ||
VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; | ||
return ErrorCode::InvalidTimestampRange; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate timestamp evaluation
This timestamp evaluation is redundant as the same check is already performed at lines 167-171. The expression and timestamp dictionary haven't changed between these two checks.
- EvaluateTimestampIndex timestamp_index(timestampDict);
- if (EvaluatedValue::False == timestamp_index.run(expr_)) {
- VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'";
- return ErrorCode::InvalidTimestampRange;
- }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
EvaluateTimestampIndex timestamp_index(timestampDict); | |
if (EvaluatedValue::False == timestamp_index.run(expr_)) { | |
VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; | |
return ErrorCode::InvalidTimestampRange; | |
} |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpCursor.cpp between lines 239 and 243,
remove the redundant evaluation of the timestamp index since the same check is
already done at lines 167-171 with unchanged expression and timestamp
dictionary. Simply delete the block that creates EvaluateTimestampIndex and runs
the check to avoid duplicate evaluation.
if (EvaluatedValue::False == | ||
queryRunner_->schema_init(currentSchemaId_)) { | ||
currentSchemaIndex_ += 1; | ||
currentSchemaTableLoaded_ = false; | ||
errorCode_ = ErrorCode::DictionaryNotFound; | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix error code handling in schema iteration
Setting errorCode_
to DictionaryNotFound
at line 80 is problematic because it gets overwritten in the next iteration. If all schemas fail, the final error code might not reflect the actual issues encountered.
Consider tracking schema-specific errors or logging them without overwriting the error code:
if (EvaluatedValue::False ==
queryRunner_->schema_init(currentSchemaId_)) {
+ LOG(WARNING) << "Dictionary not found for schema " << currentSchemaId_;
currentSchemaIndex_ += 1;
currentSchemaTableLoaded_ = false;
- errorCode_ = ErrorCode::DictionaryNotFound;
continue;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if (EvaluatedValue::False == | |
queryRunner_->schema_init(currentSchemaId_)) { | |
currentSchemaIndex_ += 1; | |
currentSchemaTableLoaded_ = false; | |
errorCode_ = ErrorCode::DictionaryNotFound; | |
continue; | |
} | |
if (EvaluatedValue::False == | |
queryRunner_->schema_init(currentSchemaId_)) { | |
LOG(WARNING) << "Dictionary not found for schema " << currentSchemaId_; | |
currentSchemaIndex_ += 1; | |
currentSchemaTableLoaded_ = false; | |
continue; | |
} |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpCursor.cpp around lines 76 to 82, the
errorCode_ is set to DictionaryNotFound inside a loop but gets overwritten in
subsequent iterations, causing loss of error context if all schemas fail. Modify
the code to track errors per schema or log each failure without overwriting
errorCode_, and after the loop completes, set errorCode_ to reflect the most
relevant or aggregated error status.
if (expr_ = standardizePass.run(expr_); | ||
std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | ||
VLOG(2) << "Query '" << query_ << "' is logically false"; | ||
return ErrorCode::LogicalError; | ||
} | ||
|
||
NarrowTypes narrowPass; | ||
if (expr_ = narrowPass.run(expr_); | ||
std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | ||
VLOG(2) << "Query '" << query_ << "' is logically false"; | ||
return ErrorCode::LogicalError; | ||
} | ||
|
||
ConvertToExists convertPass; | ||
if (expr_ = convertPass.run(expr_); | ||
std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | ||
VLOG(2) << "Query '" << query_ << "' is logically false"; | ||
return ErrorCode::LogicalError; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Refactor assignments inside if conditions for clarity
The assignments inside if conditions (lines 127, 134, 141) are valid but can be confusing. Consider separating the assignment from the condition check for better readability.
- if (expr_ = standardizePass.run(expr_);
- std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+ expr_ = standardizePass.run(expr_);
+ if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
VLOG(2) << "Query '" << query_ << "' is logically false";
return ErrorCode::LogicalError;
}
NarrowTypes narrowPass;
- if (expr_ = narrowPass.run(expr_);
- std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+ expr_ = narrowPass.run(expr_);
+ if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
VLOG(2) << "Query '" << query_ << "' is logically false";
return ErrorCode::LogicalError;
}
ConvertToExists convertPass;
- if (expr_ = convertPass.run(expr_);
- std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+ expr_ = convertPass.run(expr_);
+ if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
VLOG(2) << "Query '" << query_ << "' is logically false";
return ErrorCode::LogicalError;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if (expr_ = standardizePass.run(expr_); | |
std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | |
VLOG(2) << "Query '" << query_ << "' is logically false"; | |
return ErrorCode::LogicalError; | |
} | |
NarrowTypes narrowPass; | |
if (expr_ = narrowPass.run(expr_); | |
std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | |
VLOG(2) << "Query '" << query_ << "' is logically false"; | |
return ErrorCode::LogicalError; | |
} | |
ConvertToExists convertPass; | |
if (expr_ = convertPass.run(expr_); | |
std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | |
VLOG(2) << "Query '" << query_ << "' is logically false"; | |
return ErrorCode::LogicalError; | |
} | |
expr_ = standardizePass.run(expr_); | |
if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | |
VLOG(2) << "Query '" << query_ << "' is logically false"; | |
return ErrorCode::LogicalError; | |
} | |
NarrowTypes narrowPass; | |
expr_ = narrowPass.run(expr_); | |
if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | |
VLOG(2) << "Query '" << query_ << "' is logically false"; | |
return ErrorCode::LogicalError; | |
} | |
ConvertToExists convertPass; | |
expr_ = convertPass.run(expr_); | |
if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { | |
VLOG(2) << "Query '" << query_ << "' is logically false"; | |
return ErrorCode::LogicalError; | |
} |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpCursor.cpp around lines 127 to 145, the
code assigns values to expr_ inside if conditions, which reduces clarity.
Refactor by first assigning expr_ the result of each pass.run(expr_) call on
separate lines, then use a separate if statement to check if expr_ is a
dynamic_pointer_cast to EmptyExpr. This separation improves readability and
makes the code easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
velox/connectors/clp/search_lib/CMakeLists.txt (1)
34-35
: Add the missing newline at end-of-file
The earlier review already flagged this; the file still ends without a newline, which triggers some linters/CI checks.target_compile_features(clp-s-search PRIVATE cxx_std_20) +
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
CMake/resolve_dependency_modules/ystdlib_cpp.cmake
(1 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)velox/docs/develop/connectors.rst
(2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linux Build using GCC
velox/connectors/clp/search_lib/CMakeLists.txt
[error] 22-22: ExternalProject_Add failed due to invalid URL_HASH format for nlohmann_json dependency.
🔇 Additional comments (1)
CMake/resolve_dependency_modules/ystdlib_cpp.cmake (1)
1-1
: Header cleanup looks good.The stray UI artefact reported earlier is gone – thanks for fixing it.
FetchContent_Declare( | ||
ystdlib_cpp | ||
GIT_REPOSITORY https://github.com/y-scope/ystdlib-cpp.git | ||
GIT_TAG d80cf86e1a1f2dae6421978c8ee353408368f424) | ||
|
||
FetchContent_Populate(ystdlib_cpp) | ||
|
||
set(CLP_YSTDLIB_SOURCE_DIRECTORY "${ystdlib_cpp_SOURCE_DIR}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Shallow-clone the dependency to speed up CMake configure time.
FetchContent_Declare
defaults to a full clone. For libraries that are used in a header-only fashion (as appears to be the case here), fetching history is unnecessary and slows CI builds.
FetchContent_Declare(
ystdlib_cpp
GIT_REPOSITORY https://github.com/y-scope/ystdlib-cpp.git
GIT_TAG d80cf86e1a1f2dae6421978c8ee353408368f424
+ GIT_SHALLOW TRUE # Only fetch the required commit
+ GIT_PROGRESS TRUE # Show progress in verbose builds
+ FETCHCONTENT_UPDATES_DISCONNECTED ON # Reproducible builds
)
These flags are entirely optional but cheap to adopt and improve build ergonomics.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/ystdlib_cpp.cmake around lines 16 to 23, the
FetchContent_Declare call performs a full git clone by default, which slows down
CMake configure time. Modify the FetchContent_Declare command to include shallow
clone options such as GIT_SHALLOW TRUE or GIT_DEPTH 1 to fetch only the latest
commit, speeding up the clone process for this header-only library.
CLP Connector | ||
------------- | ||
The CLP Connector is used to read CLP archives stored in a local file system or S3. It implements similar | ||
interfaces as the Hive Connector except for the `DataSink` interface. Here we only describe the `DataSource` | ||
interface and the `ConnectorSplit` interface implementation since `Connector` and `ConnectorFactory` are | ||
similar to the Hive Connector. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Clarify “both” vs. “three” connectors.
The preceding section lists three connectors, but the sentence begins with “both”. Consider re-phrasing to avoid possible confusion.
E.g.:
-similar to the Hive Connector.
+similar to the Hive connector.
…or explicitly state that only the DataSource side of Hive and CLP are discussed here.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 125 to 131, the text
incorrectly uses "both" when referring to connectors, causing confusion since
three connectors are listed earlier. Update the sentence to clarify that only
the DataSource side of the Hive and CLP connectors are discussed here, or
rephrase to explicitly mention the two connectors being referenced instead of
"both."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
CMake/resolve_dependency_modules/date.cmake (1)
19-21
: Fix variable reference inconsistency.The URL construction uses
DATE_BUILD_VERSION
but the variable is defined asVELOX_DATE_BUILD_VERSION
on line 16.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
CMake/resolve_dependency_modules/date.cmake
(1 hunks)velox/docs/configs.rst
(1 hunks)
9d0e735
to
46754a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
CMake/resolve_dependency_modules/nlohmann_json.cmake (1)
27-31
: MissingSHA256=
prefix inURL_HASH
.Same concern as previously raised: CMake requires the algorithm prefix.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
CMake/resolve_dependency_modules/date.cmake
(1 hunks)CMake/resolve_dependency_modules/nlohmann_json.cmake
(1 hunks)
set(CMAKE_INSTALL_MESSAGE | ||
LAZY | ||
CACHE STRING "" FORCE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Global override of CMAKE_INSTALL_MESSAGE
may have unintended side-effects.
Forcing CMAKE_INSTALL_MESSAGE=LAZY
with CACHE ... FORCE
changes behaviour for all subsequent projects, not just date
.
Consider setting it locally (e.g., via set_property(TARGET date PROPERTY ...)
) or omitting the FORCE
flag.
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/date.cmake at lines 26 to 28, avoid globally
overriding CMAKE_INSTALL_MESSAGE with CACHE and FORCE as it affects all
projects. Instead, remove the FORCE flag to prevent forcing the value globally
or set the install message property locally on the date target using
set_property(TARGET date PROPERTY ...). This confines the setting to the date
module only and avoids side effects on other projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anlowee Thanks for confirming—glad we’re aligned. Let me know if you’d like any help scoping that CMAKE_INSTALL_MESSAGE
change.
🧠 Learnings used
Learnt from: anlowee
PR: y-scope/velox#14
File: CMake/resolve_dependency_modules/date.cmake:32-36
Timestamp: 2025-07-03T17:36:47.311Z
Learning: In CMake FetchContent_Declare, the URL_HASH argument requires an algorithm prefix (e.g., SHA256=) followed by the checksum value. Without this prefix, CMake silently skips checksum validation, which defeats supply-chain protection.
FetchContent_Declare( | ||
date | ||
URL ${VELOX_DATE_SOURCE_URL} | ||
URL_HASH ${VELOX_DATE_BUILD_SHA256_CHECKSUM} | ||
OVERRIDE_FIND_PACKAGE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URL_HASH is missing the required algorithm prefix (SHA256=
).
CMake expects the URL_HASH
argument in the form SHA256=<hash>
.
Without the prefix, checksum validation is silently skipped, defeating supply-chain protection.
FetchContent_Declare(
date
URL ${VELOX_DATE_SOURCE_URL}
- URL_HASH ${VELOX_DATE_BUILD_SHA256_CHECKSUM}
+ URL_HASH SHA256=${VELOX_DATE_BUILD_SHA256_CHECKSUM}
OVERRIDE_FIND_PACKAGE)
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/date.cmake around lines 32 to 36, the
URL_HASH argument is missing the required algorithm prefix. Update the URL_HASH
value to include the prefix "SHA256=" followed by the checksum, so it reads as
SHA256=${VELOX_DATE_BUILD_SHA256_CHECKSUM}. This ensures CMake performs checksum
validation correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anlowee Thanks for confirming! You're absolutely right - the URL_HASH does need the SHA256= prefix for proper checksum validation. This will help ensure the integrity of the downloaded date library dependency.
✏️ Learnings added
Learnt from: anlowee
PR: y-scope/velox#14
File: CMake/resolve_dependency_modules/date.cmake:32-36
Timestamp: 2025-07-03T17:36:47.311Z
Learning: In CMake FetchContent_Declare, the URL_HASH argument requires an algorithm prefix (e.g., SHA256=) followed by the checksum value. Without this prefix, CMake silently skips checksum validation, which defeats supply-chain protection.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
velox_resolve_dependency adds SHA256. Manually adding it will result in errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
CMake/resolve_dependency_modules/nlohmann_json.cmake (1)
27-31
:URL_HASH
still lacks algorithm prefixSame issue as previously reported; the checksum must be prefixed with
SHA256=
to satisfy CMake’sURL_HASH
format.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
CMake/resolve_dependency_modules/antlr4-runtime.cmake
(1 hunks)CMake/resolve_dependency_modules/nlohmann_json.cmake
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (1)
CMake/resolve_dependency_modules/antlr4-runtime.cmake (1)
19-24
: Double-check variable-name consistency
velox_resolve_dependency_url(ANTLR4_RUNTIME)
traditionally sets/expectsVELOX_ANTLR4_RUNTIME_URL
, yet this script later referencesVELOX_ANTLR4_RUNTIME_SOURCE_URL
. If the helper macro has not been updated accordingly, build-time substitution will fail. Please verify that both the macro and this file use the same canonical variable name across all dependency modules.
antlr4-runtime | ||
URL ${VELOX_ANTLR4_RUNTIME_SOURCE_URL} | ||
URL_HASH ${VELOX_ANTLR4_RUNTIME_BUILD_SHA256_CHECKSUM} | ||
SOURCE_SUBDIR runtime/Cpp OVERRIDE_FIND_PACKAGE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URL_HASH
missing required algorithm prefix
CMake ≥ 3.24 requires the hash value to be prefixed with the algorithm (e.g., SHA256=
). Without it the download fails the integrity check.
FetchContent_Declare(
antlr4-runtime
URL ${VELOX_ANTLR4_RUNTIME_SOURCE_URL}
- URL_HASH ${VELOX_ANTLR4_RUNTIME_BUILD_SHA256_CHECKSUM}
+ URL_HASH SHA256=${VELOX_ANTLR4_RUNTIME_BUILD_SHA256_CHECKSUM}
SOURCE_SUBDIR runtime/Cpp OVERRIDE_FIND_PACKAGE)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
antlr4-runtime | |
URL ${VELOX_ANTLR4_RUNTIME_SOURCE_URL} | |
URL_HASH ${VELOX_ANTLR4_RUNTIME_BUILD_SHA256_CHECKSUM} | |
SOURCE_SUBDIR runtime/Cpp OVERRIDE_FIND_PACKAGE) | |
FetchContent_Declare( | |
antlr4-runtime | |
URL ${VELOX_ANTLR4_RUNTIME_SOURCE_URL} | |
URL_HASH SHA256=${VELOX_ANTLR4_RUNTIME_BUILD_SHA256_CHECKSUM} | |
SOURCE_SUBDIR runtime/Cpp OVERRIDE_FIND_PACKAGE) |
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/antlr4-runtime.cmake around lines 28 to 31,
the URL_HASH value is missing the required algorithm prefix. Update the URL_HASH
to include the prefix "SHA256=" before the checksum value to comply with CMake ≥
3.24 requirements and ensure the integrity check passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. Our min CMaker version is 3.28, so this might be an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anlowee Thank you for confirming! The SHA256=
prefix will ensure the integrity check works correctly across different CMake versions. This fix will prevent potential build failures when using newer CMake versions.
/| /|
( :v: )
|(_)|
| |
/ \
/ \
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
CMake/resolve_dependency_modules/msgpack-cxx.cmake
(1 hunks)velox/docs/configs.rst
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (1)
CMake/resolve_dependency_modules/msgpack-cxx.cmake (1)
16-21
: ```shell
#!/bin/bashDownload the tarball and compute SHA256 using available tools
curl -sL "https://github.com/msgpack/msgpack-c/archive/refs/tags/cpp-7.0.0.tar.gz" -o /tmp/msgpack.tar.gz
if command -v openssl >/dev/null; then
openssl dgst -sha256 /tmp/msgpack.tar.gz
elif command -v shasum >/dev/null; then
shasum -a 256 /tmp/msgpack.tar.gz
elif command -v python3 >/dev/null; then
python3 - << 'EOF'
import hashlib
with open('/tmp/msgpack.tar.gz', 'rb') as f:
print(hashlib.sha256(f.read()).hexdigest())
EOF
else
echo "Error: No tool available to compute SHA256"
exit 1
fi</details> </blockquote></details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
* - clp.storage-type | ||
- string | ||
- FS | ||
- The storage type of CLP archives, **Allowed values:** ``FS``, ``S3``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Clarify value semantics and case-sensitivity
To avoid user confusion, please add a brief note on whether the clp.storage-type
values are case-sensitive and what behaviour to expect when an unrecognised value is supplied (e.g., fallback to default vs. startup failure). A single sentence in the Description cell would suffice.
🤖 Prompt for AI Agents
In velox/docs/configs.rst around lines 913 to 916, add a sentence in the
Description cell for clp.storage-type clarifying whether the allowed values "FS"
and "S3" are case-sensitive and specify the behavior if an unrecognized value is
provided, such as whether the system falls back to a default or fails to start.
This will help users understand the expected input format and error handling.
velox_resolve_dependency_url(MSGPACK) | ||
|
||
message(STATUS "Building msgpack-cxx from source") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Status message may be misleading after URL override.
velox_resolve_dependency_url(MSGPACK)
can rewrite the URL to a mirror or
pre-downloaded cache, yet the unconditional
message(STATUS "Building msgpack-cxx from source")
follows immediately. If a binary package is later introduced (or a system
package is found), the log would still announce a source build, confusing
developers.
Consider printing the message only when the resolver decided to build from
source, or phrase it more generically (“Using msgpack-cxx from
${VELOX_MSGPACK_SOURCE_URL}”).
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/msgpack-cxx.cmake around lines 23 to 25, the
status message always states "Building msgpack-cxx from source" immediately
after calling velox_resolve_dependency_url(MSGPACK), which may be misleading if
the URL was overridden or a binary package is used later. Modify the code to
print the message conditionally only when the build is actually from source, or
change the message to a more generic form like "Using msgpack-cxx from
${VELOX_MSGPACK_SOURCE_URL}" to accurately reflect the source or binary usage.
FetchContent_Declare( | ||
msgpack-cxx | ||
URL ${VELOX_MSGPACK_SOURCE_URL} | ||
URL_HASH ${VELOX_MSGPACK_BUILD_SHA256_CHECKSUM} | ||
OVERRIDE_FIND_PACKAGE EXCLUDE_FROM_ALL SYSTEM) | ||
|
||
set(MSGPACK_USE_BOOST OFF) | ||
|
||
FetchContent_MakeAvailable(msgpack-cxx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Expose an imported target for downstream linking convenience.
Many third-party projects expect to link against an msgpackc
or
msgpack-cxx
target created by the upstream CMake. After
FetchContent_MakeAvailable
, that target lives inside the
msgpack-cxx
sub-build and is not guaranteed to be globally visible.
Add:
# Ensure an ALIAS exists at the top level.
if (NOT TARGET msgpackc::msgpackc AND TARGET msgpackc)
add_library(msgpackc::msgpackc ALIAS msgpackc)
endif()
This small alias spares every dependent Velox component from writing custom
find_package_handle
code.
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/msgpack-cxx.cmake around lines 27 to 35, the
imported target for msgpack-cxx is not globally visible after
FetchContent_MakeAvailable, which complicates downstream linking. To fix this,
add an alias target at the top level by checking if msgpackc::msgpackc does not
exist but msgpackc does, then create an ALIAS target named msgpackc::msgpackc
pointing to msgpackc. This will make the target globally accessible for
dependent projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
CMakeLists.txt (1)
178-187
: Flag override is OK but brittleThe manual reset to
OFF
works, yet it breaks as soon as a third “minimal” preset is added and someone forgets to touch this line.
Consider centralising connector-option resets (e.g., iterate over a list) to avoid future drift.
No immediate change required.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
CMakeLists.txt
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (1)
CMakeLists.txt (1)
464-470
: Boost component additions look good
headers
,iostreams
, andurl
are harmless header-only (or thin) targets. Nothing else to flag.
option(VELOX_ENABLE_CLP_CONNECTOR "Build CLP connector." ON) | ||
option(VELOX_ENABLE_HIVE_CONNECTOR "Build Hive connector." ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing compile-time define for the new flag
Other feature toggles (S3
, ABFS
, PARQUET
, …) inject a -DVELOX_ENABLE_*
definition immediately after the option is parsed.
VELOX_ENABLE_CLP_CONNECTOR
never sets such a macro, so C++ sources cannot conditionally compile on it.
option(VELOX_ENABLE_CLP_CONNECTOR "Build CLP connector." ON)
+
+if(VELOX_ENABLE_CLP_CONNECTOR)
+ add_definitions(-DVELOX_ENABLE_CLP_CONNECTOR)
+endif()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
option(VELOX_ENABLE_CLP_CONNECTOR "Build CLP connector." ON) | |
option(VELOX_ENABLE_HIVE_CONNECTOR "Build Hive connector." ON) | |
option(VELOX_ENABLE_CLP_CONNECTOR "Build CLP connector." ON) | |
if(VELOX_ENABLE_CLP_CONNECTOR) | |
add_definitions(-DVELOX_ENABLE_CLP_CONNECTOR) | |
endif() | |
option(VELOX_ENABLE_HIVE_CONNECTOR "Build Hive connector." ON) |
🤖 Prompt for AI Agents
In CMakeLists.txt around lines 130 to 131, the option VELOX_ENABLE_CLP_CONNECTOR
is defined but does not set a corresponding compile-time macro. To fix this, add
a check after the option definition that appends -DVELOX_ENABLE_CLP_CONNECTOR to
the compiler flags if the option is ON, similar to how other feature toggles
like S3 or ABFS do it. This will enable conditional compilation in the C++
sources based on this flag.
if(${VELOX_ENABLE_CLP_CONNECTOR}) | ||
set(ystdlib_cpp_SOURCE BUNDLED) | ||
velox_resolve_dependency(ystdlib_cpp) | ||
|
||
velox_set_source(absl) | ||
velox_resolve_dependency(absl) | ||
|
||
velox_set_source(nlohmann_json) | ||
velox_resolve_dependency(nlohmann_json) | ||
|
||
velox_set_source(simdjson) | ||
velox_resolve_dependency(simdjson) | ||
|
||
velox_set_source(date) | ||
velox_resolve_dependency(date) | ||
|
||
velox_set_source(spdlog) | ||
velox_resolve_dependency(spdlog) | ||
|
||
velox_set_source(msgpack-cxx) | ||
velox_resolve_dependency(msgpack-cxx) | ||
|
||
velox_set_source(antlr4-runtime) | ||
velox_resolve_dependency(antlr4-runtime) | ||
|
||
set(clp_SOURCE BUNDLED) | ||
velox_resolve_dependency(clp) | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate dependency resolution may lead to version skew
absl
and simdjson
are resolved twice: once in this CLP block and again later in the global section (lines 576-586).
If the two calls pick different default versions or options, CMake will error or, worse, silently choose one, causing ODR/ABI issues.
At minimum, guard the re-resolves:
- velox_set_source(absl)
- velox_resolve_dependency(absl)
+ if(NOT TARGET absl::absl)
+ velox_set_source(absl)
+ velox_resolve_dependency(absl)
+ endif()
- velox_set_source(simdjson)
- velox_resolve_dependency(simdjson)
+ if(NOT TARGET simdjson::simdjson)
+ velox_set_source(simdjson)
+ velox_resolve_dependency(simdjson)
+ endif()
The same pattern applies to every library already handled elsewhere.
Alternatively, remove them here and rely on the later unconditional resolve block.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if(${VELOX_ENABLE_CLP_CONNECTOR}) | |
set(ystdlib_cpp_SOURCE BUNDLED) | |
velox_resolve_dependency(ystdlib_cpp) | |
velox_set_source(absl) | |
velox_resolve_dependency(absl) | |
velox_set_source(nlohmann_json) | |
velox_resolve_dependency(nlohmann_json) | |
velox_set_source(simdjson) | |
velox_resolve_dependency(simdjson) | |
velox_set_source(date) | |
velox_resolve_dependency(date) | |
velox_set_source(spdlog) | |
velox_resolve_dependency(spdlog) | |
velox_set_source(msgpack-cxx) | |
velox_resolve_dependency(msgpack-cxx) | |
velox_set_source(antlr4-runtime) | |
velox_resolve_dependency(antlr4-runtime) | |
set(clp_SOURCE BUNDLED) | |
velox_resolve_dependency(clp) | |
endif() | |
if(${VELOX_ENABLE_CLP_CONNECTOR}) | |
set(ystdlib_cpp_SOURCE BUNDLED) | |
velox_resolve_dependency(ystdlib_cpp) | |
if(NOT TARGET absl::absl) | |
velox_set_source(absl) | |
velox_resolve_dependency(absl) | |
endif() | |
velox_set_source(nlohmann_json) | |
velox_resolve_dependency(nlohmann_json) | |
if(NOT TARGET simdjson::simdjson) | |
velox_set_source(simdjson) | |
velox_resolve_dependency(simdjson) | |
endif() | |
velox_set_source(date) | |
velox_resolve_dependency(date) | |
velox_set_source(spdlog) | |
velox_resolve_dependency(spdlog) | |
velox_set_source(msgpack-cxx) | |
velox_resolve_dependency(msgpack-cxx) | |
velox_set_source(antlr4-runtime) | |
velox_resolve_dependency(antlr4-runtime) | |
set(clp_SOURCE BUNDLED) | |
velox_resolve_dependency(clp) | |
endif() |
🤖 Prompt for AI Agents
In CMakeLists.txt between lines 526 and 553, the dependencies absl and simdjson
(and others) are resolved again although they are already resolved later in
lines 576-586, which risks version conflicts. To fix this, remove the duplicate
velox_set_source and velox_resolve_dependency calls for libraries already
resolved globally, or add conditional guards to skip resolving them again in
this block. This ensures each dependency is resolved only once to avoid CMake
errors or runtime issues.
b8fc0b2
to
e9f9046
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
CMake/resolve_dependency_modules/boost/CMakeLists.txt
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
URL_HASH ${VELOX_BOOST_BUILD_SHA256_CHECKSUM} | ||
OVERRIDE_FIND_PACKAGE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the algorithm prefix to URL_HASH
to avoid CMake fetch failures
FetchContent_Declare()
expects URL_HASH
in the form SHA256=<hash>
(or MD5=<hash>
, etc.).
Passing only the raw hex string causes CMake to error out with “Invalid hash string” on 3.14+.
-set(VELOX_BOOST_BUILD_SHA256_CHECKSUM
- 4d27e9efed0f6f152dc28db6430b9d3dfb40c0345da7342eaa5a987dde57bd95)
+set(VELOX_BOOST_BUILD_SHA256_CHECKSUM
+ SHA256=4d27e9efed0f6f152dc28db6430b9d3dfb40c0345da7342eaa5a987dde57bd95)
With that change the subsequent URL_HASH ${VELOX_BOOST_BUILD_SHA256_CHECKSUM}
line remains valid.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
URL_HASH ${VELOX_BOOST_BUILD_SHA256_CHECKSUM} | |
OVERRIDE_FIND_PACKAGE) | |
# CMake/resolve_dependency_modules/boost/CMakeLists.txt | |
# Ensure the checksum is prefixed with the algorithm name | |
set(VELOX_BOOST_BUILD_SHA256_CHECKSUM | |
SHA256=4d27e9efed0f6f152dc28db6430b9d3dfb40c0345da7342eaa5a987dde57bd95) | |
FetchContent_Declare( | |
VELOX_BOOST_BUILD | |
URL ${VELOX_BOOST_BUILD_URL} | |
URL_HASH ${VELOX_BOOST_BUILD_SHA256_CHECKSUM} | |
OVERRIDE_FIND_PACKAGE) |
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/boost/CMakeLists.txt around lines 45 to 46,
the URL_HASH value is set without specifying the hash algorithm prefix, causing
CMake fetch failures. Fix this by prefixing the hash string with "SHA256=" so
that URL_HASH is set as "SHA256=${VELOX_BOOST_BUILD_SHA256_CHECKSUM}", ensuring
compatibility with CMake 3.14 and later.
…ger-overflow (facebookincubator#13831) Summary: Pull Request resolved: facebookincubator#13831 This avoids the following errors: ``` fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56:41: runtime error: negation of -9223372036854775808 cannot be represented in type 'long'; cast to an unsigned type to negate this value to itself #0 0x000000346ce5 in std::abs(long) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56 #1 0x000000345879 in std::shared_ptr<facebook::velox::BiasVector<facebook::velox::test::EvalTypeHelper<long>::Type>> facebook::velox::test::VectorMaker::biasVector<long>(std::vector<std::optional<long>, std::allocator<std::optional<long>>> const&) fbcode/velox/vector/tests/utils/VectorMaker-inl.h:58 #2 0x000000344d34 in facebook::velox::test::BiasVectorErrorTest::errorTest(std::vector<std::optional<long>, std::allocator<std::optional<long>>>) fbcode/velox/vector/tests/BiasVectorTest.cpp:39 #3 0x00000033ec99 in facebook::velox::test::BiasVectorErrorTest_checkRangeTooLargeError_Test::TestBody() fbcode/velox/vector/tests/BiasVectorTest.cpp:44 #4 0x7fe0a2342c46 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) fbsource/src/gtest.cc:2727 #5 0x7fe0a234275d in testing::Test::Run() fbsource/src/gtest.cc:2744 #6 0x7fe0a2345fb3 in testing::TestInfo::Run() fbsource/src/gtest.cc:2890 #7 0x7fe0a234c8eb in testing::TestSuite::Run() fbsource/src/gtest.cc:3068 #8 0x7fe0a237b52b in testing::internal::UnitTestImpl::RunAllTests() fbsource/src/gtest.cc:6059 #9 0x7fe0a237a0a2 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) fbsource/src/gtest.cc:2727 #10 0x7fe0a23797f5 in testing::UnitTest::Run() fbsource/src/gtest.cc:5599 #11 0x7fe0a2239800 in RUN_ALL_TESTS() fbsource/gtest/gtest.h:2334 #12 0x7fe0a223952c in main fbcode/common/gtest/LightMain.cpp:20 #13 0x7fe09ec2c656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #14 0x7fe09ec2c717 in __libc_start_main@GLIBC_2.2.5 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3 #15 0x00000033d8b0 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116 UndefinedBehaviorSanitizer: signed-integer-overflow fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56:41 ``` Avoid overflow by using the expression (static_cast<uint64_t>(1) + ~static_cast<uint64_t>(min)) to calculate the absolute value of min without using std::abs Reviewed By: dmm-fb, peterenescu Differential Revision: D76901449 fbshipit-source-id: 7eb3bd0f83e42f44cdf34ea1759f3aa9e1042dae
13d1585
to
a8c3dea
Compare
GIT_REPOSITORY https://github.com/y-scope/ystdlib-cpp.git | ||
GIT_TAG 0ae886c6a7ee706a3c6e1950262b63d72f71fe63) | ||
|
||
FetchContent_Populate(ystdlib_cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we upgrade clp to install ystdlib as a package, we should be able to use MakeAvailable instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (10)
velox/connectors/clp/ClpConfig.h (2)
34-34
: Add documentation for the configuration key.Consider documenting what this configuration key controls for better maintainability.
+ // Configuration key to specify the storage backend type (FS or S3) static constexpr const char* kStorageType = "clp.storage-type";
36-39
: Optimize constructor with member initializer list.The constructor can be more efficient by using a member initializer list.
- explicit ClpConfig(std::shared_ptr<const config::ConfigBase> config) { - VELOX_CHECK_NOT_NULL(config, "Config is null for CLP initialization"); - config_ = std::move(config); - } + explicit ClpConfig(std::shared_ptr<const config::ConfigBase> config) + : config_(std::move(config)) { + VELOX_CHECK_NOT_NULL(config_, "Config is null for CLP initialization"); + }velox/docs/develop/connectors.rst (1)
146-146
: Fix missing newline at end of file.The pipeline failure indicates a missing newline character.
decode and load data as needed during execution. +
velox/connectors/clp/search_lib/ClpVectorLoader.h (3)
19-21
: Add missing include headers.The file uses standard library types and simdjson without explicitly including their headers.
#include "velox/connectors/clp/search_lib/ClpCursor.h" #include "velox/vector/LazyVector.h" +#include <memory> +#include <vector> +#include <simdjson.h>
33-36
: Document ownership semantics for the raw pointer parameter.The raw pointer parameter needs clear documentation about ownership and lifetime requirements.
+ /// @param columnReader Non-owning pointer to a CLP column reader. The caller + /// must ensure the reader remains valid for the lifetime + /// of this ClpVectorLoader instance. + /// @param nodeType The column type being loaded + /// @param filteredRowIndices Shared pointer to filtered row indices ClpVectorLoader( clp_s::BaseColumnReader* columnReader, ColumnType nodeType, std::shared_ptr<std::vector<uint64_t>> filteredRowIndices);
39-46
: Add documentation for private methods.The private methods should be documented to improve code maintainability.
+ /// Implements the VectorLoader interface to load data from CLP column readers + /// into Velox vectors based on the filtered row indices. void loadInternal( RowSet rows, ValueHook* hook, vector_size_t resultSize, VectorPtr* result) override; + /// Template helper to populate typed vector data from the column reader + /// @tparam T The data type being populated + /// @tparam VectorPtr The specific vector type (e.g., FlatVector<T>) template <typename T, typename VectorPtr> void populateData(RowSet rows, VectorPtr vector);velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
71-73
: Use const shared_ptr for filteredRowIndices parameterSince the ownership of
filteredRowIndices
is transferred toClpVectorLoader
, the parameter should be const to clarify ownership semantics and prevent accidental modifications.Apply this diff:
uint64_t fetchNext( uint64_t numRows, - const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices); + const std::shared_ptr<std::vector<uint64_t>> filteredRowIndices);velox/connectors/clp/ClpDataSource.cpp (2)
121-121
: Remove reference from const shared_ptr parameterSince the ownership of
filteredRows
is transferred toClpVectorLoader
, the parameter should be passed by value.Apply this diff:
- const std::shared_ptr<std::vector<uint64_t>>& filteredRows, + const std::shared_ptr<std::vector<uint64_t>> filteredRows,
103-115
: Add error handling for cursor initialization and query executionThe method creates a cursor and executes a query without checking for errors. This could lead to undefined behaviour if either operation fails.
Apply this diff to add proper error handling:
void ClpDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) { auto clpSplit = std::dynamic_pointer_cast<ClpConnectorSplit>(split); + VELOX_CHECK_NOT_NULL(clpSplit, "Invalid split type for CLP connector"); if (storageType_ == ClpConfig::StorageType::kFs) { cursor_ = std::make_unique<search_lib::ClpCursor>( clp_s::InputSource::Filesystem, clpSplit->path_); } else if (storageType_ == ClpConfig::StorageType::kS3) { cursor_ = std::make_unique<search_lib::ClpCursor>( clp_s::InputSource::Network, clpSplit->path_); + } else { + VELOX_USER_FAIL("Unsupported storage type"); } + VELOX_CHECK_NOT_NULL(cursor_, "Failed to create CLP cursor"); + try { cursor_->executeQuery(kqlQuery_, fields_); + } catch (const std::exception& e) { + VELOX_USER_FAIL("Failed to execute query '{}': {}", kqlQuery_, e.what()); + } }velox/connectors/clp/ClpDataSource.h (1)
98-99
: Remove unused member variableThe
columnUntypedNames_
member variable is declared but never used in the implementation.Apply this diff:
velox::memory::MemoryPool* pool_; RowTypePtr outputType_; - std::set<std::string> columnUntypedNames_; uint64_t completedRows_{0};
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (9)
.github/workflows/linux-build-base.yml
(1 hunks)velox/connectors/clp/ClpConfig.cpp
(1 hunks)velox/connectors/clp/ClpConfig.h
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(1 hunks)velox/connectors/clp/ClpDataSource.h
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.h
(1 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.h
(1 hunks)velox/docs/develop/connectors.rst
(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
velox/connectors/clp/ClpDataSource.h (1)
Learnt from: anlowee
PR: y-scope/velox#1
File: CMakeLists.txt:458-483
Timestamp: 2025-03-10T05:55:18.722Z
Learning: When contributing to the Velox project, prefer adding new code on top of existing scripts rather than restructuring the original code organization, especially for dependency management.
🧬 Code Graph Analysis (1)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/ClpConnector.h (1)
outputType
(39-45)
🪛 GitHub Actions: Run Checks
velox/docs/develop/connectors.rst
[error] 146-146: end-of-file-fixer hook failed: missing newline at end of file. The hook fixed the file by adding a newline.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
- GitHub Check: Build
🔇 Additional comments (1)
velox/connectors/clp/ClpConfig.cpp (1)
25-40
: LGTM!The implementation correctly handles case-insensitive string to enum conversion and provides appropriate error handling for unsupported storage types.
- name: Build Artifact | ||
env: | ||
VELOX_DEPENDENCY_SOURCE: BUNDLED | ||
Boost_SOURCE: SYSTEM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Environment variable likely ignored – use all-caps BOOST_SOURCE
instead of Boost_SOURCE
.
resolve_dependency.cmake
converts package names to upper-case (e.g., Boost
→ BOOST
) and then looks for BOOST_SOURCE
.
With the current mixed-case key, Boost will silently fall back to the default (BUNDLED
), negating the intent of using the system-installed libraries and increasing build time.
- Boost_SOURCE: SYSTEM
+ BOOST_SOURCE: SYSTEM
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Boost_SOURCE: SYSTEM | |
BOOST_SOURCE: SYSTEM |
🤖 Prompt for AI Agents
In .github/workflows/linux-build-base.yml at line 55, the environment variable
key is incorrectly cased as Boost_SOURCE, which is ignored by the build system.
Change the key to all uppercase BOOST_SOURCE to ensure the system-installed
Boost libraries are used instead of the default bundled ones, preventing
unnecessary build time increases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
velox/docs/develop/connectors.rst
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
Velox provides Hive, CLP, and TPC-H Connectors out of the box. | ||
Let's examine the implementation details of both the Hive and CLP Connectors as examples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Prefer lowercase “connectors” for consistency with doc style
Throughout the document, common nouns aren’t capitalised (e.g., “storage adapters”). Sticking to that pattern improves consistency and helps avoid accidental emphasis.
-Velox provides Hive, CLP, and TPC-H Connectors out of the box.
-Let's examine the implementation details of both the Hive and CLP Connectors as examples.
+Velox provides Hive, CLP, and TPC-H connectors out of the box.
+Let's examine the implementation details of the Hive and CLP connectors as examples.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Velox provides Hive, CLP, and TPC-H Connectors out of the box. | |
Let's examine the implementation details of both the Hive and CLP Connectors as examples. | |
Velox provides Hive, CLP, and TPC-H connectors out of the box. | |
Let's examine the implementation details of the Hive and CLP connectors as examples. |
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 32 to 33, the word
"Connectors" is capitalized inconsistently with the document style. Change
"Connectors" to lowercase "connectors" in both instances to maintain consistency
with the style used for common nouns throughout the document.
The CLP Connector is used to read CLP archives stored on a local file system or S3. It implements similar | ||
interfaces as the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` | ||
interface and the ``ConnectorSplit`` interface implementation since `Connector` and ``ConnectorFactory`` are | ||
similar to the Hive Connector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Tighten grammar and unify inline-code markup
Small wording tweak plus consistent use of double back-ticks for inline class names.
-The CLP Connector is used to read CLP archives stored on a local file system or S3. It implements similar
-interfaces as the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource``
-interface and the ``ConnectorSplit`` interface implementation since `Connector` and ``ConnectorFactory`` are
-similar to the Hive Connector.
+The CLP connector reads CLP archives stored on a local file system or S3. It implements interfaces similar
+to those of the Hive connector, except for the ``DataSink`` interface. Here we describe only the ``DataSource``
+and ``ConnectorSplit`` interface implementations, since ``Connector`` and ``ConnectorFactory`` behave
+similarly to their Hive counterparts.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The CLP Connector is used to read CLP archives stored on a local file system or S3. It implements similar | |
interfaces as the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` | |
interface and the ``ConnectorSplit`` interface implementation since `Connector` and ``ConnectorFactory`` are | |
similar to the Hive Connector. | |
The CLP connector reads CLP archives stored on a local file system or S3. It implements interfaces similar | |
to those of the Hive connector, except for the ``DataSink`` interface. Here we describe only the ``DataSource`` | |
and ``ConnectorSplit`` interface implementations, since ``Connector`` and ``ConnectorFactory`` behave | |
similarly to their Hive counterparts. |
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 127 to 130, improve the
grammar and ensure consistent use of double back-ticks for inline class names.
Adjust the wording slightly for clarity and replace any single back-ticks around
class names with double back-ticks to unify the markup style throughout the
paragraph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (2)
24-26
: Add blank line before method definition.namespace facebook::velox::connector::clp::search_lib { + void ClpQueryRunner::init(
64-74
: Fix variable naming consistency.The variable
rowsfiltered
should follow camelCase naming convention to be consistent with the codebase style.- size_t rowsfiltered{0}; + size_t rowsFiltered{0}; size_t rowsScanned{0}; while (curMessage_ < numMessages_) { if (filter(curMessage_)) { filteredRowIndices->emplace_back(curMessage_); - rowsfiltered += 1; + rowsFiltered += 1; } curMessage_ += 1; rowsScanned += 1; - if (rowsfiltered >= numRows) { + if (rowsFiltered >= numRows) {velox/connectors/clp/ClpConnector.cpp (1)
52-60
: Consolidate TimestampPattern initialization to avoid duplication.Both constructors call
clp_s::TimestampPattern::init()
. Consider using constructor delegation or extracting the initialization to a private method.Using constructor delegation:
ClpConnectorFactory::ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) { clp_s::TimestampPattern::init(); } ClpConnectorFactory::ClpConnectorFactory(const char* connectorName) - : ConnectorFactory(connectorName) { - clp_s::TimestampPattern::init(); -} + : ConnectorFactory(connectorName) { + // Delegate initialization to a common method + initializeTimestampPattern(); +} + +void ClpConnectorFactory::initializeTimestampPattern() { + static std::once_flag initFlag; + std::call_once(initFlag, []() { + clp_s::TimestampPattern::init(); + }); +}Note: Using
std::once_flag
ensures thread-safe initialization ifTimestampPattern::init()
should only be called once globally.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
velox/connectors/clp/ClpConnector.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.cpp
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (2)
velox/connectors/clp/ClpConnector.cpp (2)
24-28
: Constructor implementation looks good!The use of
std::move
for the config parameter is appropriate since it's passed by value. This follows C++ best practices for efficient ownership transfer.
29-42
: Data source creation is properly implemented.The method correctly instantiates
ClpDataSource
with all required parameters and returns it as a unique pointer.
std::unique_ptr<DataSink> ClpConnector::createDataSink( | ||
RowTypePtr inputType, | ||
std::shared_ptr<ConnectorInsertTableHandle> connectorInsertTableHandle, | ||
ConnectorQueryCtx* connectorQueryCtx, | ||
CommitStrategy commitStrategy) { | ||
VELOX_NYI("createDataSink for ClpConnector is not implemented!"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Mark unused parameters to avoid compiler warnings.
Since this method is not yet implemented, consider marking the parameters as unused to avoid compiler warnings.
std::unique_ptr<DataSink> ClpConnector::createDataSink(
- RowTypePtr inputType,
- std::shared_ptr<ConnectorInsertTableHandle> connectorInsertTableHandle,
- ConnectorQueryCtx* connectorQueryCtx,
- CommitStrategy commitStrategy) {
+ RowTypePtr /*inputType*/,
+ std::shared_ptr<ConnectorInsertTableHandle> /*connectorInsertTableHandle*/,
+ ConnectorQueryCtx* /*connectorQueryCtx*/,
+ CommitStrategy /*commitStrategy*/) {
VELOX_NYI("createDataSink for ClpConnector is not implemented!");
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
std::unique_ptr<DataSink> ClpConnector::createDataSink( | |
RowTypePtr inputType, | |
std::shared_ptr<ConnectorInsertTableHandle> connectorInsertTableHandle, | |
ConnectorQueryCtx* connectorQueryCtx, | |
CommitStrategy commitStrategy) { | |
VELOX_NYI("createDataSink for ClpConnector is not implemented!"); | |
} | |
std::unique_ptr<DataSink> ClpConnector::createDataSink( | |
RowTypePtr /*inputType*/, | |
std::shared_ptr<ConnectorInsertTableHandle> /*connectorInsertTableHandle*/, | |
ConnectorQueryCtx* /*connectorQueryCtx*/, | |
CommitStrategy /*commitStrategy*/) { | |
VELOX_NYI("createDataSink for ClpConnector is not implemented!"); | |
} |
🤖 Prompt for AI Agents
In velox/connectors/clp/ClpConnector.cpp around lines 44 to 50, the
createDataSink method is not implemented and its parameters are unused, which
can cause compiler warnings. To fix this, mark all the parameters as unused by
casting them to void or using an equivalent method to explicitly indicate they
are unused, preventing compiler warnings while keeping the method stub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (6)
velox/connectors/clp/search_lib/ClpCursor.h (1)
85-87
: Consider using non-const reference instead of shared_ptr for filteredRowIndices.Since
filteredRowIndices
is an in/out parameter that gets modified, usingstd::vector<uint64_t>&
would be more appropriate thanshared_ptr
. This avoids unnecessary shared ownership semantics.uint64_t fetchNext( uint64_t numRows, - const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices); + std::vector<uint64_t>& filteredRowIndices);velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
70-72
: Consider using non-const reference instead of shared_ptr for filteredRowIndices.Consistent with the comment on
ClpCursor::fetchNext
, usingstd::vector<uint64_t>&
would be more appropriate for this in/out parameter.uint64_t fetchNext( uint64_t numRows, - const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices); + std::vector<uint64_t>& filteredRowIndices);velox/connectors/clp/search_lib/ClpCursor.cpp (4)
76-85
: Fix error code handling in schema iteration.Setting
errorCode_
inside the loop gets overwritten in subsequent iterations. Consider tracking schema-specific errors without overwriting the global error state.if (EvaluatedValue::False == queryRunner_->schema_init(currentSchemaId_)) { + LOG(WARNING) << "Dictionary not found for schema " << currentSchemaId_; currentSchemaIndex_ += 1; currentSchemaTableLoaded_ = false; - errorCode_ = ErrorCode::DictionaryNotFound; continue; }
130-148
: Refactor assignments inside if conditions for clarity.Separate the assignment from the condition check to improve readability.
OrOfAndForm standardizePass; - if (expr_ = standardizePass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + expr_ = standardizePass.run(expr_); + if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { VLOG(2) << "Query '" << query_ << "' is logically false"; return ErrorCode::LogicalError; } NarrowTypes narrowPass; - if (expr_ = narrowPass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + expr_ = narrowPass.run(expr_); + if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { VLOG(2) << "Query '" << query_ << "' is logically false"; return ErrorCode::LogicalError; } ConvertToExists convertPass; - if (expr_ = convertPass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + expr_ = convertPass.run(expr_); + if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { VLOG(2) << "Query '" << query_ << "' is logically false"; return ErrorCode::LogicalError; }
153-255
: Consider refactoringloadArchive
into smaller methods.This method handles multiple responsibilities and is over 100 lines long. Consider breaking it down into focused methods.
Extract logical sections into private methods:
openArchiveWithAuth()
- lines 154-164evaluateTimestampConstraints()
- lines 170-174matchSchemas()
- lines 176-181, 232-240setupProjection()
- lines 183-227loadDictionaries()
- lines 248-250
242-246
: Remove duplicate timestamp evaluation.This timestamp evaluation is redundant as the same check is already performed at lines 170-174.
- EvaluateTimestampIndex timestamp_index(timestampDict); - if (EvaluatedValue::False == timestamp_index.run(expr_)) { - VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; - return ErrorCode::InvalidTimestampRange; - }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
velox/connectors/clp/search_lib/ClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.h
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (5)
velox/connectors/clp/search_lib/ClpCursor.h (2)
41-50
: ErrorCode enum is well-structured.The error codes cover the necessary states for query execution and follow consistent naming conventions.
52-58
: ColumnType enum and Field struct are well-designed.Good use of explicit value for Unknown state to clearly distinguish it from valid types.
velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
45-52
: Constructor is properly implemented.The constructor correctly initializes the base class and stores the projection for later use.
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
36-46
: Constructor and destructor are properly implemented.Good practice to close the archive in the destructor and initialize error code to QueryNotInitialized.
107-114
: Well-implemented getter with proper null handling.Good use of static const empty vector for the null case to avoid creating temporary objects.
switch (column.type) { | ||
case ColumnType::String: | ||
columnDescriptor->set_matching_types( | ||
LiteralType::ClpStringT | LiteralType::VarStringT | | ||
LiteralType::EpochDateT); | ||
break; | ||
case ColumnType::Integer: | ||
columnDescriptor->set_matching_types(LiteralType::IntegerT); | ||
break; | ||
case ColumnType::Float: | ||
columnDescriptor->set_matching_types(LiteralType::FloatT); | ||
break; | ||
case ColumnType::Boolean: | ||
columnDescriptor->set_matching_types(LiteralType::BooleanT); | ||
break; | ||
case ColumnType::Array: | ||
columnDescriptor->set_matching_types(LiteralType::ArrayT); | ||
break; | ||
default: | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for unknown column types.
The default case in the switch statement silently ignores unknown column types. Consider logging or returning an error.
case ColumnType::Array:
columnDescriptor->set_matching_types(LiteralType::ArrayT);
break;
default:
+ VLOG(2) << "Unknown column type for column: '" << column.name << "'";
+ return ErrorCode::InternalError;
- break;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
switch (column.type) { | |
case ColumnType::String: | |
columnDescriptor->set_matching_types( | |
LiteralType::ClpStringT | LiteralType::VarStringT | | |
LiteralType::EpochDateT); | |
break; | |
case ColumnType::Integer: | |
columnDescriptor->set_matching_types(LiteralType::IntegerT); | |
break; | |
case ColumnType::Float: | |
columnDescriptor->set_matching_types(LiteralType::FloatT); | |
break; | |
case ColumnType::Boolean: | |
columnDescriptor->set_matching_types(LiteralType::BooleanT); | |
break; | |
case ColumnType::Array: | |
columnDescriptor->set_matching_types(LiteralType::ArrayT); | |
break; | |
default: | |
break; | |
} | |
switch (column.type) { | |
case ColumnType::String: | |
columnDescriptor->set_matching_types( | |
LiteralType::ClpStringT | LiteralType::VarStringT | | |
LiteralType::EpochDateT); | |
break; | |
case ColumnType::Integer: | |
columnDescriptor->set_matching_types(LiteralType::IntegerT); | |
break; | |
case ColumnType::Float: | |
columnDescriptor->set_matching_types(LiteralType::FloatT); | |
break; | |
case ColumnType::Boolean: | |
columnDescriptor->set_matching_types(LiteralType::BooleanT); | |
break; | |
case ColumnType::Array: | |
columnDescriptor->set_matching_types(LiteralType::ArrayT); | |
break; | |
default: | |
VLOG(2) << "Unknown column type for column: '" << column.name << "'"; | |
return ErrorCode::InternalError; | |
} |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpCursor.cpp around lines 198 to 218, the
switch statement on column.type lacks error handling for unknown types in the
default case. Modify the default case to log an error message or return an error
indicating an unrecognized column type to ensure such cases are not silently
ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (4)
velox/connectors/clp/search_lib/ClpVectorLoader.cpp (2)
44-54
: Add error handling for variant access and bounds checking
122-139
: Fix memory safety and error handling in array processingvelox/connectors/clp/search_lib/ClpCursor.h (2)
66-66
: Usestd::string_view
for thearchivePath
parameter
85-87
: Consider returning error status fromfetchNext
methodThe method currently returns only the number of rows scanned without indicating if an error occurred during execution. This makes error handling difficult for callers.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)velox/connectors/clp/search_lib/ClpVectorLoader.cpp
(1 hunks)
if constexpr (std::is_same_v<T, std::string>) { | ||
auto string_value = | ||
std::get<std::string>(columnReader_->extract_value(messageIndex)); | ||
vector->set(vectorIndex, StringView(string_value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix potential use-after-free with StringView
The string_value
variable goes out of scope while the StringView
still references its data, leading to undefined behavior.
if constexpr (std::is_same_v<T, std::string>) {
- auto string_value =
- std::get<std::string>(columnReader_->extract_value(messageIndex));
- vector->set(vectorIndex, StringView(string_value));
+ vector->set(vectorIndex, std::get<std::string>(columnReader_->extract_value(messageIndex)));
} else {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if constexpr (std::is_same_v<T, std::string>) { | |
auto string_value = | |
std::get<std::string>(columnReader_->extract_value(messageIndex)); | |
vector->set(vectorIndex, StringView(string_value)); | |
if constexpr (std::is_same_v<T, std::string>) { | |
vector->set( | |
vectorIndex, std::get<std::string>(columnReader_->extract_value(messageIndex))); | |
} else { | |
vector->set( | |
vectorIndex, std::get<T>(columnReader_->extract_value(messageIndex))); | |
} |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpVectorLoader.cpp around lines 47 to 50,
the StringView is constructed from a local std::string variable that goes out of
scope immediately after, causing a use-after-free. To fix this, ensure the
string data outlives the StringView by either storing the string persistently or
copying the string data into a buffer that remains valid for the lifetime of the
StringView before setting it on the vector.
Success | ||
}; | ||
|
||
enum class ColumnType { String, Integer, Float, Array, Boolean, Unknown = -1 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Consider making Unknown
the last enum value without explicit assignment
Having Unknown = -1
in an enum that otherwise uses default unsigned values can lead to issues when used as array indices or in switch statements. Consider moving Unknown
to the end without an explicit value assignment.
-enum class ColumnType { String, Integer, Float, Array, Boolean, Unknown = -1 };
+enum class ColumnType { String, Integer, Float, Array, Boolean, Unknown };
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
enum class ColumnType { String, Integer, Float, Array, Boolean, Unknown = -1 }; | |
enum class ColumnType { String, Integer, Float, Array, Boolean, Unknown }; |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpCursor.h at line 52, the enum class
ColumnType assigns Unknown the value -1, which can cause issues when used as
array indices or in switch statements. To fix this, move Unknown to be the last
enum value and remove the explicit assignment so it uses the default incremented
value, ensuring consistent and safe usage.
|
||
ErrorCode errorCode_; | ||
|
||
clp_s::InputSource inputSource_{clp_s::InputSource::Filesystem}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Remove redundant default initialization of inputSource_
The member variable inputSource_
is initialized with a default value here but is also set from the constructor parameter. This redundant initialization could be confusing and should be removed.
- clp_s::InputSource inputSource_{clp_s::InputSource::Filesystem};
+ clp_s::InputSource inputSource_;
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
clp_s::InputSource inputSource_{clp_s::InputSource::Filesystem}; | |
clp_s::InputSource inputSource_; |
🤖 Prompt for AI Agents
In the file velox/connectors/clp/search_lib/ClpCursor.h around line 108, the
member variable `inputSource_` is being initialized with a default value of
`clp_s::InputSource::Filesystem`, but it is also set from a constructor
parameter. This redundant initialization is unnecessary and could be confusing.
Remove the default initialization of `inputSource_` to simplify the code and
avoid potential confusion.
class Projection; | ||
} // namespace clp_s::search | ||
|
||
namespace lp_s::search::ast { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be clp_s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (1)
velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
33-35
: Namespace comment typo still present.The closing comment says
lp_s::search::ast
; it should beclp_s::search::ast
.
This is the exact issue flagged in earlier reviews and by the clang-format hook.-} // namespace lp_s::search::ast +} // namespace clp_s::search::ast
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
velox/connectors/clp/search_lib/ClpQueryRunner.h
(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Run Checks
velox/connectors/clp/search_lib/ClpQueryRunner.h
[error] 32-32: clang-format formatting error: incorrect namespace comment 'lp_s::search::ast' should be 'clp_s::search::ast'. The hook modified the file to fix this.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
std::shared_ptr<clp_s::search::ast::Expression> expr_; | ||
std::shared_ptr<clp_s::SchemaTree> schemaTree_; | ||
std::shared_ptr<clp_s::search::Projection> projection_; | ||
std::vector<clp_s::BaseColumnReader*> projectedColumns_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Member expr_
is never initialised or used.
expr_
is declared but not set in the constructor and appears unused, leading to dead code and potential confusion. Either remove it or initialise it and use it.
- projection_(projection) {}
+ expr_(expr),
+ projection_(projection) {}
If the member is genuinely unnecessary, delete it instead.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpQueryRunner.h around lines 81 to 84, the
member variable expr_ is declared but never initialized or used, resulting in
dead code. Review whether expr_ is needed; if it is, initialize it properly in
the constructor and use it as intended. Otherwise, remove the declaration of
expr_ entirely to clean up the code.
/// represent the columns involved in the scanning operation. | ||
const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() { | ||
return projectedColumns_; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Make accessor const
-correct.
getProjectedColumns
does not modify the object; mark the method const
to allow calls on const instances and signal intent.
- const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() {
+ const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/// represent the columns involved in the scanning operation. | |
const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() { | |
return projectedColumns_; | |
} | |
/// represent the columns involved in the scanning operation. | |
const std::vector<clp_s::BaseColumnReader*>& getProjectedColumns() const { | |
return projectedColumns_; | |
} |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpQueryRunner.h around lines 75 to 78, the
method getProjectedColumns should be marked as const since it does not modify
the object. Add the const qualifier at the end of the method declaration to make
it a const method, enabling calls on const instances and clarifying that it does
not alter object state.
#include "clp_s/search/QueryRunner.hpp" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Include missing standard headers.
std::unordered_map
, std::vector
, and std::shared_ptr
are used in this header but the corresponding headers are not included. Relying on transitive includes makes the file fragile and breaks IWYU guidelines.
+#include <memory>
+#include <unordered_map>
+#include <vector>
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
#include "clp_s/search/QueryRunner.hpp" | |
#include "clp_s/search/QueryRunner.hpp" | |
#include <memory> | |
#include <unordered_map> | |
#include <vector> |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpQueryRunner.h around lines 19 to 20, the
header uses std::unordered_map, std::vector, and std::shared_ptr but does not
include their respective standard headers. Add #include directives for
<unordered_map>, <vector>, and <memory> to explicitly include these standard
headers and comply with IWYU guidelines.
uint64_t fetchNext( | ||
uint64_t numRows, | ||
const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
fetchNext
should use override
and avoid heap-allocating the index vector.
- If this method is meant to override the base
QueryRunner::fetchNext
, mark it withoverride
to catch signature mismatches at compile-time. - Passing a
shared_ptr<std::vector<uint64_t>>
forces callers to heap-allocate and share ownership of a simple scratch buffer, which is unnecessary and hurts performance. A non-owning reference (e.g.,std::vector<uint64_t>&
) is sufficient.
- uint64_t fetchNext(
- uint64_t numRows,
- const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices);
+ uint64_t fetchNext(
+ uint64_t numRows,
+ std::vector<uint64_t>& filteredRowIndices) override;
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
uint64_t fetchNext( | |
uint64_t numRows, | |
const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices); | |
uint64_t fetchNext( | |
uint64_t numRows, | |
std::vector<uint64_t>& filteredRowIndices) override; |
🤖 Prompt for AI Agents
In velox/connectors/clp/search_lib/ClpQueryRunner.h around lines 70 to 72, the
fetchNext method should be marked with override to ensure it correctly overrides
the base class method and catch signature mismatches at compile time.
Additionally, change the parameter from a shared_ptr to a non-owning reference
to a vector (std::vector<uint64_t>&) to avoid unnecessary heap allocation and
shared ownership, improving performance and simplifying usage.
|
||
namespace clp_s::search::ast { | ||
class Expression; | ||
} // namespace lp_s::search::ast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} // namespace lp_s::search::ast | |
} // namespace clp_s::search::ast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-format failed due to this
Description
Overview
The current Presto–CLP connector PR introduces the coordinator-side implementation, along with a placeholder (dummy) worker implementation. Detailed information about the overall design is available in the corresponding RFC. This Velox PR focuses on the worker-side logic.
The Velox CLP connector enables query execution on CLP archives. The Velox worker receives split information and the associated KQL query from the Presto coordinator. For each split, it executes the KQL query against the relevant CLP archive to find matching messages and stores their indices.
To support lazy evaluation, the implementation creates lazy vectors that wrap a CLP column reader and the list of matching indices. When accessed during query execution, these vectors load and decode only the necessary data on demand.
Core Classes
ClpDataSource
This class extends
DataSource
and implements theaddSplit
andnext
methods. During initialization, it records the KQL query and archive source (S3 or local), then traverses the output type to map Presto fields to CLP projection fields. OnlyARRAY(VARCHAR)
and primitive leaf fields likeBIGINT
,DOUBLE
,BOOLEAN
andVARCHAR
are projected.When a split is added, a
ClpCursor
is created with the archive path and input source. The query is parsed and simplified into an AST. Onnext
, the cursor finds matching row indices and, if any exist, returns a row vector composed of lazy vectors, which load data as needed during execution.ClpCursor
This class manages the execution of a query over a CLP-S archive. It handles parsing and validation, loading schemas and archives, setting up projection fields, and filtering results. In CLP-S, records are partitioned by schemas.
ClpCursor
usesClpQueryRunner
to initialize the execution context for each schema and evaluate the filters. It will skip archives where dictionary lookups for string filters return no matches and only scan the relevant schemas of a specific archive. For example, consider a log dataset with the following records.The three log messages have varying schemas. If we run a KQL query
a: World AND b: 0
, it will skip loading the third message because it's schema does not match the query (there's nob
field). And if the query isa: random AND b: 0
, it will even skip scanning the first two records, becauserandom
cannot be found in the dictionary.ClpQueryRunner
This class extends the generic CLP
QueryRunner
to support ordered projection and row filtering. It initializes projected column readers and returns filtered row indices for each batch.ClpVectorLoader
In CLP, values are decoded and read from a
BaseColumnReader
. TheClpVectorLoader
is custom VeloxVectorLoader
that loads vectors from CLP column readers. It supports integers, floats, booleans, strings, and arrays of strings. It's used by lazy vectors to load data on demand using the previously stored row indices.Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Release Notes
en-CA
New Features
Tests
Documentation