-
Notifications
You must be signed in to change notification settings - Fork 0
feat(clp-connector): Add config to support searching archives from S3. #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp-connector): Add config to support searching archives from S3. #21
Conversation
WalkthroughAdds an S3 auth-provider abstraction and CLP_PACKAGE implementation; wires provider selection and environment-variable export into ClpConfig; uses the provider in ClpDataSource to build S3 URLs for S3 splits; adds provider base and implementation, updates build files, adds tests, and documents new config keys. Changes
Sequence Diagram(s)sequenceDiagram
participant CONFIG as ClpConfig
participant PROVIDER as ClpPackageS3AuthProvider
participant DS as ClpDataSource
participant CUR as ClpCursor
CONFIG->>PROVIDER: instantiate (if clp.s3-auth-provider == CLP_PACKAGE)
CONFIG->>PROVIDER: exportAuthEnvironmentVariables()
DS->>CONFIG: s3AuthProvider()
DS->>PROVIDER: constructS3Url(split.path)
PROVIDER-->>DS: full S3 URL
DS->>CUR: construct with S3 URL (or raw path for FS)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 28
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (13)
CMake/resolve_dependency_modules/clp.cmake
(1 hunks)velox/connectors/clp/CMakeLists.txt
(0 hunks)velox/connectors/clp/ClpConfig.cpp
(2 hunks)velox/connectors/clp/ClpConfig.h
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(3 hunks)velox/connectors/clp/ClpDataSource.h
(1 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h
(1 hunks)velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h
(1 hunks)velox/connectors/clp/tests/CMakeLists.txt
(1 hunks)velox/connectors/clp/tests/ClpConfigTest.cpp
(1 hunks)
💤 Files with no reviewable changes (1)
- velox/connectors/clp/CMakeLists.txt
👮 Files not reviewed due to content moderation or server errors (2)
- velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
- velox/connectors/clp/tests/ClpConfigTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (21)
CMake/resolve_dependency_modules/clp.cmake (1)
19-19
: LGTM! Dependency update aligns with S3 functionality.The Git tag update to commit
581bd46198a97a89b174849851cc3f1f2100e466
is necessary to support the new S3 authentication provider functionality introduced in this PR.velox/connectors/clp/tests/CMakeLists.txt (1)
14-14
: LGTM! Test coverage expansion for new configuration functionality.Adding
ClpConfigTest.cpp
to the test executable appropriately expands test coverage for the new S3 authentication provider configuration features.velox/connectors/clp/search_lib/CMakeLists.txt (1)
19-20
: LGTM! Proper integration of S3 auth provider source files.The addition of the four new S3 authentication provider files (
ClpPackageS3AuthProvider.cpp/.h
andClpS3AuthProviderBase.cpp/.h
) to the static library build configuration is appropriate and maintains consistent file organisation.Also applies to: 23-24
velox/connectors/clp/ClpDataSource.cpp (2)
23-23
: LGTM! Proper integration of S3 auth provider.The include statement and member initialization correctly integrate the S3 authentication provider from the configuration.
Also applies to: 42-42
112-112
: No null-check needed for s3AuthProvider_
ClpConfig’s constructor always initializes s3AuthProvider_ (and will abort on failure), so any later call to s3AuthProvider_->constructS3Url() is safe. There’s no need to add a null check here.velox/connectors/clp/ClpConfig.cpp (2)
28-35
: LGTM: Clean enum conversion with proper validation.The
stringToS3AuthProvider
function correctly validates input and converts strings to enum values with appropriate error handling.
66-68
: LGTM: Simple and correct getter implementation.The getter method correctly returns the stored S3 auth provider instance.
velox/connectors/clp/ClpConfig.h (3)
25-31
: LGTM: Clean design with forward declaration and extensible enum.The forward declaration avoids unnecessary includes, and the S3AuthProvider enum is designed to be easily extensible for future auth provider types.
38-38
: LGTM: Configuration key follows naming convention.The configuration key
clp.s3-auth-provider
follows the established naming pattern and is appropriately scoped.
41-53
: LGTM: Clean interface design.The constructor and accessor method provide a clean interface for S3 auth provider management. The private member is appropriately encapsulated.
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (2)
28-48
: LGTM: Well-designed abstract interface with clear documentation.The interface correctly defines the contract for S3 auth providers with appropriate pure virtual methods and good documentation.
57-65
: LGTM: Well-documented protected helper methods.The protected static methods provide a clean abstraction for environment variable management with appropriate documentation.
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (3)
33-41
: LGTM: Well-organized configuration and environment variable constants.The configuration keys and AWS environment variable names are clearly defined and follow established naming conventions. The separation between config keys and env var names provides good clarity.
27-31
: LGTM: Clean inheritance and constructor design.The class correctly inherits from the base class and uses constructor delegation appropriately.
42-47
: Endpoint caching is intentional and appropriateThe implementation caches
endPoint_
in constructS3Url to avoid repeated config lookups on each call, whereas exportAuthEnvironmentVariables()—a one-off, const method—reads credentials directly without storing them. This design aligns with expected usage patterns; no changes are needed.velox/connectors/clp/tests/ClpConfigTest.cpp (6)
85-119
: Comprehensive test coverage with good edge case handling.The test method effectively covers both the full configuration scenario (with session token) and the minimal configuration scenario (without session token). The environment variable validation is thorough and appropriately uses the helper methods.
77-83
: Good test for case-insensitive configuration handling.This test effectively validates that the auth provider string is handled case-insensitively, which is important for user experience and configuration flexibility.
31-39
: LGTM!The test fixture is well-designed with a clean helper method for building ClpConfig instances. The use of shared_ptr and move semantics is appropriate.
56-64
: LGTM!The test fixture follows the same clean pattern as the parent fixtures and provides a useful helper method for building test instances.
77-83
: LGTM!Good test for case-insensitive auth provider handling. The mixed case test string effectively validates this functionality.
85-119
: Comprehensive test with excellent coverage.This test thoroughly validates the AWS credential export functionality, including both scenarios with and without session tokens. The test structure is well-designed with clear test data and proper verification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (7)
velox/connectors/clp/ClpDataSource.h (1)
31-31
: Add missing header for ClpS3AuthProviderBase.The type
ClpS3AuthProviderBase
needs to be fully declared for thestd::shared_ptr<ClpS3AuthProviderBase>
member at line 106.Also applies to: 106-106
velox/connectors/clp/ClpConfig.cpp (1)
50-64
: Constructor performs heavy initialization - consider implications.The constructor modifies global process state by exporting environment variables. This could cause issues if multiple instances are created or if initialization fails.
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp (3)
44-44
: Fix typo in error message."Unsuppported OS" should be "Unsupported OS".
59-59
: Fix Windows implementation to use _putenv_s.The Windows implementation uses
_putenv
instead of_putenv_s
, which is inconsistent with the setup function.
64-64
: Fix typo in error message."Unsuppported OS" should be "Unsupported OS".
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (2)
39-39
: Consider return type optimization for URL construction.The method returns
const std::string
which forces a copy. Consider returningstd::string
directly for better move semantics.
41-48
: Documentation mentions specific environment variable names - consider consistency.The documentation mentions specific AWS environment variable names, but they might be implementation-specific. Consider whether this should be more generic in the base class.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
velox/connectors/clp/ClpConfig.cpp
(2 hunks)velox/connectors/clp/ClpDataSource.h
(2 hunks)velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h
(1 hunks)velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h
(1 hunks)velox/connectors/clp/tests/ClpConfigTest.cpp
(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: when analyzing include dependencies, pay special attention to types used in template parameters and ...
Learnt from: gibber9809
PR: y-scope/velox#15
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:22-22
Timestamp: 2025-07-11T15:21:09.658Z
Learning: When analyzing include dependencies, pay special attention to types used in template parameters and enum values like `clp_s::NodeType`, as these may not be immediately obvious in a surface-level review but are essential for compilation.
Applied to files:
velox/connectors/clp/ClpDataSource.h
📚 Learning: in the velox clp connector codebase, the team prefers using explicit boolean comparisons like `false...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/tests/ClpConfigTest.cpp:114-117
Timestamp: 2025-08-01T19:58:22.174Z
Learning: In the Velox CLP connector codebase, the team prefers using explicit boolean comparisons like `false == function()` instead of `!function()` for better visibility and to prevent people from overlooking the negation operator.
Applied to files:
velox/connectors/clp/tests/ClpConfigTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (7)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (2)
23-27
: LGTM! Input validation and simplified implementation.The method now properly validates that
splitPath
is not empty and uses a pre-initializedendPoint_
member, which addresses the previous thread safety concerns about lazy initialization.
29-54
: Excellent security improvement - credentials no longer logged.The method now properly handles AWS credential configuration without logging sensitive values, which addresses the critical security vulnerability identified in previous reviews. The implementation correctly:
- Validates required configuration parameters with meaningful error messages
- Sets up environment variables without exposing credential values in logs
- Handles optional session tokens appropriately
- Returns a boolean success indicator
velox/connectors/clp/tests/ClpConfigTest.cpp (4)
53-62
: LGTM! Proper null pointer handling implemented.The
checkEnvironmentVariableEquals
method now correctly handles the case wherestd::getenv
returnsnullptr
, addressing the previous security concern about potential crashes. The logic properly distinguishes between undefined variables and variables with actual values.
77-83
: LGTM! Clear test for unsupported auth provider.The test properly validates that unsupported auth provider types throw appropriate exceptions with meaningful error messages.
85-91
: LGTM! Good coverage of case-insensitive parsing.The test validates that the auth provider configuration is parsed in a case-insensitive manner, which is important for user experience.
93-127
: Comprehensive test coverage for environment variable export.The test thoroughly validates both scenarios:
- Configuration with all parameters including optional session token
- Configuration without session token, ensuring proper cleanup
The test correctly uses the team's preferred explicit boolean comparison style with
std::nullopt
for checking undefined environment variables.velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (1)
27-31
: Well-designed class declaration and constructor.The use of
explicit
constructor and proper delegation to the base class constructor follows best practices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
velox/connectors/clp/tests/ClpConfigTest.cpp
(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.
📚 Learning: in velox/connectors/clp/clpconfig.cpp, the `parseconfigandexportauthenvironmentvariables()` method i...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.
Applied to files:
velox/connectors/clp/tests/ClpConfigTest.cpp
📚 Learning: in the velox clp connector codebase, the team prefers using explicit boolean comparisons like `false...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/tests/ClpConfigTest.cpp:114-117
Timestamp: 2025-08-01T19:58:22.223Z
Learning: In the Velox CLP connector codebase, the team prefers using explicit boolean comparisons like `false == function()` instead of `!function()` for better visibility and to prevent people from overlooking the negation operator.
Applied to files:
velox/connectors/clp/tests/ClpConfigTest.cpp
🪛 GitHub Actions: Run Checks
velox/connectors/clp/tests/ClpConfigTest.cpp
[error] 197-197: clang-format formatting check failed. The file was modified by the clang-format hook to fix formatting issues.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (6)
velox/connectors/clp/tests/ClpConfigTest.cpp (6)
1-25
: LGTM!Copyright header and includes are properly structured and complete for the test functionality.
30-125
: Excellent cross-platform environment variable management.The implementation properly handles environment variable isolation across Windows and Unix-like systems, addressing the test pollution concerns from previous reviews. The logic correctly saves original state and restores it after tests.
127-184
: Well-structured test fixture hierarchy.The test fixtures provide good separation of concerns and the environment variable checking logic properly handles null cases. The RAII pattern for environment variable management is correctly implemented.
188-194
: LGTM!The test correctly validates that unsupported auth provider types trigger appropriate exceptions with specific error messages.
196-203
: Good test coverage for case insensitive auth provider handling.The test effectively validates that mixed case auth provider names are handled correctly, which is important for user-friendly configuration.
205-239
: Comprehensive test coverage for AWS environment variable export.The test thoroughly validates both scenarios (with and without session token) and correctly uses the team's preferred explicit boolean comparison style. The logic properly verifies that configuration values are exported as environment variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly requesting changes to some docstrings + method names, but otherwise looks good.
Per side-bar we're sticking with this approach for now, but it would be good to have clp-s expose an auth provider interface so that we have the option to pass auth tokens more directly.
|
||
namespace facebook::velox::connector::clp { | ||
|
||
void ClpS3AuthProviderBase::setupEnvironmentVariable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could rename this to setEnvironmentVariable
?
/// Retrieves some customized config option value and save them to the class | ||
/// property. Also exports the three environment variables needed by CLP-s | ||
/// to system: | ||
/// AWS_ENV_VAR_ACCESS_KEY_ID | ||
/// AWS_ENV_VAR_SECRET_ACCESS_KEY | ||
/// AWS_ENV_VAR_SESSION_TOKEN (optional) | ||
/// So that in runtime, CLP-s’s code can execute S3-related logic correctly. | ||
/// | ||
/// @return Did exportation succeed or not. | ||
virtual bool parseConfigAndExportAuthEnvironmentVariables() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the fact that this method parses config is an implementation detail, and that the important part is that auth providers should export those three environment variables.
Maybe we can rename the method to exportAuthEnvironmentVariables
and change the description to something like:
/// Exports the three environment variables needed by CLP-S:
/// AWS_ENV_VAR_ACCESS_KEY_ID
/// ....
/// So that at runtime CLP-S' code can authenticate with S3.
/// @return Whether environment variable export succeeded or not.
Also one question, why do they all have ENV_VAR
in the name (i.e. AWS_ENV_VAR_ACCESS_KEY_ID
vs AWS_ACCESS_KEY_ID
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good suggestion, will apply. For the ENV_VAR
I copied them from Haiqi's s3_utils.py
. I will remove the ENV_VAR
here, because they are not the actual exported env var's names and misleading.
|
||
namespace { | ||
|
||
class ClpEnvironmentVariableGuard { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't read this class' implementation in detail but it looks generally correct.
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (4)
velox/connectors/clp/ClpConfig.cpp (1)
50-64
: Constructor modifies global environment stateThe constructor calls
exportAuthEnvironmentVariables()
which modifies the global process environment. This could cause issues if multipleClpConfig
instances are created or if initialization fails partway through.Consider documenting that this constructor has global side effects in the header file.
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (2)
21-23
: Remove redundant ConfigBase forward declarationThe
ConfigBase
forward declaration is unnecessary because it's already declared in the included ClpS3AuthProviderBase.h.-namespace facebook::velox::config { -class ConfigBase; -} // namespace facebook::velox::config -
42-44
: Add documentation for virtual function overrides+ /// Constructs a full S3 URL from the given split path. + /// @param splitPath The original split path from the coordinator + /// @return The constructed S3 URL with proper endpoint and authentication std::string constructS3Url(std::string_view splitPath) override; + /// Parses S3 configuration and exports required AWS environment variables. + /// @return true if configuration is valid and environment variables are exported successfully bool exportAuthEnvironmentVariables() override;velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (1)
41-49
: Consider more generic documentation for the base class.The documentation mentions specific AWS environment variable names, which might be too implementation-specific for a base class. Consider making it more generic to allow for different authentication schemes.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
velox/connectors/clp/ClpConfig.cpp
(2 hunks)velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h
(1 hunks)velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h
(1 hunks)velox/connectors/clp/tests/ClpConfigTest.cpp
(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp
(2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.
📚 Learning: in velox/connectors/clp/clpconfig.cpp, the `parseconfigandexportauthenvironmentvariables()` method i...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.
Applied to files:
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h
velox/connectors/clp/tests/ClpConnectorTest.cpp
velox/connectors/clp/ClpConfig.cpp
velox/connectors/clp/tests/ClpConfigTest.cpp
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
📚 Learning: in the velox clp connector codebase, the team prefers using explicit boolean comparisons like `false...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/tests/ClpConfigTest.cpp:114-117
Timestamp: 2025-08-01T19:58:22.223Z
Learning: In the Velox CLP connector codebase, the team prefers using explicit boolean comparisons like `false == function()` instead of `!function()` for better visibility and to prevent people from overlooking the negation operator.
Applied to files:
velox/connectors/clp/tests/ClpConfigTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (14)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)
25-25
: Test configuration changes look goodThe addition of S3 authentication configuration parameters with dummy values is appropriate for test setup.
Also applies to: 57-63
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp (2)
29-51
: Environment variable management implementation looks goodThe platform-specific implementation with proper error handling and sanity checks is well done.
53-71
: Unset operation properly implementedThe unset functionality with platform-specific handling and verification is correct.
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (2)
23-27
: URL construction implementation is correctGood validation of splitPath and straightforward URL construction.
29-54
: Environment variable export properly implementedThe implementation correctly:
- Validates all required configuration parameters
- Handles endpoint normalization
- Sets/unsets environment variables appropriately
- Avoids logging sensitive credentials
All previous security and validation concerns have been addressed.
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (3)
1-25
: Clean header structure with minimal dependencies.Good use of forward declaration for
ConfigBase
to reduce compilation dependencies.
28-33
: Well-designed base class structure.The explicit constructor and defaulted virtual destructor follow best practices for abstract base classes.
51-69
: Well-designed helper methods for environment variable management.The static helper methods provide a clean abstraction for platform-specific operations with clear documentation.
velox/connectors/clp/tests/ClpConfigTest.cpp (6)
30-125
: Excellent implementation of environment variable isolation.This properly addresses the concern about test environment pollution by saving and restoring the original environment state. The platform-specific handling is comprehensive.
127-149
: Well-structured test fixture with proper cleanup.The integration of
ClpEnvironmentVariableGuard
in SetUp/TearDown ensures test isolation.
153-172
: Proper null handling implemented.This implementation correctly addresses the previous concern about null pointer dereference when
std::getenv
returnsnullptr
.
187-193
: Clean test for unsupported auth provider.The test correctly validates exception handling for unsupported auth providers.
195-202
: Good test coverage for case-insensitive parsing.This test ensures auth provider names are handled case-insensitively, which improves usability.
204-236
: Comprehensive test coverage for environment variable export.The test properly validates both scenarios (with and without session token) and correctly verifies that the session token environment variable is unset when not provided in the configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! PR title is fine, but maybe change "configs" to "config".
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
velox/docs/configs.rst
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
09db05b
to
bc60f10
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
velox/docs/develop/connectors.rst (1)
154-169
: Tighten wording and add cross-reference; fix “CLP-s” phrasingMinor editorial tweaks improve clarity and consistency.
-1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment - variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations - correctly. This function is invoked immediately after the configuration is parsed. +1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment + variables required by CLP processes. This ensures that, at runtime, CLP components can execute S3-related operations + correctly. This function is invoked immediately after the configuration is parsed. @@ -Additionally, this interface maintains a reference to ``config_``, enabling users to define custom configuration -options for passing any required information. +Additionally, this interface maintains a reference to ``config_``, enabling users to define custom configuration +options for passing any required information. See also the CLP Connector section in :doc:`configs` for related +configuration keys (e.g., ``clp.s3-auth-provider`` and ``clp.s3-*``).
♻️ Duplicate comments (5)
velox/docs/configs.rst (5)
913-925
: Scope note (workers, S3-only) is good; clarify value case-sensitivity and usage timing
- You already state it’s required when clp.storage-type = S3. Consider adding that this is used on workers.
- Please clarify whether the allowed value is case-sensitive in the implementation.
- - Specifies the S3 auth provider used by CLP to obtain authentication information (typically + - Specifies the S3 auth provider used by CLP to obtain authentication information on workers (typically the three environment variables ``AWS_ACCESS_KEY_ID``, ``AWS_SECRET_ACCESS_KEY``, and the optional ``AWS_SESSION_TOKEN``). This option is required when the ``clp.storage-type`` @@ - in the config file. **Allowed values:** ``CLP_PACKAGE``. + in the config file. **Allowed values:** ``CLP_PACKAGE`` (clarify whether value matching is case-sensitive).To verify case-sensitivity and any default handling, run:
#!/bin/bash # Inspect how clp.s3-auth-provider is parsed and compared. rg -n $'clp\\.s3-auth-provider' -A 5 -B 5 rg -n $'CLP_PACKAGE' -A 3 -B 3 # Look for toLower/strcasecmp or similar to infer case handling. rg -n $'toLower|tolower|strcasecmp|equalsIgnoreCase' -A 2 -B 2
925-930
: Add a brief security note to discourage plaintext credentialsReinforce safe handling for the access key ID.
- - The S3 access key ID, exported as the ``AWS_ACCESS_KEY_ID`` environment variable and used - for authentication. + - The S3 access key ID, exported as the ``AWS_ACCESS_KEY_ID`` environment variable and used + for authentication. Avoid storing credentials in plaintext configs; prefer secret managers + or environment injection and ensure strict file permissions if used.
930-935
: Clarify S3 endpoint format (scheme, trailing slash) and keep key name consistentEnsure operators provide a correctly formatted endpoint and avoid accidental double slashes.
- - The S3 endpoint. For the CLP package, the complete split URL is formatted as - ``<clp.s3-end-point>/<split_path>``, where ``<split_path>`` is obtained from the CLP - package's metadata database and stored in the ``ClpConnectorSplit`` class. + - The S3 endpoint (must include scheme and host, e.g., ``https://s3.amazonaws.com``; omit trailing slash). + For the CLP package, the full split URL is constructed as: + ``<clp.s3-end-point>/<split_path>``, where ``<split_path>`` is obtained from the CLP + package’s metadata database and stored in the ``ClpConnectorSplit`` class.
936-941
: Mark secret access key as sensitive and advise secure handlingCall out that this should not be stored in plaintext.
- - The S3 secret access key, exported as the ``AWS_SECRET_ACCESS_KEY`` environment variable - and used for authentication. + - The S3 secret access key, exported as the ``AWS_SECRET_ACCESS_KEY`` environment variable + and used for authentication. Treat as sensitive; avoid plaintext storage and prefer + secure secret delivery mechanisms.
942-945
: Clarify session token usage and rotation expectationsNote this is typically temporary and not auto-refreshed.
- - The S3 session token, exported as the ``AWS_SESSION_TOKEN`` environment variable and used - for authentication. This option is optional. + - The S3 session token, exported as the ``AWS_SESSION_TOKEN`` environment variable and used + for authentication. Optional. Typically used with temporary credentials (e.g., STS) and + not auto-refreshed; ensure timely rotation if tokens expire.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
velox/docs/configs.rst
(1 hunks)velox/docs/develop/connectors.rst
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (3)
velox/docs/develop/connectors.rst (3)
127-131
: CLP connector overview reads wellThe additions clearly set context and introduce the S3 auth provider interface. No blocking issues here.
144-148
: LGTM: init flow descriptionGood clarity on recording query and split source and mapping output columns (including nested row types).
149-153
: LGTM: split processing pipelineAccurately reflects cursor creation from split path and lazy vector loading behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🔭 Outside diff range comments (1)
velox/docs/develop/connectors.rst (1)
156-170
: Document the default provider and its config keys with an exampleGiven reviewer requests for config docs, add a short subsection describing the built-in provider and relevant keys.
ClpS3AuthProviderBase ~~~~~~~~~~~~~~~~~~~~~ @@ Additionally, this interface maintains a reference to ``config_``, enabling users to define custom configuration options for passing any required information. + +Default implementation: ``ClpPackageS3AuthProvider`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +When ``clp.s3-auth-provider`` is set to ``S3``, the built-in ``ClpPackageS3AuthProvider`` is used. It recognises the +following configuration keys: + +- ``clp.s3-access-key-id`` → exported as ``AWS_ACCESS_KEY_ID`` +- ``clp.s3-secret-access-key`` → exported as ``AWS_SECRET_ACCESS_KEY`` +- ``clp.s3-session-token`` (optional) → exported as ``AWS_SESSION_TOKEN`` +- ``clp.s3-end-point`` → used to format the split URL (e.g., ``https://minio:9000``) + +Example: + +.. code-block:: ini + + # Connector configuration + clp.s3-auth-provider = S3 + clp.s3-access-key-id = <your-access-key-id> + clp.s3-secret-access-key = <your-secret-access-key> + # Optional for temporary credentials: + clp.s3-session-token = <your-session-token> + # S3-compatible endpoint (AWS S3, MinIO, etc.) + clp.s3-end-point = https://minio:9000
♻️ Duplicate comments (1)
velox/docs/develop/connectors.rst (1)
134-137
: Clarify “partial URL” with concrete examplesAdd explicit examples for complete vs partial S3 paths to remove ambiguity.
-``ClpConnectorSplit`` describes a data chunk using ``path``. This path may be the absolute file path to the split file -if it is stored on a local file system, or the complete (or partial) URL of the split if it is stored on S3. In the -latter case, when only a partial URL is provided, ``ClpS3AuthProviderBase`` provides a hook in ``ClpDataSource`` to -assist in constructing the full URL. Refer to :ref:`ClpS3AuthProviderBase<ClpS3AuthProviderBase>` for details. +``ClpConnectorSplit`` describes a data chunk using ``path``. This path may be the absolute file path to the split file +if it is stored on a local file system, or the complete (or partial) URL of the split if it is stored on S3. For example, +complete: ``s3://bucket/prefix/object.clp``, partial: ``bucket/prefix/object.clp``. When a partial URL is provided, +``ClpS3AuthProviderBase`` provides a hook in ``ClpDataSource`` to construct the full URL. Refer to +:ref:`ClpS3AuthProviderBase<ClpS3AuthProviderBase>` for details.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
velox/connectors/clp/tests/ClpConnectorTest.cpp
(1 hunks)velox/docs/develop/connectors.rst
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (2)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)
56-56
: Ignore explicit CLP storage‐type configuration in testsThe CLP tests already pass an empty config and ClpConfig defaults storage‐type to “FS” (local file system). Since S3 logic only kicks in when storage‐type is “S3”, none of the AWS environment variables will ever be used in these tests. No changes are needed.
Likely an incorrect or invalid review comment.
velox/docs/develop/connectors.rst (1)
144-148
: Replace “KQL query” with neutral term unless KQL is intendedIs the query language actually KQL here? If not, use a neutral term to avoid confusion.
-During initialization, it records the KQL query and split source (S3 or local). It then iterates through +During initialization, it records the query and split source (S3 or local). It then iterates through
std::make_shared<config::ConfigBase>( | ||
std::unordered_map<std::string, std::string>{ | ||
{"clp.split-source", "local"}})); | ||
std::unordered_map<std::string, std::string>{})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Nit: include <unordered_map> explicitly.
You now reference std::unordered_map directly on this line. Consider adding the header to make the dependency explicit and avoid relying on transitive includes.
Add near the other includes:
#include <unordered_map>
🤖 Prompt for AI Agents
In velox/connectors/clp/tests/ClpConnectorTest.cpp around line 56, the test uses
std::unordered_map but the header <unordered_map> is not explicitly included;
add the line '#include <unordered_map>' alongside the other include directives
at the top of the file so the dependency is explicit and does not rely on
transitive includes.
The CLP Connector is used to read CLP splits stored on a local file system or S3. It implements similar interfaces as | ||
the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` interface and the | ||
``ConnectorSplit`` interface implementation since ``Connector`` and ``ConnectorFactory`` are similar to the Hive | ||
Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Surface the new config selector up-front
Briefly mention that authentication and URL construction are driven by the new config key so readers know where to look.
-Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication.
+Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication.
+Authentication and S3 URL construction are selected via the ``clp.s3-auth-provider`` configuration key.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The CLP Connector is used to read CLP splits stored on a local file system or S3. It implements similar interfaces as | |
the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` interface and the | |
``ConnectorSplit`` interface implementation since ``Connector`` and ``ConnectorFactory`` are similar to the Hive | |
Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication. | |
The CLP Connector is used to read CLP splits stored on a local file system or S3. It implements similar interfaces as | |
the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` interface and the | |
``ConnectorSplit`` interface implementation since ``Connector`` and ``ConnectorFactory`` are similar to the Hive | |
Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication. | |
Authentication and S3 URL construction are selected via the ``clp.s3-auth-provider`` configuration key. |
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 127 to 131, the introduction
to the CLP Connector should briefly call out that authentication and URL
construction are driven by the new config selector key so readers know where to
look; add one clear sentence after the first line stating that the connector
uses a new config key (mention its exact name if available) to select S3
authentication and to construct S3/FS URLs, and point readers to the
ClpS3AuthProviderBase section for customization details.
When a split is added, a ``ClpCursor`` is created with the split path and input source. The query is parsed | ||
and simplified into an AST. On ``next``, the cursor finds matching row indices and, if any exist, | ||
``ClpDataSource`` recursively creates a row vector composed of lazy vectors, which use CLP column readers to | ||
decode and load data as needed during execution. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Optional: briefly link or reference “lazy vectors” for readers unfamiliar with the concept
If there’s an existing Velox doc section on LazyVector, consider adding a cross-reference. Otherwise, one short clause like “(see LazyVector)” helps orient readers.
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 149 to 153, add a brief
reference to the LazyVector concept for readers unfamiliar with it; update the
sentence that mentions "lazy vectors" to include a short parenthetical
cross-reference like "(see LazyVector)" or, if a LazyVector section exists, link
to it (e.g., "see :ref:`lazy-vector`" or the appropriate doc anchor). Ensure the
wording is concise and non-disruptive, and use the existing documentation
cross-reference style used elsewhere in the docs.
1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment | ||
variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations | ||
correctly. This function is invoked immediately after the configuration is parsed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Clarify env vars and optionality; define “CLP-s” or avoid the term
Spell out which variables are exported and note that the session token is optional. Also consider defining “CLP-s” upon first use, or replace with “CLP search worker” if that’s what is meant.
-1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment
- variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations
- correctly. This function is invoked immediately after the configuration is parsed.
+1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the required AWS
+ credentials to the environment: ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` (and ``AWS_SESSION_TOKEN``
+ when using temporary credentials). This ensures that, at runtime, the CLP search worker can execute S3-related
+ operations correctly. This function is invoked immediately after the configuration is parsed.
🏁 Script executed:
#!/bin/bash
# Locate usage and implementation of exportAuthEnvironmentVariables in ClpConfig.cpp
rg -n "exportAuthEnvironmentVariables" -n velox/connectors/clp/ClpConfig.cpp
Length of output: 147
🏁 Script executed:
#!/bin/bash
# Search for exportAuthEnvironmentVariables implementation with context
rg -C5 "exportAuthEnvironmentVariables" -n velox/connectors
Length of output: 6354
🏁 Script executed:
#!/bin/bash
# Confirm the AWS access key environment variable name
rg -n "kEnvAwsAccessKeyId" -n velox/connectors/clp
Length of output: 429
Clarify exportAuthEnvironmentVariables docs
Spell out exactly which AWS environment variables are set and replace “CLP-s” with “CLP search worker” for clarity.
In velox/docs/develop/connectors.rst (lines 161–163), update the snippet to:
-1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment
- variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations
- correctly. This function is invoked immediately after the configuration is parsed.
+1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the required AWS
+ credentials to the environment: ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY``
+ (and ``AWS_SESSION_TOKEN`` when using temporary credentials). This ensures that, at runtime, the CLP search
+ worker can execute S3-related operations correctly. This function is invoked immediately after the configuration
+ is parsed.
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 161 to 163, the docstring for
exportAuthEnvironmentVariables should explicitly list the three AWS environment
variables being exported and replace the term “CLP-s” with “CLP search worker”;
update the text to say that exportAuthEnvironmentVariables() exports
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN to the
environment so the CLP search worker can perform S3 operations, and keep the
note that this function is invoked immediately after configuration parsing.
as its argument, allowing customization in URL construction. For example, the ``path`` could be ``"prefix/split"``, | ||
which must be prefixed with ``"https://bucket.s3.region.amazonaws.com/"`` to form the complete URL. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Use non-ambiguous S3 URL examples (virtual-hosted vs path-style) and acknowledge custom endpoints
Current example may imply a single canonical form. Suggest showing both common styles and noting custom endpoints (e.g., MinIO).
- as its argument, allowing customization in URL construction. For example, the ``path`` could be ``"prefix/split"``,
- which must be prefixed with ``"https://bucket.s3.region.amazonaws.com/"`` to form the complete URL.
+ as its argument, allowing customization in URL construction. For example, the ``path`` could be
+ ``"bucket/prefix/object.clp"`` or simply ``"prefix/object.clp"``. The provider may construct the full URL using
+ virtual-hosted-style (e.g., ``https://bucket.s3.<region>.amazonaws.com/prefix/object.clp``), path-style
+ (e.g., ``https://s3.<region>.amazonaws.com/bucket/prefix/object.clp``), or a custom endpoint (e.g., MinIO).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
as its argument, allowing customization in URL construction. For example, the ``path`` could be ``"prefix/split"``, | |
which must be prefixed with ``"https://bucket.s3.region.amazonaws.com/"`` to form the complete URL. | |
as its argument, allowing customization in URL construction. For example, the ``path`` could be | |
``"bucket/prefix/object.clp"`` or simply ``"prefix/object.clp"``. The provider may construct the full URL using | |
virtual-hosted-style (e.g., ``https://bucket.s3.<region>.amazonaws.com/prefix/object.clp``), path-style | |
(e.g., ``https://s3.<region>.amazonaws.com/bucket/prefix/object.clp``), or a custom endpoint (e.g., MinIO). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving due to time constraints. Will open another PR to fix things later.
For the PR title, how about:
feat(clp-connector): Add config to support searching archives from S3.
Description
Currently, we only support scanning archives stored locally. However, CLP already supported reading from S3. So we can definitely support this in Presto and Velox as well.
This PR added a new config option in worker's side,
clp.s3-auth-provider
. We extracted an interface calledClpS3AuthProviderBase
. User can implement the two functions of this interface:exportAuthEnvironmentVariables()
parses some user-customized config options and exports the three environment variables needed by CLP-s to system. So that in runtime, CLP-s’s code can execute S3-related logic correctly. This function will be called immediately after the config is parsed.constructS3Url()
constructs the actual S3 URL of the split. It takes the originalsplitPath
which is provided by theClpSplitProvder
in coordinator's end as an argument, so user can do customization here.This interface also holds a reference of the
config_
, so user can define their own customized config option to pass some necessary information.This PR also added a default implementation for the
ClpS3AuthProviderBase
which isClpPackageS3AuthProvider
.The config options for
ClpPackageS3AuthProvider
(i.e., whenclp.s3-auth-provider
is set toS3
):clp.s3-access-key-id
is the S3 access key ID, exported as theAWS_ACCESS_KEY_ID
environment variable and used for authentication.clp.s3-end-point
is the S3 endpoint. For the CLP package, the complete split URL is formatted as<clp.s3-end-point>/<split_path>
, where<split_path>
is obtained from the CLP package's metadata database and stored in theClpConnectorSplit
class.clp.s3-secret-access-key
is the S3 secret access key, exported as theAWS_SECRET_ACCESS_KEY
environment variable and used for authentication.clp.s3-session-token
is the S3 session token, exported as theAWS_SESSION_TOKEN
environment variable and used for authentication. This option is optional.This PR also updates the CLP submodule, because the old version somehow had issues with S3.
Checklist
breaking change.
Validation performed
Pass the CI.
E2E testing steps:
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores