Skip to content

Conversation

anlowee
Copy link

@anlowee anlowee commented Jul 31, 2025

Description

Currently, we only support scanning archives stored locally. However, CLP already supported reading from S3. So we can definitely support this in Presto and Velox as well.

This PR added a new config option in worker's side, clp.s3-auth-provider. We extracted an interface called ClpS3AuthProviderBase. User can implement the two functions of this interface:

  1. exportAuthEnvironmentVariables() parses some user-customized config options and exports the three environment variables needed by CLP-s to system. So that in runtime, CLP-s’s code can execute S3-related logic correctly. This function will be called immediately after the config is parsed.
  2. constructS3Url() constructs the actual S3 URL of the split. It takes the original splitPath which is provided by the ClpSplitProvder in coordinator's end as an argument, so user can do customization here.

This interface also holds a reference of the config_, so user can define their own customized config option to pass some necessary information.

This PR also added a default implementation for the ClpS3AuthProviderBase which is ClpPackageS3AuthProvider.

The config options for ClpPackageS3AuthProvider (i.e., when clp.s3-auth-provider is set to S3):

  • clp.s3-access-key-id is the S3 access key ID, exported as the AWS_ACCESS_KEY_ID environment variable and used for authentication.
  • clp.s3-end-point is the S3 endpoint. For the CLP package, the complete split URL is formatted as <clp.s3-end-point>/<split_path>, where <split_path> is obtained from the CLP package's metadata database and stored in the ClpConnectorSplit class.
  • clp.s3-secret-access-key is the S3 secret access key, exported as the AWS_SECRET_ACCESS_KEY environment variable and used for authentication.
  • clp.s3-session-token is the S3 session token, exported as the AWS_SESSION_TOKEN environment variable and used for authentication. This option is optional.

This PR also updates the CLP submodule, because the old version somehow had issues with S3.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Pass the CI.

E2E testing steps:

  1. Prepare clp-package (DM me for the config) and start the package.
  2. Compress the data, for example:
sbin/compress.sh --timestamp-key "t.dollar_sign_date" --dataset "data1"  ../mongod-256MB-processed.log
  1. Then launch the coordinator and worker (DM me for the config).
  2. Run CLI and test the following query, which should return a row of result.
select t.dollar_sign_date from data1 limit 1;

Summary by CodeRabbit

  • New Features

    • CLP connector: configurable S3 auth provider, S3 credentials & endpoint settings, and authenticated S3 URL construction for S3 storage.
  • Bug Fixes

    • More reliable S3 URL handling and authentication setup for S3 splits.
  • Tests

    • Added unit tests for CLP config parsing and S3 auth environment variable management.
  • Chores

    • Updated CLP dependency tag and minor build file cleanups.

Copy link

coderabbitai bot commented Jul 31, 2025

Walkthrough

Adds an S3 auth-provider abstraction and CLP_PACKAGE implementation; wires provider selection and environment-variable export into ClpConfig; uses the provider in ClpDataSource to build S3 URLs for S3 splits; adds provider base and implementation, updates build files, adds tests, and documents new config keys.

Changes

Cohort / File(s) Change Summary
CLP Dependency Update
CMake/resolve_dependency_modules/clp.cmake
Updated the clp dependency Git tag in FetchContent_Declare to a newer commit.
CLP Connector Build Config
velox/connectors/clp/CMakeLists.txt
Removed a commented-out CLP_SRC_DIR line.
ClpConfig: API & ctor
velox/connectors/clp/ClpConfig.h, velox/connectors/clp/ClpConfig.cpp
Added S3AuthProvider enum and kAuthProvider key; introduced ctor taking std::shared_ptr<const config::ConfigBase>, instantiated selected provider (supports CLP_PACKAGE), exported provider env vars, cached storageType_, added s3AuthProvider() getter and s3AuthProvider_ member.
ClpDataSource: use auth provider
velox/connectors/clp/ClpDataSource.h, velox/connectors/clp/ClpDataSource.cpp
Forward-declared ClpS3AuthProviderBase; added s3AuthProvider_ member initialized from ClpConfig; when storageType is S3, call constructS3Url(splitPath) to produce URL passed to ClpCursor.
Search lib build files
velox/connectors/clp/search_lib/CMakeLists.txt
Added ClpPackageS3AuthProvider.* and ClpS3AuthProviderBase.* to the clp-s-search static library sources.
S3 auth provider base
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h, velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp
New abstract base class with config_, pure virtual constructS3Url and exportAuthEnvironmentVariables, and protected helpers to set/unset environment variables (platform-aware).
CLP package provider implementation
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h, velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
New concrete provider that reads endpoint and AWS creds from config, normalizes endpoint, exports AWS env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, optional AWS_SESSION_TOKEN), and constructs full S3 URLs.
Tests: added and updated
velox/connectors/clp/tests/CMakeLists.txt, velox/connectors/clp/tests/ClpConfigTest.cpp, velox/connectors/clp/tests/ClpConnectorTest.cpp
Added ClpConfigTest.cpp with unit tests for ClpConfig and ClpPackageS3AuthProvider (environment isolation and verification); updated ClpConnectorTest setup and includes; test target updated to include new test.
Docs: config entries & connector docs
velox/docs/configs.rst, velox/docs/develop/connectors.rst
Added CLP S3 auth config options (clp.s3-auth-provider, clp.s3-access-key-id, clp.s3-end-point, clp.s3-secret-access-key, clp.s3-session-token), clarified clp.storage-type values and split path semantics, and documented ClpS3AuthProviderBase interface.

Sequence Diagram(s)

sequenceDiagram
    participant CONFIG as ClpConfig
    participant PROVIDER as ClpPackageS3AuthProvider
    participant DS as ClpDataSource
    participant CUR as ClpCursor

    CONFIG->>PROVIDER: instantiate (if clp.s3-auth-provider == CLP_PACKAGE)
    CONFIG->>PROVIDER: exportAuthEnvironmentVariables()
    DS->>CONFIG: s3AuthProvider()
    DS->>PROVIDER: constructS3Url(split.path)
    PROVIDER-->>DS: full S3 URL
    DS->>CUR: construct with S3 URL (or raw path for FS)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Possibly related PRs

  • feat: Add CLP connector. #14 — Extends the CLP connector with S3 auth provider support and related classes/tests; touches overlapping ClpConfig, ClpDataSource, and search_lib files.

Suggested reviewers

  • kirkrodrigues
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@anlowee anlowee marked this pull request as ready for review August 1, 2025 17:40
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 28

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f509a8 and 07b1488.

📒 Files selected for processing (13)
  • CMake/resolve_dependency_modules/clp.cmake (1 hunks)
  • velox/connectors/clp/CMakeLists.txt (0 hunks)
  • velox/connectors/clp/ClpConfig.cpp (2 hunks)
  • velox/connectors/clp/ClpConfig.h (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (3 hunks)
  • velox/connectors/clp/ClpDataSource.h (1 hunks)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (1 hunks)
  • velox/connectors/clp/tests/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/tests/ClpConfigTest.cpp (1 hunks)
💤 Files with no reviewable changes (1)
  • velox/connectors/clp/CMakeLists.txt
👮 Files not reviewed due to content moderation or server errors (2)
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
  • velox/connectors/clp/tests/ClpConfigTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (21)
CMake/resolve_dependency_modules/clp.cmake (1)

19-19: LGTM! Dependency update aligns with S3 functionality.

The Git tag update to commit 581bd46198a97a89b174849851cc3f1f2100e466 is necessary to support the new S3 authentication provider functionality introduced in this PR.

velox/connectors/clp/tests/CMakeLists.txt (1)

14-14: LGTM! Test coverage expansion for new configuration functionality.

Adding ClpConfigTest.cpp to the test executable appropriately expands test coverage for the new S3 authentication provider configuration features.

velox/connectors/clp/search_lib/CMakeLists.txt (1)

19-20: LGTM! Proper integration of S3 auth provider source files.

The addition of the four new S3 authentication provider files (ClpPackageS3AuthProvider.cpp/.h and ClpS3AuthProviderBase.cpp/.h) to the static library build configuration is appropriate and maintains consistent file organisation.

Also applies to: 23-24

velox/connectors/clp/ClpDataSource.cpp (2)

23-23: LGTM! Proper integration of S3 auth provider.

The include statement and member initialization correctly integrate the S3 authentication provider from the configuration.

Also applies to: 42-42


112-112: No null-check needed for s3AuthProvider_
ClpConfig’s constructor always initializes s3AuthProvider_ (and will abort on failure), so any later call to s3AuthProvider_->constructS3Url() is safe. There’s no need to add a null check here.

velox/connectors/clp/ClpConfig.cpp (2)

28-35: LGTM: Clean enum conversion with proper validation.

The stringToS3AuthProvider function correctly validates input and converts strings to enum values with appropriate error handling.


66-68: LGTM: Simple and correct getter implementation.

The getter method correctly returns the stored S3 auth provider instance.

velox/connectors/clp/ClpConfig.h (3)

25-31: LGTM: Clean design with forward declaration and extensible enum.

The forward declaration avoids unnecessary includes, and the S3AuthProvider enum is designed to be easily extensible for future auth provider types.


38-38: LGTM: Configuration key follows naming convention.

The configuration key clp.s3-auth-provider follows the established naming pattern and is appropriately scoped.


41-53: LGTM: Clean interface design.

The constructor and accessor method provide a clean interface for S3 auth provider management. The private member is appropriately encapsulated.

velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (2)

28-48: LGTM: Well-designed abstract interface with clear documentation.

The interface correctly defines the contract for S3 auth providers with appropriate pure virtual methods and good documentation.


57-65: LGTM: Well-documented protected helper methods.

The protected static methods provide a clean abstraction for environment variable management with appropriate documentation.

velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (3)

33-41: LGTM: Well-organized configuration and environment variable constants.

The configuration keys and AWS environment variable names are clearly defined and follow established naming conventions. The separation between config keys and env var names provides good clarity.


27-31: LGTM: Clean inheritance and constructor design.

The class correctly inherits from the base class and uses constructor delegation appropriately.


42-47: Endpoint caching is intentional and appropriate

The implementation caches endPoint_ in constructS3Url to avoid repeated config lookups on each call, whereas exportAuthEnvironmentVariables()—a one-off, const method—reads credentials directly without storing them. This design aligns with expected usage patterns; no changes are needed.

velox/connectors/clp/tests/ClpConfigTest.cpp (6)

85-119: Comprehensive test coverage with good edge case handling.

The test method effectively covers both the full configuration scenario (with session token) and the minimal configuration scenario (without session token). The environment variable validation is thorough and appropriately uses the helper methods.


77-83: Good test for case-insensitive configuration handling.

This test effectively validates that the auth provider string is handled case-insensitively, which is important for user experience and configuration flexibility.


31-39: LGTM!

The test fixture is well-designed with a clean helper method for building ClpConfig instances. The use of shared_ptr and move semantics is appropriate.


56-64: LGTM!

The test fixture follows the same clean pattern as the parent fixtures and provides a useful helper method for building test instances.


77-83: LGTM!

Good test for case-insensitive auth provider handling. The mixed case test string effectively validates this functionality.


85-119: Comprehensive test with excellent coverage.

This test thoroughly validates the AWS credential export functionality, including both scenarios with and without session tokens. The test structure is well-designed with clear test data and proper verification.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (7)
velox/connectors/clp/ClpDataSource.h (1)

31-31: Add missing header for ClpS3AuthProviderBase.

The type ClpS3AuthProviderBase needs to be fully declared for the std::shared_ptr<ClpS3AuthProviderBase> member at line 106.

Also applies to: 106-106

velox/connectors/clp/ClpConfig.cpp (1)

50-64: Constructor performs heavy initialization - consider implications.

The constructor modifies global process state by exporting environment variables. This could cause issues if multiple instances are created or if initialization fails.

velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp (3)

44-44: Fix typo in error message.

"Unsuppported OS" should be "Unsupported OS".


59-59: Fix Windows implementation to use _putenv_s.

The Windows implementation uses _putenv instead of _putenv_s, which is inconsistent with the setup function.


64-64: Fix typo in error message.

"Unsuppported OS" should be "Unsupported OS".

velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (2)

39-39: Consider return type optimization for URL construction.

The method returns const std::string which forces a copy. Consider returning std::string directly for better move semantics.


41-48: Documentation mentions specific environment variable names - consider consistency.

The documentation mentions specific AWS environment variable names, but they might be implementation-specific. Consider whether this should be more generic in the base class.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32c31fb and 2a27297.

📒 Files selected for processing (7)
  • velox/connectors/clp/ClpConfig.cpp (2 hunks)
  • velox/connectors/clp/ClpDataSource.h (2 hunks)
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (1 hunks)
  • velox/connectors/clp/tests/ClpConfigTest.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: when analyzing include dependencies, pay special attention to types used in template parameters and ...
Learnt from: gibber9809
PR: y-scope/velox#15
File: velox/connectors/clp/search_lib/ClpVectorLoader.cpp:22-22
Timestamp: 2025-07-11T15:21:09.658Z
Learning: When analyzing include dependencies, pay special attention to types used in template parameters and enum values like `clp_s::NodeType`, as these may not be immediately obvious in a surface-level review but are essential for compilation.

Applied to files:

  • velox/connectors/clp/ClpDataSource.h
📚 Learning: in the velox clp connector codebase, the team prefers using explicit boolean comparisons like `false...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/tests/ClpConfigTest.cpp:114-117
Timestamp: 2025-08-01T19:58:22.174Z
Learning: In the Velox CLP connector codebase, the team prefers using explicit boolean comparisons like `false == function()` instead of `!function()` for better visibility and to prevent people from overlooking the negation operator.

Applied to files:

  • velox/connectors/clp/tests/ClpConfigTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (7)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (2)

23-27: LGTM! Input validation and simplified implementation.

The method now properly validates that splitPath is not empty and uses a pre-initialized endPoint_ member, which addresses the previous thread safety concerns about lazy initialization.


29-54: Excellent security improvement - credentials no longer logged.

The method now properly handles AWS credential configuration without logging sensitive values, which addresses the critical security vulnerability identified in previous reviews. The implementation correctly:

  1. Validates required configuration parameters with meaningful error messages
  2. Sets up environment variables without exposing credential values in logs
  3. Handles optional session tokens appropriately
  4. Returns a boolean success indicator
velox/connectors/clp/tests/ClpConfigTest.cpp (4)

53-62: LGTM! Proper null pointer handling implemented.

The checkEnvironmentVariableEquals method now correctly handles the case where std::getenv returns nullptr, addressing the previous security concern about potential crashes. The logic properly distinguishes between undefined variables and variables with actual values.


77-83: LGTM! Clear test for unsupported auth provider.

The test properly validates that unsupported auth provider types throw appropriate exceptions with meaningful error messages.


85-91: LGTM! Good coverage of case-insensitive parsing.

The test validates that the auth provider configuration is parsed in a case-insensitive manner, which is important for user experience.


93-127: Comprehensive test coverage for environment variable export.

The test thoroughly validates both scenarios:

  1. Configuration with all parameters including optional session token
  2. Configuration without session token, ensuring proper cleanup

The test correctly uses the team's preferred explicit boolean comparison style with std::nullopt for checking undefined environment variables.

velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (1)

27-31: Well-designed class declaration and constructor.

The use of explicit constructor and proper delegation to the base class constructor follows best practices.

@anlowee anlowee requested a review from gibber9809 August 1, 2025 21:11
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ccce2d6 and 1f10099.

📒 Files selected for processing (1)
  • velox/connectors/clp/tests/ClpConfigTest.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.
📚 Learning: in velox/connectors/clp/clpconfig.cpp, the `parseconfigandexportauthenvironmentvariables()` method i...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.

Applied to files:

  • velox/connectors/clp/tests/ClpConfigTest.cpp
📚 Learning: in the velox clp connector codebase, the team prefers using explicit boolean comparisons like `false...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/tests/ClpConfigTest.cpp:114-117
Timestamp: 2025-08-01T19:58:22.223Z
Learning: In the Velox CLP connector codebase, the team prefers using explicit boolean comparisons like `false == function()` instead of `!function()` for better visibility and to prevent people from overlooking the negation operator.

Applied to files:

  • velox/connectors/clp/tests/ClpConfigTest.cpp
🪛 GitHub Actions: Run Checks
velox/connectors/clp/tests/ClpConfigTest.cpp

[error] 197-197: clang-format formatting check failed. The file was modified by the clang-format hook to fix formatting issues.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (6)
velox/connectors/clp/tests/ClpConfigTest.cpp (6)

1-25: LGTM!

Copyright header and includes are properly structured and complete for the test functionality.


30-125: Excellent cross-platform environment variable management.

The implementation properly handles environment variable isolation across Windows and Unix-like systems, addressing the test pollution concerns from previous reviews. The logic correctly saves original state and restores it after tests.


127-184: Well-structured test fixture hierarchy.

The test fixtures provide good separation of concerns and the environment variable checking logic properly handles null cases. The RAII pattern for environment variable management is correctly implemented.


188-194: LGTM!

The test correctly validates that unsupported auth provider types trigger appropriate exceptions with specific error messages.


196-203: Good test coverage for case insensitive auth provider handling.

The test effectively validates that mixed case auth provider names are handled correctly, which is important for user-friendly configuration.


205-239: Comprehensive test coverage for AWS environment variable export.

The test thoroughly validates both scenarios (with and without session token) and correctly uses the team's preferred explicit boolean comparison style. The logic properly verifies that configuration values are exported as environment variables.

Copy link

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly requesting changes to some docstrings + method names, but otherwise looks good.

Per side-bar we're sticking with this approach for now, but it would be good to have clp-s expose an auth provider interface so that we have the option to pass auth tokens more directly.


namespace facebook::velox::connector::clp {

void ClpS3AuthProviderBase::setupEnvironmentVariable(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could rename this to setEnvironmentVariable?

Comment on lines 41 to 50
/// Retrieves some customized config option value and save them to the class
/// property. Also exports the three environment variables needed by CLP-s
/// to system:
/// AWS_ENV_VAR_ACCESS_KEY_ID
/// AWS_ENV_VAR_SECRET_ACCESS_KEY
/// AWS_ENV_VAR_SESSION_TOKEN (optional)
/// So that in runtime, CLP-s’s code can execute S3-related logic correctly.
///
/// @return Did exportation succeed or not.
virtual bool parseConfigAndExportAuthEnvironmentVariables() = 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the fact that this method parses config is an implementation detail, and that the important part is that auth providers should export those three environment variables.

Maybe we can rename the method to exportAuthEnvironmentVariables and change the description to something like:

/// Exports the three environment variables needed by CLP-S:
///    AWS_ENV_VAR_ACCESS_KEY_ID
///    ....
/// So that at runtime CLP-S' code can authenticate with S3.
/// @return Whether environment variable export succeeded or not.

Also one question, why do they all have ENV_VAR in the name (i.e. AWS_ENV_VAR_ACCESS_KEY_ID vs AWS_ACCESS_KEY_ID)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good suggestion, will apply. For the ENV_VAR I copied them from Haiqi's s3_utils.py. I will remove the ENV_VAR here, because they are not the actual exported env var's names and misleading.


namespace {

class ClpEnvironmentVariableGuard {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't read this class' implementation in detail but it looks generally correct.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
velox/connectors/clp/ClpConfig.cpp (1)

50-64: Constructor modifies global environment state

The constructor calls exportAuthEnvironmentVariables() which modifies the global process environment. This could cause issues if multiple ClpConfig instances are created or if initialization fails partway through.

Consider documenting that this constructor has global side effects in the header file.

velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (2)

21-23: Remove redundant ConfigBase forward declaration

The ConfigBase forward declaration is unnecessary because it's already declared in the included ClpS3AuthProviderBase.h.

-namespace facebook::velox::config {
-class ConfigBase;
-} // namespace facebook::velox::config
-

42-44: Add documentation for virtual function overrides

+  /// Constructs a full S3 URL from the given split path.
+  /// @param splitPath The original split path from the coordinator
+  /// @return The constructed S3 URL with proper endpoint and authentication
   std::string constructS3Url(std::string_view splitPath) override;

+  /// Parses S3 configuration and exports required AWS environment variables.
+  /// @return true if configuration is valid and environment variables are exported successfully
   bool exportAuthEnvironmentVariables() override;
velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (1)

41-49: Consider more generic documentation for the base class.

The documentation mentions specific AWS environment variable names, which might be too implementation-specific for a base class. Consider making it more generic to allow for different authentication schemes.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cc2e52 and a3f155d.

📒 Files selected for processing (7)
  • velox/connectors/clp/ClpConfig.cpp (2 hunks)
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (1 hunks)
  • velox/connectors/clp/tests/ClpConfigTest.cpp (1 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.
📚 Learning: in velox/connectors/clp/clpconfig.cpp, the `parseconfigandexportauthenvironmentvariables()` method i...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/ClpConfig.cpp:63-63
Timestamp: 2025-08-01T20:59:54.903Z
Learning: In velox/connectors/clp/ClpConfig.cpp, the `parseConfigAndExportAuthEnvironmentVariables()` method intentionally combines parsing config and exporting environment variables in a single method to provide flexibility for users implementing custom authentication providers. Splitting this method would be less convenient for users since they would typically call both operations together anyway.

Applied to files:

  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h
  • velox/connectors/clp/tests/ClpConnectorTest.cpp
  • velox/connectors/clp/ClpConfig.cpp
  • velox/connectors/clp/tests/ClpConfigTest.cpp
  • velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp
  • velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
📚 Learning: in the velox clp connector codebase, the team prefers using explicit boolean comparisons like `false...
Learnt from: anlowee
PR: y-scope/velox#21
File: velox/connectors/clp/tests/ClpConfigTest.cpp:114-117
Timestamp: 2025-08-01T19:58:22.223Z
Learning: In the Velox CLP connector codebase, the team prefers using explicit boolean comparisons like `false == function()` instead of `!function()` for better visibility and to prevent people from overlooking the negation operator.

Applied to files:

  • velox/connectors/clp/tests/ClpConfigTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (14)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)

25-25: Test configuration changes look good

The addition of S3 authentication configuration parameters with dummy values is appropriate for test setup.

Also applies to: 57-63

velox/connectors/clp/search_lib/ClpS3AuthProviderBase.cpp (2)

29-51: Environment variable management implementation looks good

The platform-specific implementation with proper error handling and sanity checks is well done.


53-71: Unset operation properly implemented

The unset functionality with platform-specific handling and verification is correct.

velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (2)

23-27: URL construction implementation is correct

Good validation of splitPath and straightforward URL construction.


29-54: Environment variable export properly implemented

The implementation correctly:

  • Validates all required configuration parameters
  • Handles endpoint normalization
  • Sets/unsets environment variables appropriately
  • Avoids logging sensitive credentials

All previous security and validation concerns have been addressed.

velox/connectors/clp/search_lib/ClpS3AuthProviderBase.h (3)

1-25: Clean header structure with minimal dependencies.

Good use of forward declaration for ConfigBase to reduce compilation dependencies.


28-33: Well-designed base class structure.

The explicit constructor and defaulted virtual destructor follow best practices for abstract base classes.


51-69: Well-designed helper methods for environment variable management.

The static helper methods provide a clean abstraction for platform-specific operations with clear documentation.

velox/connectors/clp/tests/ClpConfigTest.cpp (6)

30-125: Excellent implementation of environment variable isolation.

This properly addresses the concern about test environment pollution by saving and restoring the original environment state. The platform-specific handling is comprehensive.


127-149: Well-structured test fixture with proper cleanup.

The integration of ClpEnvironmentVariableGuard in SetUp/TearDown ensures test isolation.


153-172: Proper null handling implemented.

This implementation correctly addresses the previous concern about null pointer dereference when std::getenv returns nullptr.


187-193: Clean test for unsupported auth provider.

The test correctly validates exception handling for unsupported auth providers.


195-202: Good test coverage for case-insensitive parsing.

This test ensures auth provider names are handled case-insensitively, which improves usability.


204-236: Comprehensive test coverage for environment variable export.

The test properly validates both scenarios (with and without session token) and correctly verifies that the session token environment variable is unset when not provided in the configuration.

@anlowee anlowee requested a review from gibber9809 August 5, 2025 18:17
gibber9809
gibber9809 previously approved these changes Aug 5, 2025
Copy link

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! PR title is fine, but maybe change "configs" to "config".

@anlowee anlowee changed the title feat: Add S3 configs to support S3 search. feat: Add S3 config to support S3 search. Aug 5, 2025
@kirkrodrigues
Copy link
Member

  1. Where are the docs for the new config? Are we going to add docs in another PR?
  2. At least we should document the new options and the rationale behind them in the PR description, right?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44caed7 and 09db05b.

📒 Files selected for processing (1)
  • velox/docs/configs.rst (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
velox/docs/develop/connectors.rst (1)

154-169: Tighten wording and add cross-reference; fix “CLP-s” phrasing

Minor editorial tweaks improve clarity and consistency.

-1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment
-   variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations
-   correctly. This function is invoked immediately after the configuration is parsed.
+1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment
+   variables required by CLP processes. This ensures that, at runtime, CLP components can execute S3-related operations
+   correctly. This function is invoked immediately after the configuration is parsed.
@@
-Additionally, this interface maintains a reference to ``config_``, enabling users to define custom configuration
-options for passing any required information.
+Additionally, this interface maintains a reference to ``config_``, enabling users to define custom configuration
+options for passing any required information. See also the CLP Connector section in :doc:`configs` for related
+configuration keys (e.g., ``clp.s3-auth-provider`` and ``clp.s3-*``).
♻️ Duplicate comments (5)
velox/docs/configs.rst (5)

913-925: Scope note (workers, S3-only) is good; clarify value case-sensitivity and usage timing

  • You already state it’s required when clp.storage-type = S3. Consider adding that this is used on workers.
  • Please clarify whether the allowed value is case-sensitive in the implementation.
-     - Specifies the S3 auth provider used by CLP to obtain authentication information (typically
+     - Specifies the S3 auth provider used by CLP to obtain authentication information on workers (typically
        the three environment variables ``AWS_ACCESS_KEY_ID``, ``AWS_SECRET_ACCESS_KEY``, and the
        optional ``AWS_SESSION_TOKEN``). This option is required when the ``clp.storage-type``
@@
-       in the config file. **Allowed values:** ``CLP_PACKAGE``.
+       in the config file. **Allowed values:** ``CLP_PACKAGE`` (clarify whether value matching is case-sensitive).

To verify case-sensitivity and any default handling, run:

#!/bin/bash
# Inspect how clp.s3-auth-provider is parsed and compared.
rg -n $'clp\\.s3-auth-provider' -A 5 -B 5
rg -n $'CLP_PACKAGE' -A 3 -B 3
# Look for toLower/strcasecmp or similar to infer case handling.
rg -n $'toLower|tolower|strcasecmp|equalsIgnoreCase' -A 2 -B 2

925-930: Add a brief security note to discourage plaintext credentials

Reinforce safe handling for the access key ID.

-     - The S3 access key ID, exported as the ``AWS_ACCESS_KEY_ID`` environment variable and used
-       for authentication.
+     - The S3 access key ID, exported as the ``AWS_ACCESS_KEY_ID`` environment variable and used
+       for authentication. Avoid storing credentials in plaintext configs; prefer secret managers
+       or environment injection and ensure strict file permissions if used.

930-935: Clarify S3 endpoint format (scheme, trailing slash) and keep key name consistent

Ensure operators provide a correctly formatted endpoint and avoid accidental double slashes.

-     - The S3 endpoint. For the CLP package, the complete split URL is formatted as
-       ``<clp.s3-end-point>/<split_path>``, where ``<split_path>`` is obtained from the CLP
-       package's metadata database and stored in the ``ClpConnectorSplit`` class.
+     - The S3 endpoint (must include scheme and host, e.g., ``https://s3.amazonaws.com``; omit trailing slash).
+       For the CLP package, the full split URL is constructed as:
+       ``<clp.s3-end-point>/<split_path>``, where ``<split_path>`` is obtained from the CLP
+       package’s metadata database and stored in the ``ClpConnectorSplit`` class.

936-941: Mark secret access key as sensitive and advise secure handling

Call out that this should not be stored in plaintext.

-     - The S3 secret access key, exported as the ``AWS_SECRET_ACCESS_KEY`` environment variable
-       and used for authentication.
+     - The S3 secret access key, exported as the ``AWS_SECRET_ACCESS_KEY`` environment variable
+       and used for authentication. Treat as sensitive; avoid plaintext storage and prefer
+       secure secret delivery mechanisms.

942-945: Clarify session token usage and rotation expectations

Note this is typically temporary and not auto-refreshed.

-     - The S3 session token, exported as the ``AWS_SESSION_TOKEN`` environment variable and used
-       for authentication. This option is optional.
+     - The S3 session token, exported as the ``AWS_SESSION_TOKEN`` environment variable and used
+       for authentication. Optional. Typically used with temporary credentials (e.g., STS) and
+       not auto-refreshed; ensure timely rotation if tokens expire.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09db05b and bc60f10.

📒 Files selected for processing (2)
  • velox/docs/configs.rst (1 hunks)
  • velox/docs/develop/connectors.rst (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (3)
velox/docs/develop/connectors.rst (3)

127-131: CLP connector overview reads well

The additions clearly set context and introduce the S3 auth provider interface. No blocking issues here.


144-148: LGTM: init flow description

Good clarity on recording query and split source and mapping output columns (including nested row types).


149-153: LGTM: split processing pipeline

Accurately reflects cursor creation from split path and lazy vector loading behaviour.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🔭 Outside diff range comments (1)
velox/docs/develop/connectors.rst (1)

156-170: Document the default provider and its config keys with an example

Given reviewer requests for config docs, add a short subsection describing the built-in provider and relevant keys.

 ClpS3AuthProviderBase
 ~~~~~~~~~~~~~~~~~~~~~
@@
 Additionally, this interface maintains a reference to ``config_``, enabling users to define custom configuration
 options for passing any required information.
+
+Default implementation: ``ClpPackageS3AuthProvider``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+When ``clp.s3-auth-provider`` is set to ``S3``, the built-in ``ClpPackageS3AuthProvider`` is used. It recognises the
+following configuration keys:
+
+- ``clp.s3-access-key-id`` → exported as ``AWS_ACCESS_KEY_ID``
+- ``clp.s3-secret-access-key`` → exported as ``AWS_SECRET_ACCESS_KEY``
+- ``clp.s3-session-token`` (optional) → exported as ``AWS_SESSION_TOKEN``
+- ``clp.s3-end-point`` → used to format the split URL (e.g., ``https://minio:9000``)
+
+Example:
+
+.. code-block:: ini
+
+   # Connector configuration
+   clp.s3-auth-provider = S3
+   clp.s3-access-key-id = <your-access-key-id>
+   clp.s3-secret-access-key = <your-secret-access-key>
+   # Optional for temporary credentials:
+   clp.s3-session-token = <your-session-token>
+   # S3-compatible endpoint (AWS S3, MinIO, etc.)
+   clp.s3-end-point = https://minio:9000
♻️ Duplicate comments (1)
velox/docs/develop/connectors.rst (1)

134-137: Clarify “partial URL” with concrete examples

Add explicit examples for complete vs partial S3 paths to remove ambiguity.

-``ClpConnectorSplit`` describes a data chunk using ``path``. This path may be the absolute file path to the split file
-if it is stored on a local file system, or the complete (or partial) URL of the split if it is stored on S3. In the
-latter case, when only a partial URL is provided, ``ClpS3AuthProviderBase`` provides a hook in ``ClpDataSource`` to
-assist in constructing the full URL. Refer to :ref:`ClpS3AuthProviderBase<ClpS3AuthProviderBase>` for details.
+``ClpConnectorSplit`` describes a data chunk using ``path``. This path may be the absolute file path to the split file
+if it is stored on a local file system, or the complete (or partial) URL of the split if it is stored on S3. For example,
+complete: ``s3://bucket/prefix/object.clp``, partial: ``bucket/prefix/object.clp``. When a partial URL is provided,
+``ClpS3AuthProviderBase`` provides a hook in ``ClpDataSource`` to construct the full URL. Refer to
+:ref:`ClpS3AuthProviderBase<ClpS3AuthProviderBase>` for details.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e62d41 and 838ab7f.

📒 Files selected for processing (2)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (1 hunks)
  • velox/docs/develop/connectors.rst (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (2)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)

56-56: Ignore explicit CLP storage‐type configuration in tests

The CLP tests already pass an empty config and ClpConfig defaults storage‐type to “FS” (local file system). Since S3 logic only kicks in when storage‐type is “S3”, none of the AWS environment variables will ever be used in these tests. No changes are needed.

Likely an incorrect or invalid review comment.

velox/docs/develop/connectors.rst (1)

144-148: Replace “KQL query” with neutral term unless KQL is intended

Is the query language actually KQL here? If not, use a neutral term to avoid confusion.

-During initialization, it records the KQL query and split source (S3 or local). It then iterates through
+During initialization, it records the query and split source (S3 or local). It then iterates through

std::make_shared<config::ConfigBase>(
std::unordered_map<std::string, std::string>{
{"clp.split-source", "local"}}));
std::unordered_map<std::string, std::string>{}));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Nit: include <unordered_map> explicitly.

You now reference std::unordered_map directly on this line. Consider adding the header to make the dependency explicit and avoid relying on transitive includes.

Add near the other includes:

#include <unordered_map>
🤖 Prompt for AI Agents
In velox/connectors/clp/tests/ClpConnectorTest.cpp around line 56, the test uses
std::unordered_map but the header <unordered_map> is not explicitly included;
add the line '#include <unordered_map>' alongside the other include directives
at the top of the file so the dependency is explicit and does not rely on
transitive includes.

Comment on lines +127 to 131
The CLP Connector is used to read CLP splits stored on a local file system or S3. It implements similar interfaces as
the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` interface and the
``ConnectorSplit`` interface implementation since ``Connector`` and ``ConnectorFactory`` are similar to the Hive
Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Surface the new config selector up-front

Briefly mention that authentication and URL construction are driven by the new config key so readers know where to look.

-Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication.
+Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication.
+Authentication and S3 URL construction are selected via the ``clp.s3-auth-provider`` configuration key.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
The CLP Connector is used to read CLP splits stored on a local file system or S3. It implements similar interfaces as
the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` interface and the
``ConnectorSplit`` interface implementation since ``Connector`` and ``ConnectorFactory`` are similar to the Hive
Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication.
The CLP Connector is used to read CLP splits stored on a local file system or S3. It implements similar interfaces as
the Hive Connector except for the ``DataSink`` interface. Here we only describe the ``DataSource`` interface and the
``ConnectorSplit`` interface implementation since ``Connector`` and ``ConnectorFactory`` are similar to the Hive
Connector. We also describe ``ClpS3AuthProviderBase``, an interface that allows users to customize S3 authentication.
Authentication and S3 URL construction are selected via the ``clp.s3-auth-provider`` configuration key.
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 127 to 131, the introduction
to the CLP Connector should briefly call out that authentication and URL
construction are driven by the new config selector key so readers know where to
look; add one clear sentence after the first line stating that the connector
uses a new config key (mention its exact name if available) to select S3
authentication and to construct S3/FS URLs, and point readers to the
ClpS3AuthProviderBase section for customization details.

Comment on lines +149 to +153
When a split is added, a ``ClpCursor`` is created with the split path and input source. The query is parsed
and simplified into an AST. On ``next``, the cursor finds matching row indices and, if any exist,
``ClpDataSource`` recursively creates a row vector composed of lazy vectors, which use CLP column readers to
decode and load data as needed during execution.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Optional: briefly link or reference “lazy vectors” for readers unfamiliar with the concept

If there’s an existing Velox doc section on LazyVector, consider adding a cross-reference. Otherwise, one short clause like “(see LazyVector)” helps orient readers.

🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 149 to 153, add a brief
reference to the LazyVector concept for readers unfamiliar with it; update the
sentence that mentions "lazy vectors" to include a short parenthetical
cross-reference like "(see LazyVector)" or, if a LazyVector section exists, link
to it (e.g., "see :ref:`lazy-vector`" or the appropriate doc anchor). Ensure the
wording is concise and non-disruptive, and use the existing documentation
cross-reference style used elsewhere in the docs.

Comment on lines +161 to +163
1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment
variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations
correctly. This function is invoked immediately after the configuration is parsed.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Clarify env vars and optionality; define “CLP-s” or avoid the term

Spell out which variables are exported and note that the session token is optional. Also consider defining “CLP-s” upon first use, or replace with “CLP search worker” if that’s what is meant.

-1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment
-   variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations
-   correctly. This function is invoked immediately after the configuration is parsed.
+1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the required AWS
+   credentials to the environment: ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` (and ``AWS_SESSION_TOKEN``
+   when using temporary credentials). This ensures that, at runtime, the CLP search worker can execute S3-related
+   operations correctly. This function is invoked immediately after the configuration is parsed.

🏁 Script executed:

#!/bin/bash
# Locate usage and implementation of exportAuthEnvironmentVariables in ClpConfig.cpp
rg -n "exportAuthEnvironmentVariables" -n velox/connectors/clp/ClpConfig.cpp

Length of output: 147


🏁 Script executed:

#!/bin/bash
# Search for exportAuthEnvironmentVariables implementation with context
rg -C5 "exportAuthEnvironmentVariables" -n velox/connectors

Length of output: 6354


🏁 Script executed:

#!/bin/bash
# Confirm the AWS access key environment variable name
rg -n "kEnvAwsAccessKeyId" -n velox/connectors/clp

Length of output: 429


Clarify exportAuthEnvironmentVariables docs

Spell out exactly which AWS environment variables are set and replace “CLP-s” with “CLP search worker” for clarity.

In velox/docs/develop/connectors.rst (lines 161–163), update the snippet to:

-1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the three environment
-   variables required by CLP-s to the system. This ensures that, at runtime, CLP-s can execute S3-related operations
-   correctly. This function is invoked immediately after the configuration is parsed.
+1. ``exportAuthEnvironmentVariables()`` – Parses user-defined configuration options and exports the required AWS
+   credentials to the environment: ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY``
+   (and ``AWS_SESSION_TOKEN`` when using temporary credentials). This ensures that, at runtime, the CLP search
+   worker can execute S3-related operations correctly. This function is invoked immediately after the configuration
+   is parsed.
🤖 Prompt for AI Agents
In velox/docs/develop/connectors.rst around lines 161 to 163, the docstring for
exportAuthEnvironmentVariables should explicitly list the three AWS environment
variables being exported and replace the term “CLP-s” with “CLP search worker”;
update the text to say that exportAuthEnvironmentVariables() exports
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN to the
environment so the CLP search worker can perform S3 operations, and keep the
note that this function is invoked immediately after configuration parsing.

Comment on lines +165 to +167
as its argument, allowing customization in URL construction. For example, the ``path`` could be ``"prefix/split"``,
which must be prefixed with ``"https://bucket.s3.region.amazonaws.com/"`` to form the complete URL.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Use non-ambiguous S3 URL examples (virtual-hosted vs path-style) and acknowledge custom endpoints

Current example may imply a single canonical form. Suggest showing both common styles and noting custom endpoints (e.g., MinIO).

-    as its argument, allowing customization in URL construction. For example, the ``path`` could be ``"prefix/split"``,
-    which must be prefixed with ``"https://bucket.s3.region.amazonaws.com/"`` to form the complete URL.
+    as its argument, allowing customization in URL construction. For example, the ``path`` could be
+    ``"bucket/prefix/object.clp"`` or simply ``"prefix/object.clp"``. The provider may construct the full URL using
+    virtual-hosted-style (e.g., ``https://bucket.s3.<region>.amazonaws.com/prefix/object.clp``), path-style
+    (e.g., ``https://s3.<region>.amazonaws.com/bucket/prefix/object.clp``), or a custom endpoint (e.g., MinIO).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
as its argument, allowing customization in URL construction. For example, the ``path`` could be ``"prefix/split"``,
which must be prefixed with ``"https://bucket.s3.region.amazonaws.com/"`` to form the complete URL.
as its argument, allowing customization in URL construction. For example, the ``path`` could be
``"bucket/prefix/object.clp"`` or simply ``"prefix/object.clp"``. The provider may construct the full URL using
virtual-hosted-style (e.g., ``https://bucket.s3.<region>.amazonaws.com/prefix/object.clp``), path-style
(e.g., ``https://s3.<region>.amazonaws.com/bucket/prefix/object.clp``), or a custom endpoint (e.g., MinIO).

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving due to time constraints. Will open another PR to fix things later.

For the PR title, how about:

feat(clp-connector): Add config to support searching archives from S3.

@anlowee anlowee changed the title feat: Add S3 config to support S3 search. feat(clp-connector): Add config to support searching archives from S3. Aug 12, 2025
@anlowee anlowee merged commit d78134d into y-scope:presto-0.293-clp-connector Aug 12, 2025
9 checks passed
@anlowee anlowee deleted the xwei/s3-config branch August 12, 2025 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants