Skip to content

Conversation

sitaowang1998
Copy link
Contributor

@sitaowang1998 sitaowang1998 commented Oct 14, 2025

Description

This PR adds pydantic annotated serializer for pathlib.Path and StrEnum to avoid adding custom model serialization for fields of these types.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Celery compression works.

Summary by CodeRabbit

  • Improvements

    • Standardized string-serialized enums and unified path handling across configuration for more consistent cross‑platform behaviour and stronger validation.
  • New Features

    • Added explicit serialization utilities for enums and filesystem paths to stabilise exported/portable formats.
  • Refactor

    • Simplified configuration serialization by removing legacy primitive‑dict export paths and updating initialization/validation semantics.

junhaoliao and others added 23 commits September 29, 2025 22:46
…ost type and removing duplicated validators.
@sitaowang1998 sitaowang1998 requested a review from a team as a code owner October 14, 2025 15:39
Copy link
Contributor

coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds string-serialized enum and Path type aliases, new serialization utilities, replaces enum and pathlib.Path fields across config/storage models with annotated aliases, removes many custom primitive-dump methods, and updates validation/initialization and enum comparison logic.

Changes

Cohort / File(s) Summary
Serialization utilities & type aliases
components/clp-py-utils/clp_py_utils/serialization_utils.py, components/clp-py-utils/clp_py_utils/clp_config.py
Added serialize_str_enum(member: StrEnum) -> str and serialize_path(path: pathlib.Path) -> str. Added StrEnumSerializer = PlainSerializer(serialize_str_enum) and SerializablePath = Annotated[pathlib.Path, PlainSerializer(serialize_path)].
String-serialized enum aliases
components/clp-py-utils/clp_py_utils/clp_config.py
Introduced StorageEngineStr, DatabaseEngineStr, QueryEngineStr, AwsAuthTypeStr aliases and applied them to model fields.
Model enum field updates
components/clp-py-utils/clp_py_utils/clp_config.py
Replaced enum-typed fields with string-serialized aliases (e.g., Package.storage_engine, Package.query_engine, Database.type, AwsAuthentication.type) and replaced .value usage with direct enum comparisons.
Model path field updates
components/clp-py-utils/clp_py_utils/clp_config.py
Replaced pathlib.Path fields with SerializablePath for storage and config attributes (e.g., FsStorage.directory, S3Storage.staging_directory, ingestion/storage/output configs, CLPConfig path attributes and private path attrs).
Validation, init and path handling
components/clp-py-utils/clp_py_utils/clp_config.py
Updated signatures and logic for expand_profile_user_home, validate_aws_config_dir, path normalization, getters/setters and related initialization to operate on annotated Path and enum aliases.
Removed/disabled primitive dump methods
components/clp-py-utils/clp_py_utils/clp_config.py
Removed multiple dump_to_primitive_dict implementations across models (Package, storage types, Archive/Stream outputs, CLPConfig, WorkerConfig) in favour of serializer wiring.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Model as PydanticModel
  participant Serializer as serialization_utils
  rect rgba(135,206,250,0.08)
    User->>Model: instantiate(raw input with enums/paths)
    activate Model
    Model->>Model: validate/normalize fields -> StrEnum / SerializablePath
    Model-->>User: constructed instance
    deactivate Model
  end
  rect rgba(144,238,144,0.07)
    User->>Model: request serialization
    activate Model
    Model->>Serializer: serialize_str_enum / serialize_path
    Serializer-->>User: primitive strings
    deactivate Model
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "Simplify StrEnum and Path serialization via Annotated serializers" directly corresponds to the core objective stated in the PR description: to add Pydantic Annotated serializers for pathlib.Path and StrEnum to simplify serialization. The title accurately reflects the main changes, which include introducing serialize_str_enum and serialize_path functions, creating type aliases (StorageEngineStr, QueryEngineStr, DatabaseEngineStr, AwsAuthTypeStr, and SerializablePath), and applying these serializers across multiple configuration models. The title is concise, specific, and avoids generic terminology, making it clear to reviewers what primary change the PR addresses.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6b6c8d and 80a7164.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/clp_config.py (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/serialization_utils.py (2)
  • serialize_path (16-23)
  • serialize_str_enum (6-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (macos-15)
🔇 Additional comments (8)
components/clp-py-utils/clp_py_utils/clp_config.py (8)

13-13: LGTM! Clean implementation of serialization infrastructure.

The imports and type alias organization follow the suggested structure, with serializers defined first followed by type aliases. The SerializablePath and StrEnumSerializer provide a clean foundation for eliminating custom serialization logic throughout the codebase.

Also applies to: 25-25, 103-114


122-122: Excellent consistency in enum type aliases.

The uniform naming pattern ([EnumName]Str) and consistent application of StrEnumSerializer provides a clean, maintainable approach to enum serialization across all enum types.

Also applies to: 130-130, 139-139, 154-154


158-159: Field type updates correctly applied.

The annotated enum types maintain backward compatibility while enabling automatic string serialization. The validator logic at lines 161-181 will continue to work correctly since the Annotated wrapper doesn't affect runtime enum comparison behavior.


185-185: Consistent application of annotated types across all models.

All enum and path fields have been systematically updated to use their respective annotated serializers. The existing validators and path manipulation methods remain compatible since Annotated types don't change runtime behavior.

Also applies to: 373-373, 419-419, 434-434, 457-457, 461-461, 465-465, 469-469, 473-473


575-575: All path fields properly annotated in CLPConfig.

The consistent use of SerializablePath across all path fields (including private attributes) improves type safety and maintainability. The automatic serialization will simplify the dump_to_primitive_dict method.

Also applies to: 581-583, 585-588


590-597: Validator signature correctly updated with return type annotation.

The parameter and return types are consistent, and the return type annotation has been properly added to address previous review feedback.


746-746: WorkerConfig path field updated for consistency.

The data_directory field now uses SerializablePath, maintaining consistency with the CLPConfig model and enabling automatic serialization.


680-680: Enum comparison works correctly with annotated types.

The comparison between AwsAuthType.profile and auth.type (which is AwsAuthTypeStr) functions correctly because the Annotated wrapper only affects serialization behavior, not runtime type comparison.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

672-672: Fix broken enum comparison after type change.

After changing AwsAuthentication.type to AwsAuthTypeStr (an enum with string serialization), the comparison at Line 672 is now broken. The code compares AwsAuthType.profile.value (a string) with auth.type (an AwsAuthType enum object), which will always evaluate to False. This breaks the validation logic that checks whether profile authentication is used.

Apply this diff to fix the comparison:

-        if AwsAuthType.profile.value == auth.type:
+        if AwsAuthType.profile == auth.type:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77b122c and 3a57db7.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_config.py (9 hunks)
  • components/clp-py-utils/clp_py_utils/pydantic_serialization_utils.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-24.04)
🔇 Additional comments (8)
components/clp-py-utils/clp_py_utils/pydantic_serialization_utils.py (2)

6-6: LGTM!

The StrEnumSerializer correctly extracts the string value from enum instances for serialization, enabling clean enum-to-string conversion in Pydantic models.


8-8: LGTM!

The PathStr type correctly serializes pathlib.Path objects as strings, eliminating the need for custom serialization methods in models.

components/clp-py-utils/clp_py_utils/clp_config.py (6)

24-24: LGTM!

The import correctly brings in the new serialization utilities that enable cleaner, more declarative model serialization.


118-118: LGTM!

The string-serialized enum type aliases correctly wrap the base enum types with StrEnumSerializer, ensuring enums are serialized as their string values rather than enum objects.

Also applies to: 126-126, 135-135, 150-150


154-155: LGTM!

The field type updates correctly use the string-serialized enum aliases, maintaining enum type safety at runtime while enabling automatic string serialization.

Also applies to: 181-181, 369-369


373-398: LGTM!

The pre-validation hook correctly validates the authentication configuration, ensuring that required fields are set based on the authentication type and preventing invalid combinations.


415-415: LGTM!

The path fields correctly use PathStr to enable automatic string serialization, eliminating the need for custom serialization methods.

Also applies to: 430-430, 453-453, 457-457, 461-461, 465-465, 469-469


571-571: LGTM!

The CLPConfig path fields correctly use PathStr for automatic string serialization, including the private attributes.

Also applies to: 577-582

Comment on lines 1 to 8
import pathlib
from typing import Annotated

from pydantic import PlainSerializer

StrEnumSerializer = PlainSerializer(lambda enum_value: enum_value.value)

PathStr = Annotated[pathlib.Path, PlainSerializer(lambda path_value: str(path_value))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider adding docstrings for public utilities.

Adding a module docstring and docstrings for StrEnumSerializer and PathStr would improve discoverability and maintainability. Consider also adding an __all__ export list to explicitly declare the public API.

Apply this diff to add documentation:

+"""
+Pydantic serialization utilities for common types.
+
+Provides annotated types and serializers to handle enum and path serialization
+without requiring custom model serialization methods.
+"""
+
 import pathlib
 from typing import Annotated
 
 from pydantic import PlainSerializer
 
+# Serializer that converts enum instances to their string values
 StrEnumSerializer = PlainSerializer(lambda enum_value: enum_value.value)
 
+# Annotated type for pathlib.Path that serializes to string
 PathStr = Annotated[pathlib.Path, PlainSerializer(lambda path_value: str(path_value))]
+
+__all__ = ["StrEnumSerializer", "PathStr"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import pathlib
from typing import Annotated
from pydantic import PlainSerializer
StrEnumSerializer = PlainSerializer(lambda enum_value: enum_value.value)
PathStr = Annotated[pathlib.Path, PlainSerializer(lambda path_value: str(path_value))]
"""
Pydantic serialization utilities for common types.
Provides annotated types and serializers to handle enum and path serialization
without requiring custom model serialization methods.
"""
import pathlib
from typing import Annotated
from pydantic import PlainSerializer
# Serializer that converts enum instances to their string values
StrEnumSerializer = PlainSerializer(lambda enum_value: enum_value.value)
# Annotated type for pathlib.Path that serializes to string
PathStr = Annotated[pathlib.Path, PlainSerializer(lambda path_value: str(path_value))]
__all__ = ["StrEnumSerializer", "PathStr"]
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/pydantic_serialization_utils.py around
lines 1 to 8, add a module-level docstring describing the purpose (helpers to
create Pydantic PlainSerializers for common types), add concise
docstrings/comments for the StrEnumSerializer and PathStr explaining what they
serialize and expected input/output (e.g., StrEnumSerializer converts Enum
instances to their .value string; PathStr serializes pathlib.Path to str), and
add an __all__ list exporting ("StrEnumSerializer", "PathStr") so the public API
is explicit; keep wording short and place docstrings immediately above the
definitions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the file name, how about just file_utils? Though I understand the utils here likely are only compatible with Pydantic, the name "Pydantic" is still a bit implementation-specific which i feel can be omitted from the name.

Copy link
Member

@junhaoliao junhaoliao Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, can we also:

  • Only export the lambda functions in serialization_utils.py: serialize_path = lambda ...
  • Create the Pydantic types in clp_config.py (or move them into a separate types.py source file. maybe in a future refactoring PR) e.g.,
# in clp_config.py
PathStr = Annotated[
    pathlib.Path,
    PlainSerializer(serialize_path)
]

so that we can decouples the serialization utils from Pydantic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel TypeStr is not a very intuitive name. Is there a better suggestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with another name, but file_utils is confusing as it is not even close to what this file is doing.

Comment on lines 581 to 582
_container_image_id_path: PathStr = PrivateAttr(default=CLP_PACKAGE_CONTAINER_IMAGE_ID_PATH)
_version_file_path: PathStr = PrivateAttr(default=CLP_VERSION_FILE_PATH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just to show we have considered -)

Since these are PrivateAttr, they won't serialize by default anyway. The type change is harmless though.


StrEnumSerializer = PlainSerializer(lambda enum_value: enum_value.value)

PathStr = Annotated[pathlib.Path, PlainSerializer(lambda path_value: str(path_value))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i'm not sure if this is worth doing but we shall investigate)

can we embed the validator _validate_directory with BeforeValidator() into the Annotation as well?
https://docs.pydantic.dev/latest/api/functional_validators/#pydantic.functional_validators.BeforeValidator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried. Somehow annotated BeforeValidator is not triggered. I haven't figured out why. Maybe we can ask coderabbit to open an issue to track this?

@sitaowang1998 sitaowang1998 changed the title refactor(cl-package): Use annoated serializer to avoid custom serialization. refactor(clp-package): Use annoated serializer to avoid custom serialization. Oct 15, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e487f82 and da5e3e7.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_config.py (10 hunks)
  • components/clp-py-utils/clp_py_utils/serialization_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
components/clp-py-utils/clp_py_utils/serialization_utils.py (2)
  • serialize_enum (5-12)
  • serialize_path (15-22)
components/clp-py-utils/clp_py_utils/core.py (1)
  • make_config_path_absolute (42-53)
🪛 Ruff (0.14.0)
components/clp-py-utils/clp_py_utils/clp_config.py

119-119: Missing return type annotation for private function _validate_directory

(ANN202)


119-119: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)


129-129: Prefer TypeError exception for invalid type

(TRY004)


129-129: Avoid specifying long messages outside the exception class

(TRY003)


132-132: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
🔇 Additional comments (14)
components/clp-py-utils/clp_py_utils/serialization_utils.py (1)

15-22: LGTM!

The implementation correctly converts a pathlib.Path to its string representation.

components/clp-py-utils/clp_py_utils/clp_config.py (13)

9-9: LGTM!

The new imports support the Pydantic annotated serializers approach for path and enum serialization.

Also applies to: 14-14, 26-26


114-116: Clean type alias design.

The StrEnumSerializer and PathStr type aliases provide a consistent, reusable pattern for serialization. Note that these depend on the serialize_enum function being fixed (see the critical bug identified in serialization_utils.py).


147-147: LGTM!

The enum type aliases consistently apply the StrEnumSerializer for automatic string serialization across all enum types.

Also applies to: 155-155, 164-164, 179-179


183-184: LGTM!

The Package model correctly uses the string-serialized enum type aliases for automatic serialization.


210-210: LGTM!

The Database model correctly uses DatabaseEngineStr for automatic string serialization.


398-398: LGTM!

The AwsAuthentication model correctly uses AwsAuthTypeStr for automatic string serialization.


444-444: LGTM!

Using ValidatedPathStr ensures the directory path is validated before being converted to pathlib.Path.


453-453: LGTM!

Using ValidatedPathStr ensures the staging directory path is validated before being converted to pathlib.Path.


470-470: Verify intentional removal of directory validation.

This line changes from ValidatedPathStr (inherited from parent FsStorage) to PathStr, which removes the validation that ensures the value is a non-empty string. Given the default value is pathlib.Path("/"), this change may be intentional to allow root directory ingestion without validation constraints.

Please confirm this is intentional. If validation should be retained, consider using ValidatedPathStr instead:

-    directory: PathStr = pathlib.Path("/")
+    directory: ValidatedPathStr = pathlib.Path("/")

474-474: LGTM!

Using PathStr without validation is appropriate here since these fields have sensible defaults and validation happens later via make_config_paths_absolute and related methods.

Also applies to: 478-478, 482-482, 486-486


588-588: LGTM!

Changing these fields to PathStr ensures consistent path serialization across the configuration model.

Also applies to: 594-595


598-599: Type change is harmless for private attributes.

As noted in previous reviews, since these are PrivateAttr, they won't serialize by default anyway. The type change to PathStr is harmless and maintains consistency.

Based on learnings


689-689: LGTM!

The comparison auth.type == AwsAuthType.profile works correctly even though auth.type is now AwsAuthTypeStr, as Pydantic preserves the underlying enum for comparisons.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

765-777: Address static analysis hints for better code quality.

The _validate_directory function should be improved per the static analysis hints already flagged in previous reviews:

  1. Add return type annotation (-> str)
  2. Use TypeError instead of ValueError when validating the type at line 774
  3. Consider extracting exception messages into constants
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da5e3e7 and c80de48.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_config.py (11 hunks)
  • components/clp-py-utils/clp_py_utils/serialization_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/serialization_utils.py (2)
  • serialize_enum (5-12)
  • serialize_path (15-22)
🪛 GitHub Actions: clp-lint
components/clp-py-utils/clp_py_utils/serialization_utils.py

[error] 1-1: ruff: Import block is un-sorted or un-formatted (I001). 1 fixable with --fix.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
🔇 Additional comments (8)
components/clp-py-utils/clp_py_utils/serialization_utils.py (2)

5-12: LGTM! Bug fix applied correctly.

The function now correctly accesses enum_value.value as a property rather than calling it as a method.


15-22: LGTM!

The implementation correctly converts pathlib.Path to its string representation.

components/clp-py-utils/clp_py_utils/clp_config.py (6)

13-13: LGTM!

The imports are correctly added to support the annotated serializer refactoring.

Also applies to: 25-25


113-115: LGTM!

The reusable type aliases for StrEnumSerializer and PathStr are well-defined and will reduce duplication across model fields.


123-124: LGTM!

The string-serialized enum type aliases follow a consistent pattern and will ensure enums are properly serialized to their string values.

Also applies to: 131-132, 140-141, 155-156


159-160: LGTM!

The model fields are correctly updated to use string-serialized enum types, which will automatically handle serialization without custom dump_to_primitive_dict methods.

Also applies to: 186-186, 374-374


420-420: LGTM!

The path fields are correctly updated to use PathStr, which will automatically serialize paths to strings via the PlainSerializer.

Also applies to: 435-435, 458-458, 462-462, 466-466, 470-470, 474-474, 576-576, 582-583


677-677: LGTM!

The enum comparison is still correct. The AwsAuthTypeStr annotation only affects serialization; at runtime, auth.type remains an AwsAuthType instance, so the comparison works as expected.

Copy link
Member

@junhaoliao junhaoliao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm

for the title, how about:

refactor(clp-package): Simplify StrEnum and Path serialization via Annotated serializers.

The "A" in "Annotated" is intentionally capitalized because it's a Python feature

return connection_params_and_type

def dump_to_primitive_dict(self):
d = self.model_dump(exclude={"username", "password"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a separate PR, we shall investigate
specifying private attributes like username as https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.PrivateAttr
and
specifying secrets like password as https://docs.pydantic.dev/2.2/usage/types/secrets/

so we can completely get rid of such dump_to_primitive_dict() methods.

@coderabbitai create an issue to track

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Create a GitHub issue to track the investigation
gh issue create \
  --repo y-scope/clp \
  --title "Investigate using Pydantic PrivateAttr and SecretStr to eliminate dump_to_primitive_dict methods" \
  --body "## Context

This issue tracks a follow-up refactoring task identified during PR #1417.

## Task

Investigate using Pydantic features to eliminate the need for custom \`dump_to_primitive_dict()\` methods:

1. Use [PrivateAttr](https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.PrivateAttr) for private attributes like \`username\`
2. Use [SecretStr](https://docs.pydantic.dev/2.2/usage/types/secrets/) for sensitive fields like \`password\`

This would allow complete removal of \`dump_to_primitive_dict()\` methods throughout the codebase.

## References

- PR: https://github.com/y-scope/clp/pull/1417
- Comment: https://github.com/y-scope/clp/pull/1417#discussion_r2442428522
- Requested by: @junhaoliao" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao I've created issue #1440 to track this investigation: #1440

The issue covers exploring Pydantic's PrivateAttr for private attributes like username and SecretStr for sensitive fields like password, which would allow the removal of dump_to_primitive_dict() methods throughout the codebase.


🧬 Code Graph Analysis Results

Filename: components/clp-py-utils/clp_py_utils/clp_config.py
Line numbers: N/A (summary)

  • Summary: Private attributes used for config paths

    • _container_image_id_path: PrivateAttr with default path CLP_PACKAGE_CONTAINER_IMAGE_ID_PATH
    • _version_file_path: PrivateAttr with default path CLP_VERSION_FILE_PATH
    • Purpose: Conceal internal paths from model serialization; supports lazy/controlled access for container image and version resolution.
  • Summary: dump_to_primitive_dict methods (exclusions to avoid leaking sensitive data)

    • Database.dump_to_primitive_dict
      • Returns: dict representation of the Database model excluding username and password
      • Behavior: Ensures credentials are not dumped to primitive form
    • Redis.dump_to_primitive_dict
      • Returns: dict representation of the Redis model excluding password
    • Queue.dump_to_primitive_dict
      • Returns: dict representation of the Queue model excluding username and password
    • CLPConfig.dump_to_primitive_dict
      • Returns: composite dict where database, queue, and redis are dumped via their respective dump_to_primitive_dict methods
      • Also converts aws_config_directory to a primitive value (string or None)
      • Purpose: Centralized primitive-dict export that omits sensitive credentials
  • Summary: Related credential loading methods (context for removal of dump methods)

    • Database.load_credentials_from_file/load_credentials_from_env
    • Redis.load_credentials_from_file/load_credentials_from_env
    • Queue.load_credentials_from_file/load_credentials_from_env
    • Note: These methods indicate sensitivity of credential data, motivating the move to private attributes and excluding credentials in primitive dumps.

_version_file_path: PathStr = PrivateAttr(default=CLP_VERSION_FILE_PATH)

@field_validator("aws_config_directory")
@classmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we changed the above aws_config_directory type as Optional[SerializablePath], we shall update below value type as Optional[SerializablePath] in below validator prototype for consistency

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked other references of pathlib.Path and believe they are fine to be left as pathlib.Path because those are used in public interfaces and the callers do not need to make the paths serializable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we have to manually serialize them in the dump_to_primitive_dict.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Sorry if i may have confused you, i meant to say

I checked other references of pathlib.Path and believe "all the other references" are fine to be left as pathlib.Path

-> the value argument in the validator of aws_config_directory is the ONLY argument whose type should be updated as SerializablePath
-> that is, let's not change the type from pathlib.Path to SerializablePath in any def make_config_paths_absolute(self, clp_home: pathlib.Path) or def load_credentials_from_file(self, credentials_file_path: pathlib.Path) prototypes

@sitaowang1998 sitaowang1998 changed the title refactor(clp-package): Use annoated serializer to avoid custom serialization. refactor(clp-package): Simplify StrEnum and Path serialization via Annotated serializers. Oct 18, 2025
sitaowang1998 and others added 2 commits October 18, 2025 11:15
Co-authored-by: Junhao Liao <junhao@junhao.ca>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (2)

590-594: Update validator parameter type for consistency.

If aws_config_directory is changed to PathStr (as required to fix line 584), consider updating the validator's value parameter type from Optional[pathlib.Path] to Optional[PathStr] for consistency.

Based on learnings

     @field_validator("aws_config_directory")
     @classmethod
-    def expand_profile_user_home(cls, value: Optional[pathlib.Path]):
+    def expand_profile_user_home(cls, value: Optional[PathStr]):
         if value is not None:
             value = value.expanduser()
         return value

717-733: Consider removing redundant custom serialization.

If aws_config_directory is changed to PathStr as suggested above, lines 728-731 become redundant because the field will already serialize to a string automatically via the PlainSerializer. The custom serialization could then be removed.

Based on learnings

After fixing the type at line 584, apply this diff to remove the redundant serialization:

     def dump_to_primitive_dict(self):
         custom_serialized_fields = {
             "database",
             "queue",
             "redis",
         }
         d = self.model_dump(exclude=custom_serialized_fields)
         for key in custom_serialized_fields:
             d[key] = getattr(self, key).dump_to_primitive_dict()
-
-        # Turn paths into primitive strings
-        if self.aws_config_directory is not None:
-            d["aws_config_directory"] = str(self.aws_config_directory)
-        else:
-            d["aws_config_directory"] = None
 
         return d
♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

113-115: Consider renaming PathStr to SerializablePath for clarity.

The name PathStr is potentially misleading since the annotated type is pathlib.Path, not str. The PlainSerializer only affects serialization behaviour, not the runtime type. Consider renaming to SerializablePath to better reflect that it's a serializable Path type.

Based on learnings

Apply this diff if you agree with the renaming:

-StrEnumSerializer = PlainSerializer(serialize_enum)
-
-PathStr = Annotated[pathlib.Path, PlainSerializer(serialize_path)]
+StrEnumSerializer = PlainSerializer(serialize_enum)
+
+SerializablePath = Annotated[pathlib.Path, PlainSerializer(serialize_path)]

Note: This would require updating all usages of PathStr throughout the file to SerializablePath.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f9bfc9 and ca09544.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_config.py (11 hunks)
  • components/clp-py-utils/clp_py_utils/serialization_utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/clp-py-utils/clp_py_utils/serialization_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/serialization_utils.py (2)
  • serialize_enum (6-13)
  • serialize_path (16-23)
🪛 Ruff (0.14.0)
components/clp-py-utils/clp_py_utils/clp_config.py

584-584: Undefined name SerializablePath

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
🔇 Additional comments (11)
components/clp-py-utils/clp_py_utils/clp_config.py (9)

13-13: LGTM!

The PlainSerializer import is correctly added to support the new annotated serializers.


25-25: LGTM!

The import of serialization utilities is correct and necessary for the new type aliases.


123-123: LGTM!

The string-serialized enum type aliases are correctly defined using the StrEnumSerializer and follow a consistent naming pattern.

Also applies to: 131-131, 140-140, 155-155


159-160: LGTM!

The field type changes to use the new string-serialized enum aliases are correct and will enable automatic serialization without custom methods.

Also applies to: 186-186, 374-374


420-420: LGTM!

The path fields are correctly updated to use the PathStr alias, enabling automatic string serialization.

Also applies to: 435-435, 458-458, 462-462, 466-466, 470-470, 474-474


576-576: LGTM!

The CLPConfig path fields are correctly updated to use the PathStr alias for automatic serialization.

Also applies to: 582-583


586-587: LGTM!

The private attributes are correctly typed with PathStr. As noted in past comments, since these are PrivateAttr, they won't serialize by default anyway, making the type change harmless.

Based on learnings


677-677: LGTM!

Direct access to auth.type is correct. Since type is now AwsAuthTypeStr (which serializes to string), the comparison with AwsAuthType.profile will work correctly as the underlying value is compared.


749-749: Consider updating field type to SerializablePath.

Based on past review comments, consider whether data_directory in WorkerConfig should also use SerializablePath (or PathStr) for consistency with the changes in CLPConfig.

Based on learnings

Should this field be updated to match the serialization approach used in CLPConfig?

components/clp-py-utils/clp_py_utils/serialization_utils.py (2)

1-3: LGTM!

The imports are correctly formatted with proper spacing between standard library and third-party imports.


16-23: LGTM!

The serialize_path function is correctly implemented with accurate docstring documentation.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca09544 and 8e587e3.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_config.py (12 hunks)
  • components/clp-py-utils/clp_py_utils/serialization_utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/clp-py-utils/clp_py_utils/serialization_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/serialization_utils.py (2)
  • serialize_path (16-23)
  • serialize_str_enum (6-13)
🪛 Ruff (0.14.0)
components/clp-py-utils/clp_py_utils/clp_config.py

592-592: Missing return type annotation for classmethod expand_profile_user_home

(ANN206)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: package-image
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/clp_config.py (4)

13-13: LGTM: Serialization infrastructure properly set up.

The imports and serializer definitions are correct. The StrEnumSerializer and SerializablePath type aliases properly leverage the new serialization utilities to enable automatic string serialization of StrEnum and Path fields in Pydantic models.

Also applies to: 25-25, 103-114


122-122: LGTM: String-serialized enum aliases are well-defined.

The enum aliases (StorageEngineStr, DatabaseEngineStr, QueryEngineStr, AwsAuthTypeStr) follow a consistent naming convention and properly apply the StrEnumSerializer to enable automatic string serialization whilst preserving enum type safety during validation.

Also applies to: 130-130, 139-139, 154-154


158-159: LGTM: Model field types correctly updated to use serializable aliases.

The field type updates are consistent and correct:

  • Enum fields now use string-serialized aliases (StorageEngineStr, DatabaseEngineStr, QueryEngineStr, AwsAuthTypeStr)
  • Path fields now use SerializablePath alias for automatic string serialization
  • Private attributes also properly use SerializablePath

This enables automatic serialization without requiring custom dump_to_primitive_dict methods for these fields.

Also applies to: 185-185, 373-373, 419-419, 434-434, 457-457, 461-461, 465-465, 469-469, 473-473, 575-575, 581-583, 585-588


666-694: LGTM: Validation logic correctly uses enum object comparison.

The validation logic at line 678 correctly compares enum objects directly (AwsAuthType.profile == auth.type) rather than comparing string values. This is the proper approach when using string-serialized enum aliases, as the runtime type remains an enum object with automatic serialization behavior.

sitaowang1998 and others added 2 commits October 18, 2025 13:53
Co-authored-by: Junhao Liao <junhao@junhao.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants