Fix trainer detection for custom Docker images with regex pattern matching #31

jskswamy · 2025-06-18T12:19:57Z

Problem

Custom DeepSpeed Docker images were losing the mpirun command and falling back to torchrun instead. The get_runtime_trainer function only used a hardcoded ALL_TRAINERS mapping, so any custom images not in this mapping would default to the PyTorch trainer configuration.

Solution

Enhanced trainer detection with regex-based pattern matching as a fallback mechanism:

First priority: Check existing ALL_TRAINERS mapping for exact matches
Second priority: Use regex patterns to detect framework from image names:
- DeepSpeed: (?i)deepspeed (case-insensitive)
- MLX: (?i)mlx
- TorchTune: (?i)torchtune
- PyTorch: (?i)(pytorch|torch) (but not torchtune)
Fallback: Default to PyTorch trainer if no patterns match

Key Changes

Added _detect_trainer_from_image_patterns() function with case-insensitive regex matching
Modified _detect_trainer() to use pattern matching as fallback
Added copy.deepcopy() to prevent shared state issues between trainer configurations
Comprehensive test suite with 76 test cases covering various image name formats

Testing

✅ Known images from ALL_TRAINERS mapping
✅ Custom images with various case formats (lowercase, uppercase, mixed case)
✅ Images with registry prefixes, ports, and complex paths
✅ Edge cases and fallback scenarios
✅ Accelerator count logic with ML policies
✅ State isolation between test runs

This ensures custom DeepSpeed images like my-org/deepspeed-custom:latest correctly use mpirun instead of falling back to torchrun.

This fixes the issue #29

google-oss-prow · 2025-06-18T12:20:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

eoinfennessy

@jskswamy thank you for this contribution! And thank you for also adding tests for get_container_devices.

It looks great to me -- I just added two small suggestions.

eoinfennessy · 2025-06-20T13:47:45Z

python/tests/test_utils.py

+            # Edge cases - no match (should fall back to default)
+            ("unknown-image:latest", types.TrainerFramework.TORCH),
+            ("", types.TrainerFramework.TORCH),
+            ("nginx:latest", types.TrainerFramework.TORCH),
+            ("ubuntu:20.04", types.TrainerFramework.TORCH),
+        ],
+    )
+    def test_trainer_detection_from_image_patterns(
+        self, image_name, expected_framework
+    ):
+        """Test trainer detection using image pattern matching with various case scenarios."""
+        trainer = utils._detect_trainer_from_image_patterns(image_name)
+        if expected_framework == types.TrainerFramework.TORCH and trainer is None:
+            # For unknown images, the _detect_trainer function should return default
+            # but _detect_trainer_from_image_patterns returns None
+            assert trainer is None
+        else:
+            assert trainer is not None
+            assert trainer.framework.value == expected_framework.value


Small suggestion to replace expected_framework with None for no-match cases. I think this make the behavior of the function being tested clearer for readers.

Suggested change

# Edge cases - no match (should fall back to default)

("unknown-image:latest", types.TrainerFramework.TORCH),

("", types.TrainerFramework.TORCH),

("nginx:latest", types.TrainerFramework.TORCH),

("ubuntu:20.04", types.TrainerFramework.TORCH),

],

)

def test_trainer_detection_from_image_patterns(

self, image_name, expected_framework

):

"""Test trainer detection using image pattern matching with various case scenarios."""

trainer = utils._detect_trainer_from_image_patterns(image_name)

if expected_framework == types.TrainerFramework.TORCH and trainer is None:

# For unknown images, the _detect_trainer function should return default

# but _detect_trainer_from_image_patterns returns None

assert trainer is None

else:

assert trainer is not None

assert trainer.framework.value == expected_framework.value

# Edge cases - no match

("unknown-image:latest", None),

("", None),

("nginx:latest", None),

("ubuntu:20.04", None),

],

)

def test_trainer_detection_from_image_patterns(

self, image_name, expected_framework

):

"""Test trainer detection using image pattern matching with various case scenarios."""

trainer = utils._detect_trainer_from_image_patterns(image_name)

if expected_framework is None:

# For unknown images _detect_trainer_from_image_patterns returns None

assert trainer is None

else:

assert trainer is not None

assert trainer.framework.value == expected_framework.value

eoinfennessy · 2025-06-20T13:52:40Z

python/kubeflow/trainer/types/types.py

+# Trainer framework constants for easy reference
+class TrainerFramework(Enum):
+    """Trainer framework constants."""
+    TORCH = "torch"
+    DEEPSPEED = "deepspeed"
+    MLX = "mlx"
+    TORCHTUNE = "torchtune"


This is exactly the same as the Framework enum. Can we delete this and use Framework instead?

jskswamy · 2025-06-23T09:15:53Z

@eoinfennessy made all the suggested changes.

eoinfennessy

@jskswamy LGTM! Thank you!

google-oss-prow · 2025-06-23T09:41:46Z

@eoinfennessy: changing LGTM is restricted to collaborators

In response to this:

@jskswamy LGTM! Thank you!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andreyvelich

Thank you for this contribution @jskswamy 🎉

python/uv.lock

python/tests/test_utils.py

andreyvelich · 2025-06-27T23:37:05Z

python/kubeflow/trainer/utils/utils.py

+    if ml_policy.torch and ml_policy.torch.num_proc_per_node is not None:
        num_proc = ml_policy.torch.num_proc_per_node.actual_instance
        if isinstance(num_proc, int):
            trainer.accelerator_count = num_proc
-    elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node:
+    elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node is not None:


Why do we need to add is not None here ?

1. Torch Policy Check (if trainer_container.accelerator_count is not None)

# Essential: Prevents AttributeError when accessing None.actual_instance # Without this check: None.actual_instance would raise AttributeError if trainer_container.accelerator_count is not None: if hasattr(trainer_container.accelerator_count, 'actual_instance'): trainer.accelerator_count = trainer_container.accelerator_count.actual_instance

2. MPI Policy Check (if trainer_container.mpi_policy is not None)

# Essential: Prevents setting accelerator_count to None when user explicitly sets it # Without this check: trainer.accelerator_count would be overwritten to None if trainer_container.mpi_policy is not None: trainer.accelerator_count = trainer_container.mpi_policy.num_procs

3. Semantic Correctness

These checks ensure that:

User-provided values are preserved and not overwritten

We don't attempt operations on None objects

The logic follows "only apply changes if the field is actually set"

Code Comments Added:

I've added explanatory comments to each check to make their necessity clear for future maintainers.

@jskswamy I just meant that those 2 lines are the same in Python, isn't ?

elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node: elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node is not None:

I think it's a subtle but important distinction in Python.

The is not None Check is Necessary

The current code is correct because 0 is a valid and meaningful value for num_proc_per_node:

# Current correct implementation elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node is not None: trainer.accelerator_count = ml_policy.mpi.num_proc_per_node

Why Truthiness Checking Would Break CPU-Only Training

If we used truthiness checking instead:

# This would be problematic elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node: trainer.accelerator_count = ml_policy.mpi.num_proc_per_node

Example Scenarios:

Scenario 1: CPU-Only Training (0 accelerators)

ml_policy.mpi.num_proc_per_node = 0 # Explicitly set to CPU-only # With truthiness check: if ml_policy.mpi and ml_policy.mpi.num_proc_per_node: # 0 is falsy! trainer.accelerator_count = ml_policy.mpi.num_proc_per_node # ❌ Never executes # With is not None check: if ml_policy.mpi and ml_policy.mpi.num_proc_per_node is not None: # 0 is not None! trainer.accelerator_count = ml_policy.mpi.num_proc_per_node # ✅ Executes correctly

Scenario 2: GPU Training (4 accelerators)

ml_policy.mpi.num_proc_per_node = 4 # Explicitly set to 4 GPUs # Both approaches work correctly: if ml_policy.mpi and ml_policy.mpi.num_proc_per_node: # 4 is truthy ✅ if ml_policy.mpi and ml_policy.mpi.num_proc_per_node is not None: # 4 is not None ✅

Scenario 3: Not Set (defaults to UNKNOWN)

ml_policy.mpi.num_proc_per_node = None # Not explicitly set # Both approaches work correctly: if ml_policy.mpi and ml_policy.mpi.num_proc_per_node: # None is falsy ✅ if ml_policy.mpi and ml_policy.mpi.num_proc_per_node is not None: # None is None ✅

The Key Distinction

The is not None check properly distinguishes between:

"Not set" (None) → don't override accelerator count

"Explicitly set to 0" (0) → override with 0 (CPU-only training)

"Explicitly set to positive number" → override with that number

But why num_proc_per_node=0 is a valid value ?
We should not allow user to set such value or consider this as None.

@jskswamy Did you get a chance to check this comment ?

Sorry for late reply! I've addressed this, kindly check the changes now

python/kubeflow/trainer/utils/utils.py

andreyvelich · 2025-06-27T23:45:36Z

python/kubeflow/trainer/utils/utils.py

    return None


+def _detect_trainer_from_image_patterns(image_name: str) -> Optional[types.Trainer]:


@tenzen-y @Electronic-Waste @astefanutti @jskswamy @eoinfennessy @franciscojavierarceo Do we see any concerns with regex approach ? It might be a good and simple method to start with, but I can imagine use cases where it wouldn't work. For example, users might have two DeepSpeed runtimes:

One uses torchrun

Another uses mpirun.

Perhaps in the future we can support such scenarios.

I agree, there are use cases where image name patterns alone wouldn't be sufficient.

Current Regex Approach — Pragmatic Starting Point

The regex approach implemented serves as a practical starting point that:

✅ Works immediately for the majority of common use cases

✅ Supports all official Kubeflow trainer images out of the box

✅ Provides sensible defaults without requiring users to specify trainer types manually

✅ Maintains backward compatibility with existing workflows

Future API-Based Enhancement

For advanced scenarios like your DeepSpeed example (torchrun vs mpirun variants), we can introduce explicit API controls that override the regex detection:

# Option 1: Explicit trainer specification trainer = Trainer( image="custom/deepspeed-runtime", trainer_type=TrainerType.DEEPSPEED_MPI, # Override regex detection # ... other configs ) # Option 2: Runtime configuration trainer = Trainer( image="custom/deepspeed-runtime", runtime_config=DeepSpeedConfig(launcher="mpirun"), # vs "torchrun" # ... other configs )

Approach

The regex approach handles ~90% of use cases elegantly, while keeping the door open for API-based precision when needed.

Question: Which approach would you prefer to proceed with?

Option A: Keep the current regex-based detection and enhance it incrementally with API overrides when needed

Option B: Move to a more explicit API-first approach where users specify trainer types directly

Option C: Hybrid approach where regex provides defaults, but API allows explicit overrides from day one

I'm happy to implement the changes that would be most valuable for users. What are your thoughts?

Let's keep the regex approach initially, we can have discussion how to cover more complex use-cases.
We just need to ensure we document that in our docs: https://www.kubeflow.org/docs/components/trainer/user-guides/

This commit introduces optional dependencies specifically for testing purposes. The `pytest` and `pytest-mock` packages are added to the `pyproject.toml` file under the `optional-dependencies` section, allowing developers to easily install testing tools when needed. Additionally, a new `pytest.ini` configuration section is created to standardize test settings, including options for verbosity and test discovery patterns. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

This commit introduces a new enumeration `TrainerFramework` to centralize the definitions of various trainer frameworks used in the Kubeflow SDK. The trainer configurations have been refactored into a dictionary `TRAINER_CONFIGS`, which maps each framework to its respective configuration, reducing duplication and improving maintainability. Additionally, the trainer detection logic has been enhanced to utilize image name patterns for identifying the appropriate trainer framework based on the container image name. This improves the robustness of trainer type detection and ensures backward compatibility with the existing `ALL_TRAINERS` mapping. - Added `TrainerFramework` enum for trainer framework constants. - Refactored trainer configurations into `TRAINER_CONFIGS`. - Enhanced trainer detection logic to support image name patterns. - Added unit tests for the new detection logic and configurations. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Updated the TrainerFramework Enum to a more generic Framework Enum to improve code maintainability and clarity. This change simplifies the trainer configurations and associated functions by using the new Framework Enum, ensuring consistent references throughout the codebase. - Replaced TrainerFramework with Framework in types.py - Updated references in utils.py to reflect the new Enum - Adjusted test cases in test_utils.py to accommodate changes Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Refactor the test cases in `test_utils.py` to adjust the expected output for edge cases where no matching framework is found. This change ensures that the tests handle cases where the image does not correspond to any known framework by returning `None` instead of a default framework. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Move test files from tests/ directory to be co-located with source files and split types-related tests into a separate file: - tests/test_utils.py → kubeflow/trainer/utils/utils_test.py - Extract types tests → kubeflow/trainer/types/types_test.py - Update pyproject.toml testpaths: ["tests"] → ["kubeflow"] - Remove tests/ directory This improves code organization by keeping tests next to the code they validate, making it easier to maintain test coverage when modifying source files. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Remove underscore prefixes from detect_trainer_from_image_patterns() and detect_trainer() to follow established codebase conventions. Analysis shows no other utility functions in the codebase use underscore prefixes. Functions renamed: - _detect_trainer_from_image_patterns → detect_trainer_from_image_patterns - _detect_trainer → detect_trainer Update all function calls and tests accordingly. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Remove generic 'torch' pattern matching and require explicit 'pytorch' in image names for better framework distinction. This prevents ambiguity between PyTorch and other torch-related libraries. - Remove regex pattern: r'^torch(?!tune)' - Keep only: r'pytorch' for PyTorch detection - Update test case: 'torch-custom:latest' → 'pytorch-torch-custom:latest' - Add test case: 'torch-custom:latest' now returns None This ensures clearer separation between PyTorch and TorchTune images. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Add detailed comments explaining why 'is not None' checks are necessary in ML policy processing: 1. For torch: prevents AttributeError when accessing None.actual_instance 2. For MPI: prevents setting accelerator_count to None 3. Semantically: only override when user explicitly provides values These checks prevent runtime errors and ensure correct behavior when ML policies have undefined num_proc_per_node values. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Eliminate ALL_TRAINERS and rely solely on regex pattern matching for trainer detection. This removes duplication between static mapping and TRAINER_CONFIGS while maintaining full functionality. - Remove ALL_TRAINERS from types.py - Simplify detect_trainer(): regex patterns → DEFAULT_TRAINER fallback - Update tests to verify official images work with regex patterns All official Kubeflow images correctly detected by regex, ensuring no breaking changes while reducing architectural complexity. The regex patterns now serve as the single source of truth. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

- Remove uv.lock file - Remove test dependencies from pyproject.toml - Remove pytest configuration from pyproject.toml - Keep only core trainer detection improvements and tests This ensures the PR focuses solely on trainer detection enhancements. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

andreyvelich · 2025-07-03T14:13:23Z

python/kubeflow/trainer/types/types.py

+TRAINER_CONFIGS: Dict[Framework, Trainer] = {
+    Framework.TORCH: Trainer(
        trainer_type=TrainerType.CUSTOM_TRAINER,
        framework=Framework.TORCH,


Do we really need to keep framework argument given that TRAINER_CONFIGS Dict has the Framework type in the Dict key.

Suggested change

framework=Framework.TORCH,

Regarding the framework field in the Trainer class, I'd like to share my thoughts on why this field exists and why it serves a legitimate purpose:

The framework Field Has Critical Importance

After investigating the codebase, I discovered that the Trainer class and framework field were pre-existing before this PR. The field was intentionally designed to serve specific purposes:

Critical Importance for API Design

The framework field is essential for maintaining a clean, self-contained API:

Object Identity: A Trainer object must "know" what framework it represents without external context

API Completeness: When users receive a Trainer object, they can immediately determine its framework without reverse-engineering from other fields

Serialization: The field is crucial for JSON serialization/deserialization of trainer objects

Debugging & Logging: Essential for meaningful error messages and debugging information

Self-Contained Data Structure

The framework field makes Trainer objects self-contained and self-documenting:

# Example: A Trainer object "knows" what framework it represents trainer = TRAINER_CONFIGS[Framework.DEEPSPEED] # Self-documenting: The object tells us what it is print(f"Using {trainer.framework} trainer with {trainer.trainer_type}") # Output: "Using Framework.DEEPSPEED trainer with TrainerType.CUSTOM_TRAINER" # Without the field, we'd need external context to know what framework this is # We'd have to track which dictionary key was used to create this trainer

Breaking Changes Would Be Required

Removing the field would require:

Modifying any code that relies on the field for framework identification

Potentially breaking API consumers who expect this field

Adding complex lookup logic to determine framework from other properties

Architectural Integrity

The field maintains the principle of encapsulation — Trainer object should contain all information about itself, including what framework it represents.

Why Dictionary Instead of Array?

The choice of using TRAINER_CONFIGS: Dict[Framework, Trainer] instead of an array of trainers was a performance and design optimization:

Performance Benefits

# Current efficient approach with dictionary trainer = TRAINER_CONFIGS[Framework.DEEPSPEED] # O(1) lookup framework = trainer.framework # Direct access # Alternative inefficient approach with array def find_trainer_by_framework(framework): for trainer in TRAINER_ARRAY: # O(n) search if trainer.framework == framework: return trainer

Design Benefits

Fast Lookup: O(1) constant time access instead of O(n) linear search

Type Safety: Dictionary keys ensure we only access valid frameworks

Explicit Mapping: Clear relationship between framework and trainer configuration

Extensibility: Easy to add new frameworks without changing lookup logic

My Take

The framework field serves critical architectural purposes for API design and object encapsulation. The dictionary structure provides performance benefits, but the field itself is essential for maintaining clean, self-contained objects.

Removing the field would break the original design intent, make the API less clean and efficient, and potentially introduce breaking changes. The field was intentionally designed this way for good reasons, and I believe we should keep it to maintain the integrity of the API design.

I agree that we should have dict to represent all Trainers where key is the Framework name and value is the Trainer object.
The question is should we also keep framework argument in the Trainer object. This is mostly used to just show users what framework this Trainer is using.

I am fine to keep it for now.

WDYT @szaher @astefanutti @Electronic-Waste ?

andreyvelich · 2025-07-03T14:19:59Z

python/kubeflow/trainer/utils/utils.py

+    trainer = detect_trainer_from_image_patterns(image_name)
+    if trainer:
+        return trainer
+
+    # 2. Fall back to DEFAULT_TRAINER
+    return copy.deepcopy(types.DEFAULT_TRAINER)


I think, this could be simplified if detect_trainer_from_image_patterns just return copy.deepcopy(types.DEFAULT_TRAINER) instead of None.

Can you just keep all of the required code to extract trainer in the get_trainer_from_image() function, which accepts image_name as input?

That will make our unit tests easier to maintain.

Made necessary changes to simplify the detect_trainer_from_image_patterns function

@jskswamy Sorry for the late reply, I meant can you just use this code snippet ?

def get_runtime_trainer(....): .... image_name = trainer_container.image.split(":")[0] trainer = get_trainer_from_image(image_name) def get_trainer_from_image(image_name: str) -> types.Trainer: """ Detect trainer type based on image name patterns using regex. This method uses pattern matching on the image name to determine the likely trainer type. Args: image_name: The container image name. Returns: Trainer: Trainer object if detected, otherwise the DEFAULT_TRAINER is returned. """ # DeepSpeed patterns if re.search(r"deepspeed", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.DEEPSPEED]) # MLX patterns if re.search(r"mlx", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.MLX]) # TorchTune patterns (check before PyTorch to avoid conflicts) if re.search(r"torchtune", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.TORCHTUNE]) # PyTorch patterns - require explicit "pytorch" in image name for clarity if re.search(r"pytorch", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.TORCH]) return copy.deepcopy(types.DEFAULT_TRAINER)

I've simplified the function as per your suggestion, kindly check

…sulation - Add optional default parameter to detect_trainer_from_image_patterns() - Handle copy.deepcopy() internally for better encapsulation - Remove boilerplate code from detect_trainer() function - Add comprehensive unit tests with proper separation of concerns - Maintain backward compatibility with existing behavior Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

andreyvelich

Sorry for the late review, I think we almost ready to merge this.
@szaher @kramaranya @briangallagher @eoinfennessy Can you take a look as well please ?

andreyvelich · 2025-07-17T20:16:43Z

python/kubeflow/trainer/utils/utils.py

+    trainer = detect_trainer_from_image_patterns(image_name)
+    if trainer:
+        return trainer
+
+    # 2. Fall back to DEFAULT_TRAINER
+    return copy.deepcopy(types.DEFAULT_TRAINER)


@jskswamy Sorry for the late reply, I meant can you just use this code snippet ?

def get_runtime_trainer(....): .... image_name = trainer_container.image.split(":")[0] trainer = get_trainer_from_image(image_name) def get_trainer_from_image(image_name: str) -> types.Trainer: """ Detect trainer type based on image name patterns using regex. This method uses pattern matching on the image name to determine the likely trainer type. Args: image_name: The container image name. Returns: Trainer: Trainer object if detected, otherwise the DEFAULT_TRAINER is returned. """ # DeepSpeed patterns if re.search(r"deepspeed", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.DEEPSPEED]) # MLX patterns if re.search(r"mlx", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.MLX]) # TorchTune patterns (check before PyTorch to avoid conflicts) if re.search(r"torchtune", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.TORCHTUNE]) # PyTorch patterns - require explicit "pytorch" in image name for clarity if re.search(r"pytorch", image_name, re.IGNORECASE): return copy.deepcopy(types.TRAINER_CONFIGS[types.Framework.TORCH]) return copy.deepcopy(types.DEFAULT_TRAINER)

andreyvelich · 2025-07-17T20:17:13Z

python/kubeflow/trainer/utils/utils.py

+    if ml_policy.torch and ml_policy.torch.num_proc_per_node is not None:
        num_proc = ml_policy.torch.num_proc_per_node.actual_instance
        if isinstance(num_proc, int):
            trainer.accelerator_count = num_proc
-    elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node:
+    elif ml_policy.mpi and ml_policy.mpi.num_proc_per_node is not None:


@jskswamy Did you get a chance to check this comment ?

Simplify trainer detection API by removing optional default parameter and always returning a Trainer object. The function now directly returns DEFAULT_TRAINER when no regex patterns match, eliminating the need for None handling in calling code. Changes: - Rename function to get_trainer_from_image for clarity - Remove optional default parameter from function signature - Always return types.Trainer instead of Optional[types.Trainer] - Update all test cases to expect DEFAULT_TRAINER for unknown images - Simplify detect_trainer() function logic Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

Changes: - For torch: check actual_instance value truthiness, not just object existence - For MPI: already correctly validates the direct value - Zero values (0) are now ignored (treated as None) - Negative values are trusted as explicit user input - Update test cases to reflect new behavior Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>

astefanutti

I may not have the full context, so please pardon my naive question, why those metadata, mostly the framework type, could not come from the training runtime itself, in the form of an annotation or a label?

For custom images, as a platform admin / user, I would understand I need to provide some hints.

andreyvelich · 2025-07-23T17:45:49Z

o please pardon my naive question, why those metadata, mostly the framework type, could not come from the training runtime itself, in the form of an annotation or a label?

Previously we talked with @tenzen-y and @Electronic-Waste about introducing labels to the runtime that define framework type: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1741266604716149?thread_ts=1741263570.091899&cid=C0742LDFZ4K

But we decided to not add more labels since this information can be retrieved from image and APIs.

astefanutti · 2025-07-23T18:00:04Z

But we decided to not add more labels since this information can be retrieved from image and APIs.

The implicit contract based on the image name might prove fragile for users and downstream projects.

I understand the regex-based heuristics could provide a nice last-resort as it stands, but making that contract explicit seems simpler and more robust. I still fail to see why it couldn't be enforced in the training runtime API, or at least the SDK would only fallback to the regex-based heuristics if that API contract is made optional.

andreyvelich · 2025-07-23T19:46:39Z

I understand the regex-based heuristics could provide a nice last-resort as it stands, but making that contract explicit seems simpler and more robust. I still fail to see why it couldn't be enforced in the training runtime API, or at least the SDK would only fallback to the regex-based heuristics if that API contract is made optional.

@astefanutti I think, we need to figure out why we even expose Runtime's trainer to the user.
Looking at the code it is only used to give user information about the Runtime:

sdk/python/kubeflow/trainer/types/types.py

Line 166 in fa5778b

class Trainer:

Information such as:

whether this runtime can be used with CustomTrainer or with BuiltinTrainer. This is important data since users can incorrectly use CustomTrainer with builtin Runtimes cc @Electronic-Waste
What ML framework users should use with the runtime (I think, this might be removed if we add get_runtime_packages() API like I showed in the KubeCon previously: https://youtu.be/Fnb1a5Kaxgo?t=556

Entrypoint that is used while getting the TrainJob steps:

sdk/python/kubeflow/trainer/utils/utils.py

Line 260 in fa5778b

trainjob_runtime.trainer.entrypoint

and while creating the TrainJob using CustomTrainer from function:

sdk/python/kubeflow/trainer/utils/utils.py

Lines 340 to 341 in fa5778b

    
           if runtime.trainer.entrypoint is None: 
        
               raise Exception(f"Runtime trainer must have an entrypoint: {runtime.trainer}")

Do we think that we can refactor some of this and remove the TRAINER_CONFIGS list ?
Thoughts @astefanutti @szaher @kramaranya @eoinfennessy ?

astefanutti · 2025-07-24T15:24:52Z

whether this runtime can be used with CustomTrainer or with BuiltinTrainer. This is important data since users can incorrectly use CustomTrainer with builtin Runtimes cc @Electronic-Waste

Right, that equally applies to the BuiltinTrainer.config. For now there is only one TorchTuneConfig, but nothing guarantees it's compatible with the training runtime when there'll be more.

Do we think that we can refactor some of this and remove the TRAINER_CONFIGS list ?

For built-in trainers, it seems there is a tight coupling between the trainer and the runtime, so maybe folding things into runtime as the "source-of-truth" would be better.

andreyvelich · 2025-07-24T16:27:23Z

For built-in trainers, it seems there is a tight coupling between the trainer and the runtime, so maybe folding things into runtime as the "source-of-truth" would be better.

So do you mean that Runtime should tell users whether it is meant for CustomTrainer or BuiltinTrainer ?

astefanutti · 2025-07-24T16:46:44Z

For built-in trainers, it seems there is a tight coupling between the trainer and the runtime, so maybe folding things into runtime as the "source-of-truth" would be better.

So do you mean that Runtime should tell users whether it is meant for CustomTrainer or BuiltinTrainer ?

Yes, one way or another. How a runtime is supposed to be used in the SDK is logically defined by the runtime, that includes the type of trainer (built-in, custom) and the framework (PyTorch, JAX, TorchTune, ...).

andreyvelich · 2025-07-24T16:58:03Z

@astefanutti Do you think that framework information is still useful for SDK users if they can always run get_runtime_packages() API ?

astefanutti · 2025-07-24T17:58:23Z

@astefanutti Do you think that framework information is still useful for SDK users if they can always run get_runtime_packages() API ?

No, though it'd be needed for checking the typed configuration passed by users for built-in trainers is compatible with the training runtime?

andreyvelich · 2025-07-24T20:04:27Z

No, though it'd be needed for checking the typed configuration passed by users for built-in trainers is compatible with the training runtime?

This is correct, additionally we can't run the get_runtime_packages() API for BuiltinTrainer runtimes since by default it contains script for fine-tuning: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/torchtune/llama3_2/llama3_2_3B.yaml#L71.
We've done, so users can simple run this to fine-tune LLM:

client.train(
  runtime=Runtime(name="torchtune-llama3.2-3b")
)

Also, I don't think that users needs to know about installed packages in such runtimes, since they can only modify the config (e.g. fine-tuning parameters), but not the runtime packages.

Maybe for BuiltinTrainer runtime we should have two labels:

trainer.kubeflow.org/trainer-type: builtin
trainer.kubeflow.org/builtin-config: torchtune

If we don't want to introduce 2nd label, we can just tell users to rely on runtime name.

Thoughts @tenzen-y @astefanutti @Electronic-Waste @rudeigerc @szaher @kramaranya ?

andreyvelich · 2025-07-24T20:10:35Z

I think, we should refactor our Runtime class: https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/types/types.py#L176-L179

astefanutti · 2025-07-25T08:11:20Z

Maybe for BuiltinTrainer runtime we should have two labels:

trainer.kubeflow.org/trainer-type: builtin

trainer.kubeflow.org/builtin-config: torchtune

Yes, labels seem the most straightforward approach. There is already the trainer.kubeflow.org/accelerator label.
I wonder whether that'd make sense to go as far as to enforce those labels during training runtime admission?

eoinfennessy · 2025-07-25T11:25:36Z

Agreed that it would be better to add APIs to TrainingRuntime and ClusterTrainingRuntime to specify the framework instead of relying on image names and regex checks. Adding trainer type would also be useful.

But why use labels instead of adding framework and other fields to the runtime spec? This would allow us to use schema-based validation to ensure a valid framework is provided.

One idea for this that would use cross-field validation to ensure one and only one of customTrainerConfig or builtinTrainerConfig is provided. Fields framework and type could use enum-based validation:

spec:
  customTrainerConfig:
    framework: "torch"
  # OR
  builtinTrainerConfig:
    type: "torchtune"

Probably best to consider the exact APIs alongside work on kubeflow/trainer#2752.

astefanutti · 2025-07-25T11:32:36Z

But why use labels instead of adding framework and other fields to the runtime spec? This would allow us to use schema-based validation to ensure a valid framework is provided.

@eoinfennessy I agree with you it's a possible alternative. Labels are flexible and enable listing runtimes by label selectors, but we could conceptually consider these metadata as part of the spec.

eoinfennessy · 2025-07-25T11:44:21Z

Labels are flexible and enable listing runtimes by label selectors

Ah, I hadn't considered that. Yes, that could help improve the UX of the list_runtimes method by filtering results. e.g:

client.list_runtimes(trainerType="custom", framework="torch")

andreyvelich · 2025-07-25T15:07:20Z

There is already the trainer.kubeflow.org/accelerator label.

I am not sure if we should continue to maintain this label. IIRC, @tenzen-y has concerns introducing this label in the runtimes.

Labels are flexible and enable listing runtimes by label selectors,

We can also use field selector, if we introduce a new API in the runtime.

What are the pros and cons to add this property under labels or APIs ?

astefanutti · 2025-07-25T15:24:12Z

There is already the trainer.kubeflow.org/accelerator label.

I am not sure if we should continue to maintain this label. IIRC, @tenzen-y has concerns introducing this label in the runtimes.>

You're right, it may not be a good example "semantically".

Labels are flexible and enable listing runtimes by label selectors,

We can also use field selector, if we introduce a new API in the runtime.

I'm not sure custom fields from CRDs are indexed. It might be only few fields from core APIs.

What are the pros and cons to add this property under labels or APIs ?

I would say labels are more "free-form" and not as strictly part of the API contract compared to fields.

andreyvelich · 2025-07-27T22:48:19Z

I'm not sure custom fields from CRDs are indexed. It might be only few fields from core APIs.

You are right @astefanutti, here is the list of supported fields: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/#list-of-supported-fields

You're right, it may not be a good example "semantically".

Let me remove this label for now, unless we design better way to explain users the accelerator types in the Runtime.

@Electronic-Waste @astefanutti @tenzen-y Any concerns to introduce these three labels to our runtimes for now ?

trainer.kubeflow.org/trainer-type: custom
or
trainer.kubeflow.org/trainer-type: builtin
trainer.kubeflow.org/builtin-config: torchtune

Alternatively, we can introduce framework label to the runtimes, but not sure if that is really needed, since users only need to know builtin config type to use while creating TrainJob using BuiltinTrainer.

trainer.kubeflow.org/trainer-type: custom
trainer.kubeflow.org/framework: torch
---
trainer.kubeflow.org/trainer-type: custom
trainer.kubeflow.org/framework: deepspeed
---
trainer.kubeflow.org/trainer-type: builtin
trainer.kubeflow.org/framework: torchtune

astefanutti · 2025-07-28T07:50:52Z

@Electronic-Waste @astefanutti @tenzen-y Any concerns to introduce these three labels to our runtimes for now ?
trainer.kubeflow.org/trainer-type: custom
or
trainer.kubeflow.org/trainer-type: builtin
trainer.kubeflow.org/builtin-config: torchtune

I think that's a good start. I only wonder whether those should be within the sdk.kubeflow.org prefix since it's metadata meant for the SDK.

Alternatively, we can introduce framework label to the runtimes, but not sure if that is really needed, since users only need to know builtin config type to use while creating TrainJob using BuiltinTrainer.
trainer.kubeflow.org/trainer-type: custom
trainer.kubeflow.org/framework: torch
---
trainer.kubeflow.org/trainer-type: custom
trainer.kubeflow.org/framework: deepspeed
---
trainer.kubeflow.org/trainer-type: builtin
trainer.kubeflow.org/framework: torchtune

Actually framework might be a different kind of metadata than those meant to be hints for the train API in the SDK.
That one would might be a simpler way to display the runtime framework rather than relying on the runtime name.

andreyvelich · 2025-07-28T11:03:56Z

I think that's a good start. I only wonder whether those should be within the sdk.kubeflow.org prefix since it's metadata meant for the SDK.

I think, if we keep the sdk.kubeflow.org, we need to ensure that we don't require to use the same label for other Kubeflow projects that we want to integrate into SDK.

Actually framework might be a different kind of metadata than those meant to be hints for the train API in the SDK.

@astefanutti If we establish the contract that builtin configs contain framework name in the DataClass name, the trainer.kubeflow.org/framework: torchtune is sufficient.

astefanutti · 2025-07-28T12:18:17Z

@astefanutti If we establish the contract that builtin configs contain framework name in the DataClass name, the trainer.kubeflow.org/framework: torchtune is sufficient.

That would be good yes. Having a mapping between framework name and DataClass name in the SDK would be perfectly acceptable I think.

google-oss-prow bot requested review from Electronic-Waste, andreyvelich and tenzen-y June 18, 2025 12:20

google-oss-prow bot added the size/XL label Jun 18, 2025

eoinfennessy reviewed Jun 20, 2025

View reviewed changes

jskswamy force-pushed the fix/trainer-detection-custom-images branch from ed91f02 to e16b9c6 Compare June 23, 2025 09:15

eoinfennessy approved these changes Jun 23, 2025

View reviewed changes

andreyvelich reviewed Jun 27, 2025

View reviewed changes

jskswamy mentioned this pull request Jun 30, 2025

Add optional dependencies for testing #35

Closed

jskswamy force-pushed the fix/trainer-detection-custom-images branch from ec4a1c2 to e7f4425 Compare June 30, 2025 12:19

jskswamy added 10 commits June 30, 2025 18:17

jskswamy force-pushed the fix/trainer-detection-custom-images branch from e7f4425 to b3aed48 Compare June 30, 2025 12:47

andreyvelich reviewed Jul 3, 2025

View reviewed changes

jskswamy force-pushed the fix/trainer-detection-custom-images branch from 7f6fa23 to d3c0043 Compare July 4, 2025 06:55

andreyvelich reviewed Jul 17, 2025

View reviewed changes

jskswamy added 2 commits July 23, 2025 14:40

astefanutti reviewed Jul 23, 2025

View reviewed changes

andreyvelich mentioned this pull request Jul 27, 2025

chore(trainer): Remove accelerator label from the runtimes #51

Merged

This was referenced Jul 30, 2025

feat(runtimes): Add Framework Label to the Runtimes kubeflow/trainer#2761

Merged

feat(trainer): Support Framework Labels in Runtimes #56

Merged

google-oss-prow bot closed this in #56 Aug 4, 2025

		return None


		def _detect_trainer_from_image_patterns(image_name: str) -> Optional[types.Trainer]:

Fix trainer detection for custom Docker images with regex pattern matching #31

Fix trainer detection for custom Docker images with regex pattern matching #31

Uh oh!

Conversation

jskswamy commented Jun 18, 2025

Problem

Solution

Key Changes

Testing

Uh oh!

google-oss-prow bot commented Jun 18, 2025

Uh oh!

eoinfennessy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jskswamy commented Jun 23, 2025

Uh oh!

eoinfennessy left a comment

Choose a reason for hiding this comment

Uh oh!

google-oss-prow bot commented Jun 23, 2025

Uh oh!

andreyvelich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

1. Torch Policy Check (if trainer_container.accelerator_count is not None)

2. MPI Policy Check (if trainer_container.mpi_policy is not None)

3. Semantic Correctness

Code Comments Added:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

The is not None Check is Necessary

Why Truthiness Checking Would Break CPU-Only Training

Example Scenarios:

The Key Distinction

Uh oh!

andreyvelich Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Current Regex Approach — Pragmatic Starting Point

Future API-Based Enhancement

Approach

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jskswamy Jul 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

The framework Field Has Critical Importance

Critical Importance for API Design

Self-Contained Data Structure

Breaking Changes Would Be Required

Architectural Integrity

Why Dictionary Instead of Array?

Performance Benefits

Design Benefits

1. Torch Policy Check (`if trainer_container.accelerator_count is not None`)

2. MPI Policy Check (`if trainer_container.mpi_policy is not None`)

The `is not None` Check is Necessary

andreyvelich Jul 7, 2025 •

edited

Loading

jskswamy Jul 4, 2025 •

edited

Loading

The `framework` Field Has Critical Importance

astefanutti left a comment •

edited

Loading

andreyvelich commented Jul 23, 2025 •

edited

Loading

andreyvelich commented Jul 24, 2025 •

edited

Loading

andreyvelich commented Jul 24, 2025 •

edited

Loading