Open
Description
Problem Description
Multiple Tekton PipelineRun specifications in the repository are missing global timeout configuration, which can cause long-running builds to monopolize cluster resources indefinitely if they stall due to network issues, registry outages, or other failures.
Impact Analysis
Resource Monopolization Risk:
- PipelineRuns without timeouts can run indefinitely if builds stall
- Network connectivity issues or registry outages can cause builds to hang
- Cluster resources (CPU, memory, storage) remain allocated until manual intervention
- Other pipeline runs may be delayed or fail due to resource exhaustion
Inconsistent Configuration:
- Some pipelines have 8-hour timeouts while others have no timeout limits
- Creates operational inconsistency and unpredictable behavior
- Makes troubleshooting and capacity planning more difficult
Affected Files
Files Missing Timeouts (18 total):
.tekton/odh-pipeline-runtime-datascience-cpu-py311-ubi9-push.yaml
.tekton/odh-pipeline-runtime-datascience-cpu-py312-ubi9-push.yaml
.tekton/odh-pipeline-runtime-minimal-cpu-py311-ubi9-push.yaml
.tekton/odh-pipeline-runtime-minimal-cpu-py312-ubi9-push.yaml
.tekton/odh-workbench-codeserver-datascience-cpu-py311-ubi9-push.yaml
.tekton/odh-workbench-codeserver-datascience-cpu-py312-ubi9-push.yaml
.tekton/odh-workbench-jupyter-datascience-cpu-py311-ubi9-push.yaml
.tekton/odh-workbench-jupyter-datascience-cpu-py312-ubi9-push.yaml
.tekton/odh-workbench-jupyter-minimal-cpu-py311-ubi9-push.yaml
.tekton/odh-workbench-jupyter-minimal-cpu-py312-ubi9-push.yaml
.tekton/odh-workbench-jupyter-minimal-cuda-py311-ubi9-push.yaml
.tekton/odh-workbench-jupyter-minimal-cuda-py312-ubi9-push.yaml
.tekton/odh-workbench-jupyter-pytorch-cuda-py311-ubi9-push.yaml
.tekton/odh-workbench-jupyter-pytorch-cuda-py312-ubi9-push.yaml
.tekton/odh-workbench-jupyter-pytorch-rocm-py311-ubi9-push.yaml
.tekton/odh-workbench-jupyter-pytorch-rocm-py312-ubi9-push.yaml
.tekton/odh-workbench-jupyter-tensorflow-cuda-py311-ubi9-push.yaml
.tekton/odh-workbench-jupyter-tensorflow-cuda-py312-ubi9-push.yaml
Files With Correct Timeouts (14 total):
All remaining PipelineRun files already have the standard 8-hour timeout configuration.
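The split between the two file lists above could be reproduced with a small audit script. The following is a minimal sketch, not the method used to produce the lists: it assumes two-space-style YAML indentation, a `spec:` key at column zero, and push pipelines matching the glob `*-push.yaml` under `.tekton/`. The function names are hypothetical.

```python
from pathlib import Path


def has_pipeline_timeout(text: str) -> bool:
    """Return True if the top-level `spec:` block has a direct `timeouts:` child."""
    in_spec = False
    child_indent = None  # indentation of the direct children of `spec:`
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # Any top-level key either opens or closes the spec block.
            in_spec = stripped == "spec:"
            child_indent = None
            continue
        if in_spec:
            if child_indent is None:
                child_indent = indent  # first child fixes the nesting level
            if indent == child_indent and stripped.startswith("timeouts:"):
                return True
    return False


def files_missing_timeout(directory: str = ".tekton") -> list[str]:
    """List push PipelineRun files (assumed glob) that lack a timeouts block."""
    return sorted(
        str(path)
        for path in Path(directory).glob("*-push.yaml")
        if not has_pipeline_timeout(path.read_text(encoding="utf-8"))
    )


if __name__ == "__main__":
    for name in files_missing_timeout():
        print(name)
```

Because the check is text-based rather than a full YAML parse, it only looks at direct children of `spec:` and ignores per-task `timeout:` fields nested deeper in the file.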
Solution
Add a timeouts block to the spec section of each affected PipelineRun, following the established pattern used in other pipelines:
spec:
  timeouts:
    pipeline: 8h
  params:
    # ... existing parameters
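With 18 files to change, the edit could be applied mechanically. The sketch below is one possible approach under stated assumptions: `spec:` sits at column zero, children of `spec:` are indented by two spaces, and the helper name `add_pipeline_timeout` is hypothetical. It works on text rather than parsed YAML so that comments and key order in the existing files are preserved.

```python
def add_pipeline_timeout(text: str, duration: str = "8h", indent: str = "  ") -> str:
    """Insert a `timeouts` block right after the top-level `spec:` line.

    Returns the text unchanged if `spec:` already has a `timeouts:` child.
    """
    lines = text.splitlines(keepends=True)
    try:
        spec_idx = next(i for i, l in enumerate(lines) if l.rstrip() == "spec:")
    except StopIteration:
        raise ValueError("no top-level 'spec:' key found") from None

    # Scan the spec block; stop at the next top-level key or document marker.
    for line in lines[spec_idx + 1:]:
        if line.strip() and not line.startswith((" ", "#")):
            break
        if line.startswith(indent + "timeouts:"):
            return text  # already configured, keep the file as-is

    if not lines[spec_idx].endswith("\n"):
        lines[spec_idx] += "\n"  # guard against a file ending at `spec:`
    block = [f"{indent}timeouts:\n", f"{indent * 2}pipeline: {duration}\n"]
    return "".join(lines[: spec_idx + 1] + block + lines[spec_idx + 1:])
```

Running the function twice on the same content yields the same result, so it is safe to apply to the whole `.tekton/` directory, including files that already carry the block.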
Acceptance Criteria
- All 18 affected PipelineRun files have timeout configuration added
- Timeout duration is set to 8 hours (pipeline: 8h) to align with existing patterns
- Timeout block is placed immediately after spec: and before params: for consistency
- No functional changes to pipeline behavior other than timeout enforcement
- All affected pipelines can still complete successfully within the 8-hour limit
- Documentation is updated if necessary to reflect timeout policies
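The placement and duration criteria above lend themselves to an automated check. The following is a rough sketch only: it assumes the timeouts block contains nothing but the `pipeline` key, that the value is written unquoted, and that `params:` is the next key after the block. The function name is hypothetical.

```python
def timeout_block_problems(text: str, expected: str = "8h") -> list[str]:
    """Return human-readable problems; an empty list means the criteria hold."""
    # Keep only meaningful lines (drop blanks and comments).
    lines = [
        line.rstrip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if "spec:" not in lines:
        return ["no top-level 'spec:' key"]

    start = lines.index("spec:")
    following = [line.strip() for line in lines[start + 1 : start + 4]]

    if not following or following[0] != "timeouts:":
        return ["'timeouts:' does not immediately follow 'spec:'"]

    problems = []
    if len(following) < 2 or following[1] != f"pipeline: {expected}":
        problems.append(f"expected 'pipeline: {expected}' inside 'timeouts:'")
    if len(following) < 3 or following[2] != "params:":
        problems.append("'params:' does not directly follow the timeouts block")
    return problems
```

A check like this could run in CI against every file in `.tekton/`, failing the job when any file returns a non-empty list.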
Implementation Guidance
- Consistent Placement: Add the timeout block immediately after spec: and before params:
- Standard Duration: Use pipeline: 8h to match existing timeout configurations
- Batch Processing: Consider grouping changes by pipeline type (runtime vs workbench) for easier review
- Testing: Verify that normal builds still complete within the timeout limit
- Monitoring: Consider adding alerting for pipelines approaching the timeout limit
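Two of the points above, batching by pipeline type and alerting near the limit, can be illustrated with small helpers. This is a sketch under assumptions: the file-name prefixes are taken from the affected-files list, the 80% alert threshold is an arbitrary illustrative value, and the duration parser handles only simple `h`/`m`/`s` forms such as `8h` or `1h30m`. All function names are hypothetical.

```python
import re
from collections import defaultdict

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(value: str) -> int:
    """Parse a simple duration such as '8h' or '1h30m' into seconds."""
    parts = re.findall(r"(\d+)([hms])", value)
    # Reject input with anything besides number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def near_timeout(elapsed_seconds: float, timeout: str = "8h",
                 threshold: float = 0.8) -> bool:
    """True once a run has used at least `threshold` of its timeout budget."""
    return elapsed_seconds >= threshold * parse_duration(timeout)


def group_by_type(paths: list[str]) -> dict[str, list[str]]:
    """Split file paths into 'runtime', 'workbench', and 'other' batches."""
    groups = defaultdict(list)
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if name.startswith("odh-pipeline-runtime-"):
            groups["runtime"].append(path)
        elif name.startswith("odh-workbench-"):
            groups["workbench"].append(path)
        else:
            groups["other"].append(path)
    return dict(groups)
```

Grouping the 18 affected files this way would give one small review batch for the runtime pipelines and one larger batch for the workbench pipelines.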
Context
- Triggered by: A review comment on PR #1379 (RHOAIENG-28512: add py312 .tekton push pipelines) that identified a missing timeout in a new py312 pipeline
- Review comment: RHOAIENG-28512: add py312 .tekton push pipelines #1379 (comment)
- Pattern: This issue affects both newly added Python 3.12 pipelines and existing Python 3.11 pipelines
- Scope: Repository-wide consistency and resource management improvement
Benefits
- Resource Protection: Prevents indefinite resource allocation from stalled builds
- Operational Consistency: Standardizes timeout behavior across all pipelines
- Predictable Behavior: Makes pipeline execution time limits explicit and manageable
- Improved Troubleshooting: Stalled builds time out rather than hanging indefinitely
- Capacity Planning: Enables better cluster resource planning and scheduling
Metadata
Status
📋 Backlog