Skip to content

Conversation

cf-sewe
Copy link
Contributor

@cf-sewe cf-sewe commented Oct 13, 2025

Problem

After PR #208 fixed container exit code validation, a new issue emerged: when container failures are detected and an exception is thrown, pod deletion was skipped, leaving pods running indefinitely despite delete: true configuration.

This is a resource leak issue - failed pods accumulate in the cluster and are never cleaned up, even though the task correctly reports the failure.

Solution

Main fix:

  • Wrap the completion logic in a try-finally block to ensure pod deletion always runs, even when exceptions occur

Code improvements:

  • Refactor handleEnd() to accept hasOutputFiles parameter for clearer separation of concerns
  • Move failure checking logic into handleEnd() to centralize log collection and failure validation
  • Add Logger parameter to PodService.checkContainerFailures() to ensure errors are logged before exceptions are thrown
  • Ensure consistent behavior: wait for logs (30s), check for failures, then cleanup

Testing

Enhanced and added comprehensive test cases:

Enhanced existing test:

  • failedWithOutputFiles() - Now verifies logs are collected AND pods are deleted

New test cases:

  • failedWithOutputFilesDeletesPod() - Explicitly tests pod deletion with async monitoring
  • successWithOutputFiles() - Verifies success path with outputFiles
  • multipleContainersOneFailsWithOutputFiles() - Tests multi-container failure detection with log and deletion verification
  • logCollectionTimingWithOutputFiles() - Verifies waitForLogInterval is respected before failure

All failure tests now verify:

  1. Container failures are detected and reported correctly
  2. Logs are collected before cleanup (verifies handleEnd() timing)
  3. Pods are properly deleted after failure (uses Await.until for async verification)
  4. Error details are logged before exceptions are thrown

Note

Test execution time: ~8-9 minutes for full PodCreateTest suite. Duration slightly increased due to additional deletion verification and reliable log collection timing.

cf-sewe and others added 3 commits October 13, 2025 11:36
When checkContainerFailures throws an exception after detecting a
failed container, the delete() call was being skipped, leaving pods
running with only the sidecar container active.

Wrapped the completion check and output handling in a try-finally
block to ensure delete() is always called, even when the task fails.

Added test failedWithOutputFilesDeletesPod() to verify pods are
properly deleted despite container failures.

Fixes pod cleanup issue reported by Maxime Yvart in PR kestra-io#208.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ests

Refactored the handleEnd() method to better separate concerns between
pods with and without outputFiles, and enhanced error logging.

Changes:
- Move failure checking logic into handleEnd() with hasOutputFiles parameter
- Add logger parameter to PodService.checkContainerFailures() for better error visibility
- Simplify main flow by removing duplicate error logging code

Test Improvements:
- Add log collection verification to failedWithOutputFiles test
- Add pod deletion verification to failure tests (ensures cleanup works)
- Add successWithOutputFiles test for success case coverage
- Add multipleContainersOneFailsWithOutputFiles test (multi-container scenario)
- Add logCollectionTimingWithOutputFiles test (verifies waitForLogInterval behavior)

All failure tests now verify both:
1. Logs are collected before cleanup
2. Pods are properly deleted after failure (using Await.until)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Improvements to test coverage:
- Add deletion verification to failed() test to ensure pod cleanup works
  in the non-outputFiles failure path (complements existing
  failedWithOutputFilesDeletesPod test)
- Fix multipleContainersOneFailsWithOutputFiles() by increasing
  waitForLogInterval from 5s to 30s to reliably capture container logs
  before pod deletion (matches pattern used in other log verification tests)

Both failure paths now have deletion verification:
- Without outputFiles: failed() test
- With outputFiles: failedWithOutputFilesDeletesPod() test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@loicmathieu
Copy link
Member

@fdelbrayelle if this is accepted, the same may be done inside the Kubernetes task runner

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: To review

Development

Successfully merging this pull request may close these issues.

3 participants