Skip to content

fix: spot interruption detection #2391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

saikonen
Copy link
Collaborator

@saikonen saikonen commented Apr 25, 2025

fix for sending SIGTERM to the main process when a spot interruption is detected.

os.ppid() is not reliable inside a Sidecar due to nesting of subprocesses, so we need to instead pass the main process id via some other means. This PR does it with an environment variable.

Also noteworthy is that because tasks running on Kubernetes have termination_grace_period_seconds=0, this makes graceful handling of SIGTERM impossible as the pod goes down immediately upon receiving it.

Flow for testing with AWS FIS:

import signal
from time import sleep
from metaflow import step, FlowSpec, retry, current, kubernetes, batch

class RetryFlow(FlowSpec):

    @kubernetes
    # @batch
    @retry(times=3)
    @step
    def start(self):
        print("Starting 👋")
        
        def _debug_signal(*args, **kwargs):
            print("SIGTERM RECEIVED")
        
        signal.signal(signal.SIGTERM, _debug_signal)
        
        print("sleeping for 300")
        sleep(300)
        if current.retry_count<1:
            raise Exception("You shall not pass (to the next step!)")
        
        self.next(self.end)

    @step
    def end(self):
        print("Done! 🏁")

if __name__ == "__main__":
    RetryFlow()

@nflx-mf-bot
Copy link
Collaborator

Netflix internal testing[1084] @ fd872fd

@nflx-mf-bot
Copy link
Collaborator

Testing for OSS PR [1084] @ commit fd872fd had 10 FAILUREs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants