This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Commit c4a815d

Add a drain period to server shutdown (#591)
When a baseplate-serve process receives a SIGTERM, it immediately puts the server into a graceful shutdown state. New requests are no longer accepted, and the process continues handling in-flight requests until they're all complete or stop_timeout seconds have elapsed.

Kubernetes won't mark the pod as TERMINATING until the server has completed its graceful shutdown, or has spent so long shutting down that it fails to respond to a liveness healthcheck and is forcibly shut down. During this intervening time, new requests will still be routed to the pod, but the server will not be listening for them, so they get dropped on the floor.

To prevent this, we add a drain_time period that happens before the graceful shutdown is kicked off. If configured, a SIGTERM will cause baseplate-serve to set a global flag indicating that shutdown has begun and then wait the specified time before beginning the actual graceful shutdown. This gives the application a chance to deliberately fail READINESS healthchecks during that grace period and get taken out of rotation, so that once it starts graceful shutdown it should not be getting any new requests.
1 parent 10d5082 commit c4a815d
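
The sequence described in the commit message can be sketched as a small standalone simulation. This is only an illustration under stated assumptions, not the code from the commit: `FakeServer` and `serve_until_shutdown` are hypothetical names, the `threading.Event` stands in for whatever the SIGTERM handler sets, and `drain_seconds` plays the role of the configured `drain_time`.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class ServerState:
    # Flag that application code can read while the server is draining.
    shutting_down: bool = False


class FakeServer:
    """Hypothetical stand-in for a real server; it only records calls."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


def serve_until_shutdown(server, state, shutdown_event, drain_seconds=None):
    """Run until shutdown is requested, then drain, then stop the server."""
    server.start()
    try:
        # Block until something (for example a signal handler) sets the event.
        shutdown_event.wait()

        # Announce shutdown first so readiness checks can start failing.
        state.shutting_down = True

        # Keep accepting requests while the pod is taken out of rotation.
        if drain_seconds:
            time.sleep(drain_seconds)
    finally:
        # Only now stop accepting connections and finish in-flight work.
        server.stop()


if __name__ == "__main__":
    event = threading.Event()
    # Simulate a SIGTERM arriving shortly after startup.
    threading.Timer(0.1, event.set).start()
    demo_state = ServerState()
    demo_server = FakeServer()
    serve_until_shutdown(demo_server, demo_state, event, drain_seconds=0.2)
    print(demo_server.events, demo_state.shutting_down)
```

The ordering is the point of the sketch: the flag flips before the drain sleep, and the server stops only after it.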

2 files changed: +60 -2 lines changed

baseplate/server/__init__.py

Lines changed: 24 additions & 2 deletions
@@ -15,9 +15,11 @@
 import socket
 import sys
 import threading
+import time
 import traceback
 import warnings

+from dataclasses import dataclass
 from types import FrameType
 from typing import Any
 from typing import Callable
@@ -34,6 +36,9 @@
 from baseplate import Baseplate
 from baseplate.lib.config import Endpoint
 from baseplate.lib.config import EndpointConfiguration
+from baseplate.lib.config import Optional as OptionalConfig
+from baseplate.lib.config import parse_config
+from baseplate.lib.config import Timespan
 from baseplate.lib.log_formatter import CustomJsonFormatter
 from baseplate.server import einhorn
 from baseplate.server import reloader
@@ -42,6 +47,14 @@
 logger = logging.getLogger(__name__)


+@dataclass
+class ServerState:
+    shutting_down: bool = False
+
+
+SERVER_STATE = ServerState()
+
+
 def parse_args(args: Sequence[str]) -> argparse.Namespace:
     parser = argparse.ArgumentParser(
         description=sys.modules[__name__].__doc__,
@@ -237,6 +250,8 @@ def load_app_and_run_server() -> None:
     listener = make_listener(args.bind)
     server = make_server(config.server, listener, app)

+    cfg = parse_config(config.server, {"drain_time": OptionalConfig(Timespan)})
+
     if einhorn.is_worker():
         einhorn.ack_startup()

@@ -246,13 +261,20 @@ def load_app_and_run_server() -> None:
     # clean up leftovers from initialization before we get into requests
     gc.collect()

-    logger.info("Listening on %s, PID:%s", listener.getsockname(), os.getpid())
+    logger.info("Listening on %s", listener.getsockname())
     server.start()
     try:
         shutdown_event.wait()
-        logger.info("Finally stopping server, PID:%s", os.getpid())
+
+        SERVER_STATE.shutting_down = True
+
+        if cfg.drain_time:
+            logger.debug("Draining inbound requests...")
+            time.sleep(cfg.drain_time.total_seconds())
     finally:
+        logger.debug("Gracefully shutting down...")
         server.stop()
+        logger.info("Exiting")


 def load_and_run_script() -> None:

docs/cli/serve.rst

Lines changed: 36 additions & 0 deletions
@@ -126,6 +126,42 @@ An example command line::

 .. _Stripe's Einhorn socket manager: https://github.com/stripe/einhorn

+Graceful shutdown
+-----------------
+
+The flow of graceful shutdown while handling live traffic looks like this:
+
+* The server receives a ``SIGTERM`` from the infrastructure.
+* The server sets ``baseplate.server.SERVER_STATE.shutting_down`` to ``True``.
+* If the ``drain_time`` setting is set in the server configuration, the server
+  will wait the specified amount of time before continuing to the next step.
+  This gives your application a chance to use the ``shutting_down`` flag in
+  healthcheck responses.
+* The server begins graceful shutdown. No new connections will be accepted. The
+  server will continue processing the existing in-flight requests until they
+  are all done or ``stop_timeout`` time has elapsed.
+* The server exits and lets the infrastructure clean up.
+
+During the period between receiving the ``SIGTERM`` and the server exiting, the
+application may still be routed new requests. To ensure requests aren't lost
+during the graceful shutdown (where they won't be listened for) your
+application should set an appropriate ``drain_time`` and use the
+``SERVER_STATE.shutting_down`` flag to fail ``READINESS`` healthchecks.
+
+For example:
+
+.. code-block:: py
+
+    def is_healthy(self, context, healthcheck_request):
+        if healthcheck_request.probe == IsHealthyProbe.LIVENESS:
+            return True
+        elif healthcheck_request.probe == IsHealthyProbe.READINESS:
+            if SERVER_STATE.shutting_down:
+                return False
+            return True
+        return True
+
+
 Debug Signal
 ------------

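
The healthcheck example in the docs diff relies on names it does not import, so the following self-contained sketch shows the same readiness logic using local stand-ins. Everything defined here is an assumption made for illustration: the `IsHealthyProbe` enum (with arbitrary values), `HealthcheckRequest`, and `ServerState` are simplified local copies rather than the Baseplate types. In a real service the flag would be the `baseplate.server.SERVER_STATE` object named in the docs.

```python
import enum
from dataclasses import dataclass


class IsHealthyProbe(enum.Enum):
    """Local stand-in for the probe enum; the values are arbitrary."""

    READINESS = "readiness"
    LIVENESS = "liveness"


@dataclass
class HealthcheckRequest:
    """Hypothetical minimal request carrying only the probe type."""

    probe: IsHealthyProbe


@dataclass
class ServerState:
    """Local stand-in for the module-level state object from the commit."""

    shutting_down: bool = False


SERVER_STATE = ServerState()


def is_healthy(healthcheck_request, state=SERVER_STATE):
    """Fail readiness while draining; keep liveness passing throughout."""
    if healthcheck_request.probe == IsHealthyProbe.READINESS:
        # Failing readiness asks the infrastructure to stop routing traffic.
        return not state.shutting_down
    # Liveness (and any other probe) stays healthy so the process is not
    # killed before it has finished draining.
    return True
```

Liveness keeps returning `True` during the drain period on purpose: failing it could cause the process to be killed before in-flight requests finish.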
