Skip to content

Commit 8441179

Browse files
Sam Luryefacebook-github-bot
authored andcommitted
Big debugging update (#456)
Summary: Pull Request resolved: #456 This diff contains several updates to the monarch actor mesh debugging experience. - Improved debug input parsing, with the new `cast` command supporting more sophisticated rank selection grammar: - `cast ranks(3) pdb_command`: send `pdb_command` to rank 3. - `cast ranks(1,3,5) pdb_command`: send `pdb_command` to ranks 1, 3 and 5. - `cast ranks(1:10:2) pdb_command`: send `pdb_command` to the ranks in `range(start=1, stop=10, step=2)`. - `cast ranks(pp=2, dp=(1,3), tp=2:8) pdb_command`: send `pdb_command` to ranks with `pp` dim 2, `dp` dim 1 or 3, and `tp` dim in `range(2,8)`. - The debug client is now automatically registered with an actor mesh when that actor mesh is spawned. This means calling `init_debugging(actor_mesh)` is no longer necessary. - Debugging now works with MAST jobs, by enforcing that breakpoints aren't set in `__main__`, and that the file containing the breakpoint exists on the remote host. - The first requirement is due to how `cloudpickle` works -- if an actor endpoint is defined inside `__main__`, `cloudpickle` will serialize it by value instead of by reference. When the code then runs on the remote host, it thinks the location of the code is the user's local `__main__` file, which confuses pdb, because the file doesn't exist at the same path (or may not exist at all) on the remote host. - The second requirement is due to important parts of `pdb`'s implementation relying on the ability to search for the file being debugged on the remote host's file system. - A debugging session for a specific rank is now forcefully exited once the endpoint finishes execution. This contains the debugging experience within user-authored code. It is also necessary for preventing hangs, because if pdb is allowed to continue indefinitely, then control flow will eventually bubble back up to the main asyncio event loop on the worker, at which point everything breaks. - Hitting a breakpoint now automatically enables post-mortem debugging, so any rank that encounters an exception after hitting a breakpoint will automatically stop at the exception. Attaching the debugger to that rank should then provide an experience like `pdb.post_mortem()`. ## Next steps/gaps I'm aware of (reviewers please read): - Indexing debug sessions by rank isn't sustainable, because two actor meshes may simultaneously hit breakpoints on the same rank and cause a collision inside the debug client. - Entering the debug client should happen automatically, rather than requiring the user to do `await debug_client().enter.call_one()`. - Casting pdb commands should ideally leverage `MeshTrait` rather than reimplementing the selection logic. - If a mesh was reshaped/renamed so that its dimension names aren't `hosts` and `gpus` anymore, the debugger should reflect the new shape/names. - The user should be able to enable post-mortem debugging without having to hit a separate breakpoint first. Reviewed By: zdevito Differential Revision: D77568423 fbshipit-source-id: edc05ae7e7ea7f943de2751e01a6486dce7d1903
1 parent b36a3e1 commit 8441179

File tree

9 files changed

+846
-305
lines changed

9 files changed

+846
-305
lines changed

python/monarch/_src/actor/actor_mesh.py

Lines changed: 48 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,6 @@
1212
import inspect
1313
import logging
1414
import random
15-
import sys
1615
import traceback
1716

1817
from dataclasses import dataclass
@@ -53,15 +52,12 @@
5352
from monarch._rust_bindings.monarch_hyperactor.telemetry import enter_span, exit_span
5453
from monarch._src.actor.allocator import LocalAllocator, ProcessAllocator
5554
from monarch._src.actor.future import Future
56-
from monarch._src.actor.pdb_wrapper import remote_breakpointhook
55+
from monarch._src.actor.pdb_wrapper import PdbWrapper
5756

5857
from monarch._src.actor.pickle import flatten, unpickle
5958

6059
from monarch._src.actor.shape import MeshTrait, NDSlice
6160

62-
if TYPE_CHECKING:
63-
from monarch._src.actor.debugger import DebugClient
64-
6561
logger: logging.Logger = logging.getLogger(__name__)
6662

6763
Allocator = ProcessAllocator | LocalAllocator
@@ -97,6 +93,23 @@ def get() -> "MonarchContext":
9793
)
9894

9995

96+
@dataclass
97+
class DebugContext:
98+
pdb_wrapper: Optional[PdbWrapper] = None
99+
100+
@staticmethod
101+
def get() -> "DebugContext":
102+
return _debug_context.get()
103+
104+
@staticmethod
105+
def set(debug_context: "DebugContext") -> None:
106+
_debug_context.set(debug_context)
107+
108+
109+
_debug_context: contextvars.ContextVar[DebugContext] = contextvars.ContextVar(
110+
"monarch.actor_mesh._debug_context"
111+
)
112+
100113
T = TypeVar("T")
101114
P = ParamSpec("P")
102115
R = TypeVar("R")
@@ -538,6 +551,8 @@ async def handle_cast(
538551
)
539552
_context.set(ctx)
540553

554+
DebugContext.set(DebugContext())
555+
541556
args, kwargs = unpickle(message.message, mailbox)
542557

543558
if message.method == "__init__":
@@ -574,9 +589,10 @@ async def instrumented():
574589
)
575590
try:
576591
result = await the_method(self.instance, *args, **kwargs)
592+
self._maybe_exit_debugger()
577593
except Exception as e:
578594
logging.critical(
579-
"Unahndled exception in actor endpoint",
595+
"Unhandled exception in actor endpoint",
580596
exc_info=e,
581597
)
582598
raise e
@@ -589,11 +605,13 @@ async def instrumented():
589605
the_method.__module__, message.method, str(ctx.mailbox.actor_id)
590606
)
591607
result = the_method(self.instance, *args, **kwargs)
608+
self._maybe_exit_debugger()
592609
exit_span()
593610

594611
if port is not None:
595612
port.send("result", result)
596613
except Exception as e:
614+
self._post_mortem_debug(e.__traceback__)
597615
traceback.print_exc()
598616
s = ActorError(e)
599617

@@ -604,6 +622,7 @@ async def instrumented():
604622
else:
605623
raise s from None
606624
except BaseException as e:
625+
self._post_mortem_debug(e.__traceback__)
607626
# A BaseException can be thrown in the case of a Rust panic.
608627
# In this case, we need a way to signal the panic to the Rust side.
609628
# See [Panics in async endpoints]
@@ -614,6 +633,29 @@ async def instrumented():
614633
pass
615634
raise
616635

636+
def _maybe_exit_debugger(self, do_continue=True) -> None:
637+
if (pdb_wrapper := DebugContext.get().pdb_wrapper) is not None:
638+
if do_continue:
639+
pdb_wrapper.clear_all_breaks()
640+
pdb_wrapper.do_continue("")
641+
pdb_wrapper.end_debug_session()
642+
DebugContext.set(DebugContext())
643+
644+
def _post_mortem_debug(self, exc_tb) -> None:
645+
from monarch._src.actor.debugger import DebugManager
646+
647+
if (pdb_wrapper := DebugContext.get().pdb_wrapper) is not None:
648+
ctx = MonarchContext.get()
649+
pdb_wrapper = PdbWrapper(
650+
ctx.point.rank,
651+
ctx.point.shape.coordinates(ctx.point.rank),
652+
ctx.mailbox.actor_id,
653+
DebugManager.ref().get_debug_client.call_one().get(),
654+
)
655+
DebugContext.set(DebugContext(pdb_wrapper))
656+
pdb_wrapper.post_mortem(exc_tb)
657+
self._maybe_exit_debugger(do_continue=False)
658+
617659

618660
def _is_mailbox(x: object) -> bool:
619661
return isinstance(x, Mailbox)
@@ -648,19 +690,6 @@ def _new_with_shape(self, shape: Shape) -> "ActorMeshRef":
648690
"actor implementations are not meshes, but we can't convince the typechecker of it..."
649691
)
650692

651-
@endpoint # pyre-ignore
652-
def _set_debug_client(self, client: "DebugClient") -> None:
653-
point = MonarchContext.get().point
654-
# For some reason, using a lambda instead of functools.partial
655-
# confuses the pdb wrapper implementation.
656-
sys.breakpointhook = functools.partial( # pyre-ignore
657-
remote_breakpointhook,
658-
point.rank,
659-
point.shape.coordinates(point.rank),
660-
MonarchContext.get().mailbox.actor_id,
661-
client,
662-
)
663-
664693

665694
class ActorMeshRef(MeshTrait, Generic[T]):
666695
def __init__(

python/monarch/_src/actor/bootstrap_main.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,10 @@ def invoke_main():
5454
except Exception as e:
5555
logging.warning(f"Failed to set up py-spy: {e}")
5656

57+
from monarch._src.actor.debugger import remote_breakpointhook
58+
59+
sys.breakpointhook = remote_breakpointhook
60+
5761
# Start an event loop for PythonActors to use.
5862
asyncio.run(main())
5963

0 commit comments

Comments
 (0)