Skip to content

Commit c649f2b

Browse files
Sam Luryefacebook-github-bot
authored andcommitted
Big debugging update (#456)
Summary: Pull Request resolved: #456 This diff contains several updates to the monarch actor mesh debugging experience. - Improved debug input parsing, with the new `cast` command supporting more sophisticated rank selection grammar: - `cast ranks(3) pdb_command`: send `pdb_command` to rank 3. - `cast ranks(1,3,5) pdb_command`: send `pdb_command` to ranks 1, 3 and 5. - `cast ranks(1:10:2) pdb_command`: send `pdb_command` to the ranks in `range(start=1, stop=10, step=2)`. - `cast ranks(pp=2, dp=(1,3), tp=2:8) pdb_command`: send `pdb_command` to ranks with `pp` dim 2, `dp` dim 1 or 3, and `tp` dim in `range(2,8)`. - The debug client is now automatically registered with an actor mesh when that actor mesh is spawned. This means calling `init_debugging(actor_mesh)` is no longer necessary. - Debugging now works with MAST jobs, by enforcing that breakpoints aren't set in `__main__`, and that the file containing the breakpoint exists on the remote host. - The first requirement is due to how `cloudpickle` works -- if an actor endpoint is defined inside `__main__`, `cloudpickle` will serialize it by value instead of by reference. When the code then runs on the remote host, it thinks the location of the code is the user's local `__main__` file, which confuses pdb, because the file doesn't exist at the same path (or may not exist at all) on the remote host. - The second requirement is due to important parts of `pdb`'s implementation relying on the ability to search for the file being debugged on the remote host's file system. - A debugging session for a specific rank is now forcefully exited once the endpoint finishes execution. This contains the debugging experience within user-authored code. It is also necessary for preventing hangs, because if pdb is allowed to continue indefinitely, then control flow will eventually bubble back up to the main asyncio event loop on the worker, at which point everything breaks. - Hitting a breakpoint now automatically enables post-mortem debugging, so any rank that encounters an exception after hitting a breakpoint will automatically stop at the exception. Attaching the debugger to that rank should then provide an experience like `pdb.post_mortem()`. ## Next steps/gaps I'm aware of (reviewers please read): - Indexing debug sessions by rank isn't sustainable, because two actor meshes may simultaneously hit breakpoints on the same rank and cause a collision inside the debug client. - Entering the debug client should happen automatically, rather than requiring the user to do `await debug_client().enter.call_one()`. - Casting pdb commands should ideally leverage `MeshTrait` rather than reimplementing the selection logic. - If a mesh was reshaped/renamed so that its dimension names aren't `hosts` and `gpus` anymore, the debugger should reflect the new shape/names. - The user should be able to enable post-mortem debugging without having to hit a separate breakpoint first. Differential Revision: D77568423
1 parent 1e06a7c commit c649f2b

File tree

7 files changed

+773
-293
lines changed

7 files changed

+773
-293
lines changed

python/monarch/_src/actor/actor_mesh.py

Lines changed: 57 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
import functools
1212
import inspect
1313
import logging
14+
import os
1415
import random
1516
import sys
1617
import traceback
@@ -63,7 +64,7 @@
6364

6465
from monarch._src.actor.allocator import LocalAllocator, ProcessAllocator
6566
from monarch._src.actor.future import Future
66-
from monarch._src.actor.pdb_wrapper import remote_breakpointhook
67+
from monarch._src.actor.pdb_wrapper import PdbWrapper
6768

6869
from monarch._src.actor.pickle import flatten, unpickle
6970

@@ -584,9 +585,10 @@ async def instrumented():
584585
)
585586
try:
586587
result = await the_method(self.instance, *args, **kwargs)
588+
self.instance._maybe_exit_debugger()
587589
except Exception as e:
588590
logging.critical(
589-
"Unahndled exception in actor endpoint",
591+
"Unhandled exception in actor endpoint",
590592
exc_info=e,
591593
)
592594
raise e
@@ -599,11 +601,15 @@ async def instrumented():
599601
the_method.__module__, message.method, str(ctx.mailbox.actor_id)
600602
)
601603
result = the_method(self.instance, *args, **kwargs)
604+
# pyre-ignore
605+
self.instance._maybe_exit_debugger()
602606
exit_span()
603607

604608
if port is not None:
605609
port.send("result", result)
606610
except Exception as e:
611+
# pyre-ignore
612+
self.instance._post_mortem_debug(e.__traceback__)
607613
traceback.print_exc()
608614
s = ActorError(e)
609615

@@ -614,6 +620,7 @@ async def instrumented():
614620
else:
615621
raise s from None
616622
except BaseException as e:
623+
self.instance._post_mortem_debug(e.__traceback__)
617624
# A BaseException can be thrown in the case of a Rust panic.
618625
# In this case, we need a way to signal the panic to the Rust side.
619626
# See [Panics in async endpoints]
@@ -659,18 +666,63 @@ def _new_with_shape(self, shape: Shape) -> "ActorMeshRef":
659666
)
660667

661668
@endpoint # pyre-ignore
662-
def _set_debug_client(self, client: "DebugClient") -> None:
669+
def _set_debug_client(self, client: "ActorMeshRef[DebugClient]") -> None:
663670
point = MonarchContext.get().point
664671
# For some reason, using a lambda instead of functools.partial
665672
# confuses the pdb wrapper implementation.
666-
sys.breakpointhook = functools.partial( # pyre-ignore
667-
remote_breakpointhook,
673+
sys.breakpointhook = functools.partial(
674+
self._remote_breakpointhook,
668675
point.rank,
669676
point.shape.coordinates(point.rank),
670677
MonarchContext.get().mailbox.actor_id,
671678
client,
672679
)
673680

681+
def _remote_breakpointhook(
682+
self,
683+
rank: int,
684+
coords: Dict[str, int],
685+
actor_id: ActorId,
686+
client_ref: "DebugClient",
687+
) -> None:
688+
frame = inspect.currentframe()
689+
assert frame is not None
690+
frame = frame.f_back
691+
assert frame is not None
692+
file = frame.f_code.co_filename
693+
line = frame.f_lineno
694+
module = frame.f_globals.get("__name__", "__main__")
695+
if module == "__main__" and not os.path.exists(file):
696+
raise NotImplementedError(
697+
f"Remote debugging not supported for breakpoint at {file}:{line} because "
698+
f"it is defined inside __main__, and the file does not exist on the host. "
699+
"In this case, cloudpickle serialization does not interact nicely with pdb. "
700+
"To debug your code, move it out of __main__ and into a module that "
701+
"exists on both your client and worker processes."
702+
)
703+
704+
# pyre-ignore
705+
self._pdb_wrapper_args = (rank, coords, actor_id, client_ref)
706+
# pyre-ignore
707+
self._pdb_wrapper = PdbWrapper(*self._pdb_wrapper_args)
708+
self._pdb_wrapper.set_trace(frame)
709+
710+
def _post_mortem_debug(self, exc_tb) -> None:
711+
if hasattr(self, "_pdb_wrapper"):
712+
# pyre-ignore
713+
self._pdb_wrapper = PdbWrapper(*self._pdb_wrapper_args)
714+
self._pdb_wrapper.post_mortem(exc_tb)
715+
self._maybe_exit_debugger(do_continue=False)
716+
717+
def _maybe_exit_debugger(self, do_continue=True) -> None:
718+
if hasattr(self, "_pdb_wrapper"):
719+
if do_continue:
720+
# pyre-ignore
721+
self._pdb_wrapper.clear_all_breaks()
722+
self._pdb_wrapper.do_continue(None)
723+
self._pdb_wrapper.end_debug_session()
724+
del self._pdb_wrapper
725+
674726

675727
class ActorMeshRef(MeshTrait, Generic[T]):
676728
def __init__(

0 commit comments

Comments
 (0)