Add debug hook to support dump tensor data and add new debug functions easily #5182
Conversation
Add context manager method to enable debugger
Signed-off-by: Hui Gao <huig@nvidia.com>
```python
self.layer_inner_counter = []

self.module_forward_hook_handle = None
self.module_forward_pre_hook_handle = None
```
Since the outputs of layer n are typically the inputs of layer n+1, do we need to dump both the inputs and outputs of every layer at the same time? That would produce a lot of duplicated results.
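For illustration, a minimal sketch (hypothetical two-layer model, not code from this PR) of why dumping both sides duplicates data: with PyTorch's standard hook APIs, layer n's captured output is the very same tensor object as layer n+1's captured input.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
captured = {}

def make_pre_hook(name):
    def pre_hook(module, args):
        # args is the tuple of positional inputs to the module
        captured[f"{name}.input"] = args[0]
    return pre_hook

def make_post_hook(name):
    def post_hook(module, args, output):
        captured[f"{name}.output"] = output
    return post_hook

for name, m in model.named_children():
    m.register_forward_pre_hook(make_pre_hook(name))
    m.register_forward_hook(make_post_hook(name))

model(torch.randn(2, 8))
# The same tensor object is captured twice:
assert captured["0.output"] is captured["1.input"]
```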
```python
self.log_folder = dest_folder
self.is_forward_pre = True
self.dump_style: DumpStyle = DumpStyle.BINARY
self.log_folder_inited: bool = False
```
Let's unify the naming style: `is_forward_pre`, `is_log_folder_inited`.
```python
return

if self.log_folder is None:
    import os
```
Please move `import os` to the beginning of the file.
```python
debug_ctx.get_current_modules_tree().clear()
debug_ctx.get_module_indices_tree().clear()
for name, submodule in model.named_modules():
```
If an `MLP` module contains 2 `Linear` modules, then the output of the `MLP` module should be the same as the output of the second `Linear` module. Is my understanding correct? If so, there will be duplicated dumped tensors.
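A minimal sketch (hypothetical `MLP`, not code from this PR) confirming the nesting concern: when hooks are attached to every entry from `named_modules()`, the parent module's output is the same tensor object as its last child's output, so both would be dumped.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 8)

    def forward(self, x):
        return self.fc2(self.fc1(x))

mlp = MLP()
outputs = {}

def make_hook(name):
    def hook(module, args, output):
        outputs[name] = output
    return hook

# named_modules() yields the root module itself under the name ""
for name, m in mlp.named_modules():
    m.register_forward_hook(make_hook(name))

mlp(torch.randn(2, 8))
# Dumping both entries would write the identical tensor twice:
assert outputs[""] is outputs["fc2"]
```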
```python
# position_ids=position_ids,
# attn_metadata=attn_metadata)

@contextmanager
def debugger_addon(model, dest_folder: Optional[str] = None, filter=None):
```
How about renaming `debugger_addon` to `debug_mode`?
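Whatever name is chosen, a context manager of roughly this shape keeps hook registration scoped to the `with` block. This is only a sketch under assumed behavior, not the PR's actual implementation; the body here registers placeholder hooks where the real code would set up its dump context.

```python
from contextlib import contextmanager
from typing import Optional

@contextmanager
def debug_mode(model, dest_folder: Optional[str] = None, filter=None):
    # On entry: register hooks on every submodule. A real implementation
    # would also set up the dump context (dest_folder, filter, counters).
    handles = [
        m.register_forward_hook(lambda mod, args, out: None)  # placeholder hook
        for _, m in model.named_modules()
    ]
    try:
        yield model
    finally:
        # Hooks are removed even if forward() raises inside the block,
        # so the model is never left in a hooked state.
        for h in handles:
            h.remove()
```

Usage would then read naturally as `with debug_mode(model, DATA_FOLDER): model(x)`.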
```python
tensor_counter += 1
module_path = "-".join([module_path, tensor_name])
from pathlib import Path
```
Please move `from pathlib import Path` to the beginning of this file.
Add a debug hook to support dumping tensor data and adding new debug functions easily.

To enable dumping tensor data:

```python
from tensorrt_llm._torch.debug.debug_hook import debugger_addon, register_tensor_dump_hook

with debugger_addon(model, DATA_FOLDER):
    register_tensor_dump_hook()
    model.forward()
```

The dumped data is placed under `DATA_FOLDER/rank[ID]/...`. The data file names follow the pattern:

```
[LOOP_COUNT].[model_name]-[OPIDX_IN_MODEL].[OPNAME]-[OPIDX_IN_PRE_OP].[OPNAME]-[input|output].[PARA_NAME].pt
```

such as `1.LlamaModel-24.LlamaDecoderLayer-2.LlamaAttention-2.Linear-1.AllReduce-input.input.pt`.
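For readers unfamiliar with the mechanism, the sketch below shows the general idea behind such a dump hook; it is not this PR's implementation, and the helper `attach_dump_hooks` and its file-naming scheme are hypothetical, only approximating the per-rank layout and pattern described above.

```python
import os
import torch
import torch.distributed as dist

def attach_dump_hooks(model, data_folder):
    # One subfolder per rank, mirroring the DATA_FOLDER/rank[ID]/ layout
    rank = dist.get_rank() if dist.is_initialized() else 0
    folder = os.path.join(data_folder, f"rank{rank}")
    os.makedirs(folder, exist_ok=True)
    counter = {"n": 0}

    def dump(name, output):
        # Only plain tensors are handled in this sketch; the real hook
        # would also walk tuples, lists, and dataclass outputs.
        if isinstance(output, torch.Tensor):
            fname = f"{counter['n']}.{name}-output.pt"
            torch.save(output.detach().cpu(), os.path.join(folder, fname))
            counter["n"] += 1

    handles = [
        m.register_forward_hook(
            lambda mod, args, out, name=name: dump(name, out))
        for name, m in model.named_modules()
    ]
    return handles  # call handle.remove() on each to detach the hooks
```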