Add debug hook to support dumping tensor data and adding new debug functions easily #5182

Open · wants to merge 4 commits into main
Conversation

HuiGao-NV (Collaborator) commented on Jun 13, 2025:

Add a debug hook to support dumping tensor data and adding new debug functions easily.

To enable dumping of tensor data:

    from tensorrt_llm._torch.debug.debug_hook import debugger_addon, register_tensor_dump_hook

    with debugger_addon(model, DATA_FOLDER):
        register_tensor_dump_hook()
        model.forward()

The dumped data are written under DATA_FOLDER/rank[ID]/....
The data file names follow the pattern:

    [LOOP_COUNT].[model_name]-[OPIDX_IN_MODEL].[OPNAME]-[OPIDX_IN_PRE_OP].[OPNAME]-[input|output].[PARA_NAME].pt

such as 1.LlamaModel-24.LlamaDecoderLayer-2.LlamaAttention-2.Linear-1.AllReduce-input.input.pt.
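Assuming the files are written with torch.save (the .pt extension suggests this, though the PR does not state it), a minimal sketch of loading one dump for offline inspection; the concrete path below just reuses the example file name above and is illustrative:

```python
import torch

# Illustrative path only: reuses the example file name from the
# description above; substitute your own DATA_FOLDER and rank directory.
path = ("DATA_FOLDER/rank0/1.LlamaModel-24.LlamaDecoderLayer"
        "-2.LlamaAttention-2.Linear-1.AllReduce-input.input.pt")
tensor = torch.load(path)
print(tensor.shape, tensor.dtype)
```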

HuiGao-NV requested review from hlu1 and QiJune on June 13, 2025 03:35
HuiGao-NV requested a review from a team as a code owner on June 13, 2025 03:35
Add context manager method to enable debugger

Signed-off-by: Hui Gao <huig@nvidia.com>
    self.layer_inner_counter = []

    self.module_forward_hook_handle = None
    self.module_forward_pre_hook_handle = None
A collaborator commented:
Since the outputs of layer n can be the inputs of layer n+1, do we need to dump the inputs and outputs of all layers at the same time? I guess there will be lots of duplicated results.
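To make the duplication concern concrete, here is a small self-contained illustration in plain PyTorch (a toy model, not this PR's code) showing that in a sequential model the tensor captured as layer n's output is the very same object captured as layer n+1's input:

```python
import torch
import torch.nn as nn

# A sequential toy model: layer 0's output feeds layer 1 directly.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
captured = {}

# Capture layer 0's output via a forward hook and layer 1's input via a
# forward pre-hook, the same mechanisms a dump hook would use.
model[0].register_forward_hook(
    lambda mod, args, out: captured.update(out0=out))
model[1].register_forward_pre_hook(
    lambda mod, args: captured.update(in1=args[0]))

model(torch.randn(2, 4))
print(captured["out0"] is captured["in1"])  # True: the two dumps would duplicate
```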

    self.log_folder = dest_folder
    self.is_forward_pre = True
    self.dump_style: DumpStyle = DumpStyle.BINARY
    self.log_folder_inited: bool = False
A collaborator commented:
Let's unify the naming style: is_forward_pre, is_log_folder_inited.

        return

    if self.log_folder is None:
        import os
A collaborator commented:
Please move import os to the beginning of the file.


    debug_ctx.get_current_modules_tree().clear()
    debug_ctx.get_module_indices_tree().clear()
    for name, submodule in model.named_modules():
A collaborator commented:
If an MLP module contains 2 Linear modules, then the output of the MLP module should be the same as the output of the second Linear module. Is my understanding correct? If so, there will be duplicated dumped tensors.
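For reference, a minimal standalone illustration of this nesting behavior in plain PyTorch (a hypothetical MLP, not this PR's code): hooks registered over a named_modules() traversal fire on both the parent and its children, and the parent's output is the same tensor object as the last child's output.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Hypothetical two-Linear MLP, standing in for a real model block."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 4)

    def forward(self, x):
        return self.fc2(self.fc1(x))

mlp, outs = MLP(), {}

# Register a hook on every module, as a named_modules() traversal does;
# the root module is yielded under the empty name "".
for name, sub in mlp.named_modules():
    sub.register_forward_hook(
        lambda mod, args, out, name=name: outs.update({name: out}))

mlp(torch.randn(2, 4))
print(outs[""] is outs["fc2"])  # True: parent and child dumps duplicate
```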

    # position_ids=position_ids,
    # attn_metadata=attn_metadata)

    @contextmanager
    def debugger_addon(model, dest_folder: Optional[str] = None, filter=None):
A collaborator commented:
How about renaming debugger_addon to debug_mode?
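Whatever the final name, the shape of the API is a standard contextmanager pattern. A hypothetical minimal sketch (illustrative names and hook body, not the PR's actual implementation) of how such a context manager can install hooks on entry and guarantee their removal on exit:

```python
from contextlib import contextmanager

import torch.nn as nn

@contextmanager
def debug_mode(model: nn.Module, dest_folder=None, filter=None):
    # dest_folder kept to mirror the PR's signature; a real hook would
    # write dumps there. Install a forward hook on every (optionally
    # filtered) submodule on entry; remove all handles on exit.
    handles = []
    try:
        for name, submodule in model.named_modules():
            if filter is not None and not filter(name, submodule):
                continue
            handles.append(submodule.register_forward_hook(
                lambda mod, args, out, name=name:
                    print(f"{name}: forward done")))
        yield model
    finally:
        for handle in handles:
            handle.remove()
```

The finally block is what makes the context-manager form attractive: hook handles are removed even if forward() raises, so a debugging session cannot leak hooks into later iterations.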


    tensor_counter += 1
    module_path = "-".join([module_path, tensor_name])
    from pathlib import Path
A collaborator commented:
Please move from pathlib import Path to the beginning of this file.
