
Placeholder tracing segmentation faults #9049


Open
rpsilva-aws opened this issue Apr 28, 2025 · 3 comments
Labels
enhancement (New feature or request), runtime

Comments

@rpsilva-aws
Collaborator

rpsilva-aws commented Apr 28, 2025

🐛 Bug

There is a segmentation fault when attempting to dump the HLO for a computation that involves placeholder tensors. The issue seems to originate from the PjRt computation client, as it attempts to execute the computation. Execution requires every argument's PjRt data to carry a backing buffer, which placeholders do not have. We should not be materializing anything when we are only tracing and returning the HLO, so this may be undesired behavior.

To Reproduce

import torch
import torch_xla
from torch_xla.core.xla_builder import create_placeholder_tensor

shape = (32, 32)
dtype = torch.float32
# The placeholder participates in the traced graph but has no backing buffer.
p = create_placeholder_tensor(shape, dtype)
result = p + 1.0
# Segfaults while fetching the HLO for `result`.
print(torch_xla._XLAC._get_xla_tensors_hlo([result]))

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CPU
  • torch_xla version: 2.6+

Additional context

@ysiraichi added the bug (Something isn't working), runtime, and enhancement (New feature or request) labels and removed the bug (Something isn't working) label on Apr 29, 2025
@ysiraichi
Collaborator

I'm not really familiar with the innards of PjRt, but is this undesired behavior, though? My guess is that _get_xla_tensors_hlo() was implemented under the assumption that the parameters were not placeholders.

Also, I don't think we are materializing anything when tracing. That said, I agree that, in theory, we wouldn't need to materialize the tensors just for generating the HLO. In this sense, I think this is more a feature request than a bug.

@rpsilva-aws
Collaborator Author

rpsilva-aws commented Apr 29, 2025

I meant that it is part of the computation execution, which was somewhat surprising since the intention was only to trace all participating XLA tensors and their respective IDs. There is probably a good reason to also exercise this path. Do you have any idea why?

We fail here:

XLA_CHECK(pjrt_device == pjrt_data->buffer->device())

Since placeholder tensors do not have a PjRt buffer (even an empty buffer might not work, as it is derived from the shape), there are some undesired semantics here, IMO. This would require a bit more investigation.

I agree and understand the point, but at the same time, this is a new feature that hits a seg fault when used with existing APIs, which is more concerning. Perhaps the real question is why we execute the computation in this API at all. I think we can shift this to a feature request once we first introduce a check that captures the unsupported case. I could raise a PR with a simple: XLA_CHECK(pjrt_data->buffer != nullptr)
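
Roughly like the sketch below, where the surrounding argument loop and variable names are only approximations for illustration (pjrt_data and pjrt_device follow the existing check quoted above), not the actual PjRt computation client code:

// Sketch only: guard each argument before the device check dereferences the
// buffer. The loop and `arguments` are illustrative, not the actual code.
for (auto& argument : arguments) {
  auto pjrt_data = std::dynamic_pointer_cast<PjRtData>(argument);
  // Placeholder tensors carry no backing PjRt buffer, so fail with a clear
  // error instead of segfaulting on the nullptr dereference below.
  XLA_CHECK(pjrt_data->buffer != nullptr)
      << "Cannot execute a computation whose arguments include placeholder "
      << "tensors (no underlying PjRt buffer)";
  XLA_CHECK(pjrt_device == pjrt_data->buffer->device());
}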

@rpsilva-aws
Collaborator Author

rpsilva-aws commented Apr 30, 2025

@ysiraichi I opened #9062 to at least prevent the seg fault, since the use case here seems valid.
