When using llamafactory-cli for training inside the container, the GPU memory usage shown by nvidia-smi inside the container is inconsistent with that on the host machine, and the GPU memory usage cannot be controlled.

**Please provide an in-depth description of the question you have**:

When training with llamafactory-cli inside the container, the GPU memory usage that nvidia-smi reports inside the container does not match what it reports on the host machine, and the GPU memory usage cannot be controlled.

In addition, I don’t see the training PID inside the container. Could you tell me why this is happening?
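To make the two readings directly comparable, something like the following could be run both inside the container and on the node. This is only a rough sketch: the helper names `parse_gpu_memory` and `read_gpu_memory` are hypothetical, and it assumes `nvidia-smi` is on the `PATH` and returns numeric values for the queried fields.

```python
import subprocess

# Ask nvidia-smi for per-GPU memory as plain CSV (values in MiB, no header, no units).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines like '0, 1234, 24576' into {index: (used_mib, total_mib)}."""
    result = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        result[int(index)] = (int(used), int(total))
    return result


def read_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    for index, (used, total) in sorted(read_gpu_memory().items()):
        print(f"GPU {index}: {used} MiB used / {total} MiB total")
```

Running the same script in both places while training is active gives two sets of numbers that can be pasted side by side instead of relying on screenshots.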

Container:

<img width="912" height="392" alt="Image" src="https://github.com/user-attachments/assets/a2d51589-60ea-461a-a850-d601de573666" />

Node:

<img width="1270" height="574" alt="Image" src="https://github.com/user-attachments/assets/5bc918a8-4de4-4fae-9404-ad32cf9a94e2" />


**Environment**:
- HAMi version: v2.6.1
- Kubernetes version: v1.33.2


When using llamafactory-cli for training inside the container, the GPU memory usage shown by nvidia-smi inside the container is inconsistent with that on the host machine, and the GPU memory usage cannot be controlled. #1322
