Performance Issues running NVILA #230


Open
UelisonSantos opened this issue Apr 22, 2025 · 1 comment

@UelisonSantos

Hello everyone, thanks for sharing this work.

I am trying to benchmark it using a different dataset/task. For now, I am more concerned about the latency numbers.

[Image: latency benchmark results comparing Qwen2-VL and NVILA]

I am comparing Qwen2-VL and NVILA on the same machine with an A100 80GB GPU.
Based on the results in the paper, I would expect a significant speedup over Qwen2-VL, but I am not seeing one.

I ran the same experiments using the provided vila-infer and got the same numbers.

The only thing I noticed in my environment is these two warnings:

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False

Could Flash Attention be the problem? Any clues on how to fix it?
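
In case it helps, my understanding is that the first warning refers to a loading pattern like the one below (a minimal sketch, assuming the standard Hugging Face transformers API; the checkpoint path is a placeholder, not the actual NVILA loading code):

```python
import torch
from transformers import AutoModelForCausalLM

# Loading onto CPU with Flash Attention 2 enabled triggers the warning;
# moving the model to the GPU afterwards, as the message suggests, avoids it.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/checkpoint",                    # placeholder checkpoint path
    torch_dtype=torch.float16,               # Flash Attention 2 requires fp16/bf16
    attn_implementation="flash_attention_2",
)
model = model.to("cuda")
```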

Thank you,
Uélison

@zhijian-liu
Collaborator

Latency can vary significantly depending on factors like the number of image tokens and output tokens. In our paper, we made sure to align both the number of image tokens and output tokens as closely as possible when comparing against other baselines.

Could you share more details about your specific benchmarking setup? For example, Qwen2-VL includes a configurable setting that can drastically change the number of visual tokens it generates. Similarly, the NVILA-lite models tend to produce significantly fewer tokens overall, which can also impact latency.
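
For instance (a minimal sketch, assuming the Hugging Face Qwen2-VL processor; the pixel budgets below are illustrative, not recommended values), Qwen2-VL lets you bound the image resolution, and therefore the number of visual tokens, via min_pixels/max_pixels:

```python
from transformers import AutoProcessor

# Each 28x28-pixel region becomes one visual token after patch merging,
# so capping max_pixels directly caps how many image tokens are generated.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,    # illustrative lower bound
    max_pixels=1280 * 28 * 28,   # illustrative upper bound; lower it to shrink the token count
)
```

Aligning these budgets (and the number of output tokens, e.g. via max_new_tokens at generation time) across both models should make the latency numbers directly comparable.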
