Hello everyone, thanks for sharing this work.
I am trying to benchmark it on a different dataset/task. For now, I am mainly concerned with the latency numbers.
I am comparing Qwen2-VL and NVILA on the same machine, with an A100 80GB GPU.
Based on the results in the paper, I would expect a significant speedup over Qwen, but I am not seeing one.
I ran the same experiments using the provided vila-infer and got the same numbers.
The only thing I noticed in my environment is these two warnings:
```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False.
```
Could Flash Attention be the problem? Any clues on how to fix it?
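For reference, this is roughly the loading path I would expect to avoid the first warning, shown here as a minimal sketch with Qwen2-VL (I assume NVILA's loader accepts similar arguments):

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Loading directly onto the GPU avoids the "model not initialized on GPU"
# warning; initializing on CPU and then calling model.to('cuda') should
# also work, per the warning text.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```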
Thank you,
Uélison
Latency can vary significantly depending on factors like the number of image tokens and output tokens. In our paper, we made sure to align both the number of image tokens and output tokens as closely as possible when comparing against other baselines.
Could you share more details about your specific benchmarking setup? For example, Qwen2-VL includes a configurable setting that can drastically change the number of visual tokens it generates. Similarly, the NVILA-lite models tend to produce significantly fewer tokens overall, which can also impact latency.
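For instance, Qwen2-VL's visual-token count can be bounded through the processor's pixel budget; a minimal sketch (the bounds here are illustrative, not the settings used in the paper):

```python
from transformers import AutoProcessor

# Qwen2-VL maps each 28x28 image region to one visual token, so bounding
# the pixel budget bounds the number of image tokens per image.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,   # illustrative lower bound
    max_pixels=1280 * 28 * 28,  # illustrative upper bound
)
```

Aligning this budget, along with the number of generated output tokens, across both models should make the latency comparison much closer to apples-to-apples.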