
Does Llama 3.2 vision use CLIP? #2462

Answered by pbontrager
vedantroy asked this question in Q&A
  1. The base vision encoder in Llama3.2 is a variant of OpenAI CLIP. So we implement the generic model, and then in the Llama3.2 encoder builder we use the hyperparameters chosen by the Llama team rather than those of the original CLIP model (see the first sketch after this list). The Llama CLIP is pre-trained in the same way as OpenAI CLIP and then further post-trained end to end with the encoder head and decoder.

  2. The diagram looks good, where I'm assuming that "t Transformer" is the transformer inside of Llama3VisionProjectionHead. Also, the cross-attention with the decoder only happens on certain layers of the decoder (a rough sketch of this flow follows below). If you want to dig even deeper into how it works, you can look at how images get broken into tiles, which CLIP converts into…
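
For (1), here's a minimal sketch of that builder pattern in plain PyTorch. These are stand-ins, not torchtune's actual classes, and the hyperparameter values are placeholders rather than the real Llama3.2 settings:

```python
# Sketch only: a generic CLIP-style ViT encoder plus a Llama3.2-specific "builder"
# that fills in Llama-chosen hyperparameters instead of OpenAI CLIP's.
# Plain PyTorch stand-ins; all numbers are illustrative placeholders.
import torch
from torch import nn


class ClipStyleVisionEncoder(nn.Module):
    """Generic ViT encoder: patchify -> transformer layers -> patch token embeddings."""

    def __init__(self, patch_size: int, embed_dim: int, num_layers: int, num_heads: int):
        super().__init__()
        # Conv patch embedding, as in CLIP's ViT
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> patch tokens: (batch, num_patches, embed_dim)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        return self.transformer(tokens)


def llama3_2_vision_encoder_builder() -> ClipStyleVisionEncoder:
    # Hypothetical builder: same generic model, but configured with the hyperparameters
    # the Llama team chose rather than the original CLIP values (placeholders here).
    return ClipStyleVisionEncoder(patch_size=14, embed_dim=1280, num_layers=32, num_heads=16)
```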
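For (2), a rough sketch of the flow with plain PyTorch stand-ins: the tile tokens come out of the CLIP-style encoder, a small transformer projection head (standing in for Llama3VisionProjectionHead, not the real class) maps them to the decoder's embedding size, and only selected decoder layers cross-attend to them. All shapes and sizes below are placeholders.

```python
# Sketch only: projection head + decoder layer with optional cross-attention.
import torch
from torch import nn


class ProjectionHead(nn.Module):
    """Stand-in for Llama3VisionProjectionHead: small transformer + linear to decoder dim."""

    def __init__(self, vision_dim: int, decoder_dim: int, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(vision_dim, decoder_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(self.transformer(vision_tokens))


class DecoderLayerWithOptionalCrossAttn(nn.Module):
    """Decoder layer that cross-attends to vision tokens only if enabled for this layer."""

    def __init__(self, dim: int, num_heads: int, use_cross_attn: bool):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.use_cross_attn = use_cross_attn
        if use_cross_attn:
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        x = x + self.self_attn(x, x, x)[0]
        if self.use_cross_attn:
            # Text tokens query the projected vision tokens on this layer only.
            x = x + self.cross_attn(x, vision_tokens, vision_tokens)[0]
        return x


# Toy shapes: 1 image split into 4 tiles, each tile already encoded into 100 vision
# tokens of width 1280; decoder hidden size 4096 (all placeholder numbers).
vision_tokens = torch.randn(1, 4 * 100, 1280)
projected = ProjectionHead(vision_dim=1280, decoder_dim=4096)(vision_tokens)
text_hidden = torch.randn(1, 16, 4096)
layer = DecoderLayerWithOptionalCrossAttn(dim=4096, num_heads=32, use_cross_attn=True)
out = layer(text_hidden, projected)  # this layer cross-attends; others would skip it
```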
