Description
Hello! This is very good work and a comprehensive analysis of how to design a hybrid visual encoder; it gave me a lot of inspiration. After reading the paper, I have some questions about the pre-alignment training:

1. What is the difference between the pre-alignment training in stage one and the pre-training in stage two? Both stages unfreeze the visual encoders and the projector, and both are supervised with the standard autoregressive loss, i.e., next-token prediction.

2. For pre-alignment training, why introduce an additional, smaller LLM (Vicuna-7B in practice) for alignment instead of using the model's original LLM? Since both are 7B models, the purpose of this step is not very clear to me.

Looking forward to your reply!
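To make question 1 concrete, here is a minimal sketch of my reading of the recipe (the module names and stage split are my assumptions from the paper, not taken from the code release): from the optimizer's point of view, both stages seem to train the same set of modules.

```python
# Sketch of my understanding (assumptions, not the authors' code):
# which modules receive gradients in each training stage.

MODULES = ["visual_encoder", "projector", "llm"]

def trainable_modules(stage: str) -> set:
    """Return the set of modules unfrozen in a given stage."""
    if stage == "stage1_pre_alignment":
        # Pre-alignment: visual encoder + projector trained against a
        # separate smaller LLM (Vicuna-7B, as I read the paper).
        return {"visual_encoder", "projector"}
    if stage == "stage2_pretraining":
        # Pre-training: the same modules are unfrozen, which is
        # exactly why the two stages look identical to me.
        return {"visual_encoder", "projector"}
    raise ValueError(f"unknown stage: {stage}")

# If this is right, the two stages differ only in which LLM sits
# behind the projector, not in what is trained.
same = trainable_modules("stage1_pre_alignment") == trainable_modules("stage2_pretraining")
print(same)
```

If my reading is wrong and some module is frozen differently in one of the stages, that would already answer part of the question.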
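For clarity, by "next-token-prediction supervision" in both stages I mean the usual average negative log-likelihood of the target tokens; a plain-Python sketch (my formulation, just to fix notation):

```python
import math

def next_token_loss(logits, targets):
    """Average negative log-likelihood of the target next tokens.

    logits:  list of per-position score lists (one row of vocabulary
             scores per predicted position)
    targets: list of target token ids, one per position
    """
    total = 0.0
    for scores, t in zip(logits, targets):
        # log-sum-exp with max subtraction for numerical stability
        z = max(scores)
        log_norm = z + math.log(sum(math.exp(s - z) for s in scores))
        total += log_norm - scores[t]
    return total / len(targets)

# Uniform scores over a 2-token vocabulary give a loss of log(2).
print(next_token_loss([[0.0, 0.0]], [0]))
```

My understanding is that this same objective supervises both stage one and stage two, which is what makes the distinction between them unclear to me.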