-
Hmm, it was moved to a discussion; maybe that's fine. Though I posted it in enhancements to suggest supporting the model.
-
That would be really nice to have. Also, I don't think CLIP is that far from being finished.
-
Well, I hate to say it given that llava is not even 100% implemented yet, but the release of Sphinx sadly puts llava 1.5 into a corner.
It's also quite manageable in size: at 4-bit quantization it would be just around 5 GB of weights.
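For scale, the back-of-the-envelope estimate is just parameters × bits per weight / 8; the parameter count below is an assumed round number for illustration (not Sphinx's official size), and real quantization formats add some overhead for scales:

```python
# Rough weight-size estimate for a 4-bit quantized model.
# The parameter count is an assumed round number, not an official figure;
# real formats (block-wise scales, zero points) add some overhead on top.
params = 10e9            # assumed number of weights
bits_per_weight = 4
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB")  # ~5.0 GB
```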
http://imagebind-llm.opengvlab.com/
I ran a few tests against llava 1.5: it didn't hallucinate, it mentioned small details llava cannot see, and it correctly detected parts that llava mixed up.
I haven't tested every type of scene, just a few quite difficult ones, specifically to see whether Sphinx can do things llava couldn't. And it was remarkably better.
It appears they use a refined image encoding process that feeds multiple encoders and multiple parts of the image, so the projector learns object locations better and gets both a general view of the image and detailed zoomed views of it.
That said, it should be possible to train the llava-1.5 LLM in the same way.
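Here is a minimal sketch of that idea (not the actual Sphinx code; the encoder dimensions, the 2x2 crop layout, and the projector shape are all made-up placeholders): several visual encoders see both the full image and zoomed sub-crops, and their concatenated features pass through one projector into the LLM embedding space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiEncoderProjector(nn.Module):
    """Toy illustration: multiple encoders over a global view plus zoomed crops,
    concatenated and passed through a single projector into the LLM space."""

    def __init__(self, encoder_dims=(768, 1024), llm_dim=4096):
        super().__init__()
        # Stand-ins for real visual encoders (e.g. ViT variants); dims are assumptions.
        self.encoders = nn.ModuleList(nn.Linear(3 * 224 * 224, d) for d in encoder_dims)
        # One global 224x224 view plus a 2x2 grid of 224x224 crops from a 448x448 input.
        num_views = 1 + 4
        self.projector = nn.Sequential(
            nn.Linear(sum(encoder_dims) * num_views, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image):
        # image: (B, 3, 448, 448)
        global_view = F.interpolate(image, size=224)            # coarse view of the whole scene
        crops = [image[:, :, i * 224:(i + 1) * 224, j * 224:(j + 1) * 224]
                 for i in range(2) for j in range(2)]           # detailed zoomed views
        views = [global_view] + crops
        feats = [enc(v.flatten(1)) for v in views for enc in self.encoders]
        return self.projector(torch.cat(feats, dim=-1))         # (B, llm_dim)


if __name__ == "__main__":
    out = MultiEncoderProjector()(torch.randn(1, 3, 448, 448))
    print(out.shape)  # torch.Size([1, 4096])
```

In a real setup the `nn.Linear` stand-ins would be pretrained vision backbones, and the projector would emit a sequence of visual tokens rather than a single vector; the point is only that combining a global view with zoomed crops gives the projector both coarse layout and fine detail.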