Relative Quality Of Local Models (Open Discussion) #48
Replies: 4 comments 1 reply
- Thanks for doing this eval! Super helpful! I saw similar results, but wasn't sure whether 7 GB of RAM is too much for most users. Would love to hear other people's experiences with other models as well.
- I have enough free VRAM on a different machine in my house to try 7b vs 3b; it hallucinates less, but still isn't perfect.
- I have tried 7b and 3b and could not notice a meaningful difference; both produce unusable results for me. Performance was not an issue, but the hallucination was from another planet. Everything was strictly categorized as work, and, more importantly, apps were constantly being invented that had nothing to do with what I was working on.
- Hey everyone, the Qwen3-VL family of models just dropped, and it includes 4B and 8B variants. The benchmarks look extremely promising. I will be testing them out, and hopefully we can move to these as defaults, which would greatly improve quality.
Report
The recommended model (qwen2.5vl:3b) is sometimes prone to hallucination, both in task tracking and in app use, sometimes inventing tasks involving apps that are not active or open. (This may also be due to larger screen displays making text less legible.)
Discovery
Replacing this with qwen2.5vl:7b (on a 32 GB M1) has produced much more reliable tracking of tasks, with better text recognition (brand names, site names, task content) and more dependable assumptions about task flow. If you can spare the memory, this is a good choice for running local models, at least via Ollama.
How To Change Your Local Model?
Reset onboarding, download a different model, then target it in Settings under "Other". See #38 for more detailed instructions.
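For the download step, a minimal sketch of the Ollama side, assuming Ollama is installed and on your PATH (the model tags are the ones discussed in this thread):

```shell
# Pull the larger vision model discussed above
ollama pull qwen2.5vl:7b

# Confirm it now appears in the local model list
ollama list

# Optional sanity check that the model loads and responds
ollama run qwen2.5vl:7b "Reply with the word OK."
```

After the pull completes, the tag `qwen2.5vl:7b` is what you would enter in Settings under "Other".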
It would be interesting to compare other vision-capable models for their reliability, both via Ollama and LM Studio. It would also be interesting to compare Ollama and LM Studio for their relative power usage. Please share your experiences.
Experience
MacBook M1 / 32 GB

Ollama - qwen2.5vl:3b
Some hallucination (apps cited when not in use, tasks invented). Summaries are somewhat generic and reasonably accurate ~75% of the time.
Memory: TBC. 12 hr power impact: Ollama TBC, Dayflow TBC.

Ollama - qwen2.5vl:7b
Accurate OCR (correct apps, websites and/or brands recognised and recorded), better task recognition. More detailed, accurate summaries.
Memory: 6.56 GB. 12 hr power impact: Ollama 25.13, Dayflow 4.83.