Relative Quality Of Local Models (Open Discussion) #48
Replies: 4 comments 1 reply
- Thanks for doing this eval! Super helpful! I saw similar results, but wasn't sure whether 7 GB of RAM is too much for most users. Would love to hear other people's experiences with other models as well.
- I have enough free VRAM on a different machine in my house to try 7b vs 3b; it hallucinates less, but still isn't perfect.
- I have tried 7b and 3b and could not notice a meaningful difference; both produce unusable results for me. Performance was not an issue, but the hallucination was from another planet. Everything was strictly categorized as work, and, more importantly, apps were constantly being invented that had nothing to do with what I was working on.
- Hey everyone, the Qwen3-VL family of models just dropped, and it includes 4B and 8B variants. The benchmarks look extremely promising. I will be testing them out, and hopefully we can move to these as defaults, which would greatly improve quality.
Report
The recommended model (qwen2.5vl:3b) is sometimes prone to hallucination, both in task tracking and in app use, sometimes inventing tasks involving apps that are not active or open. (This may also be due to larger screen displays making text less legible.)
Discovery
Replacing this with qwen2.5vl:7b (on a 32 GB M1) has produced much more reliable tracking of tasks, with better text recognition (brand names, site names, task content) and more dependable assumptions about task flow. If you can spare the memory, this is a good choice for running local models, at least via Ollama.
How To Change Your Local Model?
Reset onboarding, download a different model, then target it in Settings under "Other". See #38 for more detailed instructions.
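For the download step, a minimal sketch of the Ollama side, assuming Ollama is installed and on your PATH (the model tags are the ones discussed in this thread):

```shell
# Pull the larger vision model discussed above
ollama pull qwen2.5vl:7b

# Confirm it now appears in the local model list
ollama list

# Optional sanity check that the model loads and responds
ollama run qwen2.5vl:7b "Reply with the word OK."
```

After the pull completes, the tag `qwen2.5vl:7b` is what you would enter in Settings under "Other".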
It would be interesting to compare other vision-capable models for their reliability, both via Ollama and LM Studio. It would also be interesting to compare Ollama and LM Studio for their relative power usage. Please share your experiences.
Experience
MacBook M1 / 32 GB

Ollama - qwen2.5vl:3b
Some hallucination (apps cited when not in use, tasks invented). Summaries are somewhat generic and reasonably accurate ~75% of the time.
Memory: TBC. 12 hr power impact: Ollama TBC, Dayflow TBC.

Ollama - qwen2.5vl:7b
Accurate OCR (correct apps, websites and/or brands recognised and recorded), better task recognition. More detailed, accurate summaries.
Memory: 6.56 GB. 12 hr power impact: Ollama 25.13, Dayflow 4.83.