how to reduce hallucinations for a specific 4bit / 8bit model? #3209
-
Not really. A lot of it depends on the model and how it was trained; some models will be better than others. You can look at HuggingFace's Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. Models that have high average scores, or maybe high truthfulness scores, are likely to hallucinate less. Even really huge models with massive amounts of tuning and lots of help, like ChatGPT, still hallucinate a decent amount.

Quantizing less just reduces the quality loss from the original, full-sized version; it doesn't directly relate to hallucinating. Other stuff like sampling parameters can also have an effect. For example, setting the temperature lower may reduce hallucinations. So can the prompt you use: saying things like "Don't make it up if you don't know", telling it to think through its response step by step, etc.

Which approach works, which prompt to use, etc. can depend a lot on which model you're using. There isn't really a one-size-fits-all approach. The best thing to do is experiment and see what works, but you shouldn't expect to be able to eliminate hallucinations no matter what you do.
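For example, with the llama-cpp-python bindings it could look something like this. This is only a minimal sketch, not anything specific to your setup: the model path is a hypothetical placeholder and the parameter values are just starting points to experiment with.

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# Model path and sampling values are placeholders; tune them per model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-7b-model.Q6_K.gguf",  # hypothetical path
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to the GPU if VRAM allows
)

messages = [
    {
        "role": "system",
        "content": (
            "Answer only with facts you are confident about. "
            "If you don't know something, say you don't know instead of making it up."
        ),
    },
    {"role": "user", "content": "Write a short biography of Ada Lovelace."},
]

out = llm.create_chat_completion(
    messages=messages,
    temperature=0.2,  # lower temperature tends to cut down on invented details
    top_p=0.9,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```

Nothing like this guarantees zero hallucination; it just biases the sampling and the prompt toward more conservative answers.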
-
"Don't make it up if you don't know" seems to help a lot! thx. vram 8gb rtx 4060, 16gb ram. ryzen 7640HS i think it's a laptop i'm working on portability. using ubuntu 22.04 "headless" mode to get more vram out of it. i'm looking at P40s for desktop now but saw some compat issues with latest llama.cpp in github issue section. hope it's supported long term. |
Beta Was this translation helpful? Give feedback.
-
Hello. Do you have any additional information on effective prompts for reducing hallucination in the instructions, or do you know of any papers related to this topic? I'm curious whether you still think prompting in the instructions is an effective way to reduce hallucination.
-
how to reduce hallucinations for a specific 4bit / 8bit model?
What parameters should I use? 6_0 quantized 7B models from TheBloke seem OK, but sometimes (maybe 1 out of 10 times) they generate hallucinated stuff, especially with biographical content.
Can anyone give me ideas on how to reduce hallucination? I'm not sure how high the parameter count / bit width would have to be to get 100% zero hallucination. (Is that even possible?)