The bot often gives short answers in chat mode and this makes the bot very boring #8733
-
In Llama 2 7B, the bot often gives short answers, and this makes the bot very boring. For example: User: Tell me about your day? I know that the --predict N parameter controls the number of tokens to generate. However, in chat mode it does not apply, because the response ends as soon as the --reverse-prompt is generated. I wonder if there is a way to defer the generation of the reverse prompt in order to control the minimum response length? To achieve something like this: User: Tell me about your day?
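For reference, a chat session of this kind is typically started with a command along these lines (the model path, prompt text, and parameter values are only illustrative):
./llama.cpp/llama-cli --model ./models/llama-2-7b-chat.Q4_K_M.gguf -i --reverse-prompt 'User:' --predict 256 --ctx_size 4096 -p 'Transcript of a dialog between User and Bot.'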
-
@Zapotecatl You can increase the temperature a bit and decrease --top_k. Also, I would switch to llama3-instruct; llama2 is obsolete. You can run it with something like this:
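For example (the model file name and the sampling values here are just placeholders, adjust them for your setup):
./llama.cpp/llama-cli --model ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --n-gpu-layers 33 -cnv --chat-template llama3 --ctx_size 8000 --temp 0.9 --top_k 20 -p 'You are a helpful assistant.'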
-
@Zapotecatl Just in case you want to try llama3.1, you can run it like so; it works well now:
./llama.cpp/llama-cli --model ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --n-gpu-layers 33 -cnv --simple-io -b 2048 --ctx_size 8000 --temp 0.3 -fa -t 6 --top_k 10 --multiline-input --chat-template llama3 -p 'Role and Purpose: You are Alice, a large language model. Your purpose is to assist users by providing information, answering questions, and engaging in meaningful conversations based on the data you were trained on'
Make sure to set the size of the context window appropriately for your hardware: you will need about 24 GB of VRAM to run it with --ctx_size 128000. In the example above I used --ctx_size 8000.