The bot often gives short answers in chat mode and this makes the bot very boring #8733
-
In Llama 2 7B, the bot often gives short answers, and this makes the bot very boring. For example: User: Tell me about your day? I know that the --predict N parameter controls the number of tokens to generate. However, in chat mode it does not apply, because the response ends as soon as the --reverse-prompt is generated. I wonder if there is a way to defer the generation of the reverse prompt in order to control the minimum response length? To achieve something like this: User: Tell me about your day?
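For reference, a chat session of this kind is typically started with a command along these lines (the model path, prompt text, and parameter values are only illustrative):
./llama.cpp/llama-cli --model ./models/llama-2-7b-chat.Q4_K_M.gguf -i --reverse-prompt 'User:' --predict 256 --ctx_size 4096 -p 'Transcript of a dialog between User and Bot.'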
-
@Zapotecatl You can increase the temperature a bit and decrease --top_k. Also, I would switch to llama3-instruct; llama2 is obsolete. You can run it with something like this:
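For example (the model file name and the sampling values here are just placeholders, adjust them for your setup):
./llama.cpp/llama-cli --model ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --n-gpu-layers 33 -cnv --chat-template llama3 --ctx_size 8000 --temp 0.9 --top_k 20 -p 'You are a helpful assistant.'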
-
@Zapotecatl Just in case you want to try llama3.1, you can run it like so; it works well now:
./llama.cpp/llama-cli --model ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --n-gpu-layers 33 -cnv --simple-io -b 2048 --ctx_size 8000 --temp 0.3 -fa -t 6 --top_k 10 --multiline-input --chat-template llama3 -p 'Role and Purpose: You are Alice, a large language model. Your purpose is to assist users by providing information, answering questions, and engaging in meaningful conversations based on the data you were trained on'
Make sure to set the size of the context window appropriately for your hardware: you will need about 24 GB of VRAM to run it with --ctx_size 128000. In the example above I used --ctx_size 8000.