n_tokens <= n_batch, beta, conversation history #176
-
Todo: explore, or integrate with: "If your application is GPL 3.0 compliant, feel free to take inspiration from how that can go here: https://github.com/nathanlesage/local-chat"
-
@giladgd how does node-llama-cpp manage a long conversation history if it is longer than the model's context (with v3/beta)?
-
Thanks @giladgd for your example. Calling session.dispose() gave me DisposedError: Object is disposed, so I had to create a new context instead. Here is what works for me with gpu: false.
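A minimal sketch of such a setup, assuming the v3 beta API (getLlama, loadModel, createContext, LlamaChatSession) and a hypothetical model path:

```ts
import path from "path";
import {fileURLToPath} from "url";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// force the CPU backend, avoiding the Vulkan/radv path entirely
const llama = await getLlama({gpu: false});

const model = await llama.loadModel({
    // hypothetical model path
    modelPath: path.join(__dirname, "models", "model.gguf")
});

// one fresh context per conversation; dispose it when the conversation ends,
// but don't keep using a session after its context is disposed (DisposedError)
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const answer = await session.prompt("Hi there");
console.log(answer);

await context.dispose();
```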
I think it would be a good idea to be OpenAI API compatible, using role: system, role: user, role: assistant, and a content field for each message.
Translation from one template to another was done with TemplateChatWrapper in v2, but I don't know if that's possible in v3? A mapping sketch follows below.
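A hedged sketch of such a mapping, assuming the v3 beta exports a ChatHistoryItem type (with type: "system" | "user" | "model", and the model response as an array) and that LlamaChatSession accepts it via setChatHistory; the helper name is made up:

```ts
import type {ChatHistoryItem} from "node-llama-cpp";

// OpenAI-style message, as in the suggestion above
type OpenAiMessage = {
    role: "system" | "user" | "assistant",
    content: string
};

// hypothetical helper: map OpenAI roles onto the assumed v3 history shape
function toChatHistory(messages: OpenAiMessage[]): ChatHistoryItem[] {
    return messages.map((message) => {
        if (message.role === "system")
            return {type: "system", text: message.content};
        else if (message.role === "user")
            return {type: "user", text: message.content};

        // "assistant" maps to "model"; response is an array in v3
        return {type: "model", response: [message.content]};
    });
}

// usage sketch, given a session created as in the snippet above:
// session.setChatHistory(toChatHistory(messages));
```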
-
And why is the response an array containing only one text?
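For reference, a sketch of the shape in question, assuming the v3 beta history types: a plain reply fills the array with a single string, but the array form appears to leave room for multiple response segments (e.g. text interleaved with function calls):

```ts
// assumed shape of a v3 beta chat history
const history = [
    {type: "system", text: "You are a helpful assistant."},
    {type: "user", text: "Hi"},
    // a simple reply is a single-element array; more elements would be
    // needed when a response mixes text segments with function calls
    {type: "model", response: ["Hello! How can I help you?"]}
];
```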
-
Transferred from #105 (comment), for the n_tokens <= n_batch issue of @scenaristeur.
I have tried to migrate to "node-llama-cpp": "^3.0.0-beta.13", but now I get a crash on my IdeaPad laptop (https://www.google.com/search?client=firefox-b-lm&q=ideapad+3+15alc6) (no GPU, AMD Ryzen 5000, 16-core CPU / 16 GB RAM).
It worked like a charm with "node-llama-cpp": "^2.8.8" (I had no memory issues apart from n_tokens <= n_batch with a long conversationHistory), but now it crashes even with a small conversationHistory, with "radv/amdgpu: Not enough memory for command submission."
With this usage: https://github.com/scenaristeur/igora/blob/node_llama_cpp_v3_beta/src/mcConnector/index.js
With v2.8.8 I had https://github.com/scenaristeur/igora/blob/3342a1a48172eae1d31489e33a64fe025e1cb522/src/mcConnector/index.js
and it works until token.length is about 300 (328 is OK, 536 fails).
With more tokens, I get n_tokens <= n_batch.
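For the n_tokens <= n_batch error itself, one hedged workaround sketch, assuming batchSize is accepted at context creation (it was an option on v2's LlamaContext): the batch must be at least as large as the biggest chunk of tokens evaluated at once, so matching it to the context size should avoid the assert.

```ts
// sketch: make the batch as large as the context so a long history
// can be evaluated in one go; batchSize as a createContext option is
// an assumption carried over from the v2 LlamaContext options
const context = await model.createContext({
    contextSize: 4096,
    batchSize: 4096
});
```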
It's a 16-core CPU-only machine, no GPU; I'll try getLlama with gpu: false. Perhaps I installed some Vulkan tools while trying out some LLMs, but this machine is CPU only.
Thanks.
It works with gpu: false, but I've lost the conversationHistory. How should conversation history be handled in the beta version? I'm working on a server where there can be multiple sessions, each with its own history. In what format should the history be injected into a session, and into which class?
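A hedged sketch of one way that could look in the v3 beta, assuming LlamaChatSession exposes getChatHistory()/setChatHistory(); the in-memory store, the function name, and the model path are made up for illustration:

```ts
import {getLlama, LlamaChatSession, type ChatHistoryItem} from "node-llama-cpp";

// hypothetical in-memory store: one saved history per server-side session id
const histories = new Map<string, ChatHistoryItem[]>();

const llama = await getLlama({gpu: false});
const model = await llama.loadModel({modelPath: "model.gguf"}); // hypothetical path

// hypothetical server handler: restore a user's history, prompt, save it back
async function promptForUser(sessionId: string, text: string) {
    // a fresh context + session per request
    const context = await model.createContext();
    const session = new LlamaChatSession({
        contextSequence: context.getSequence()
    });

    const saved = histories.get(sessionId);
    if (saved != null)
        session.setChatHistory(saved);

    const answer = await session.prompt(text);

    // persist the updated history, then free the context
    histories.set(sessionId, session.getChatHistory());
    await context.dispose();

    return answer;
}
```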