Taking in large prompts (5000 characters and up) #3704
Replies: 3 comments 10 replies
-
I regularly use large prompts, like 10,000 characters. You should describe the issues that you are encountering in more detail.
6 replies
-
I will try that!
Will setting the batch size to 512 still work through all of the tokens? I am a little confused about how that works under the hood!
On Oct 20, 2023, shibe2 wrote:
Try batch size 512. When you get it working, experiment with different batch sizes.
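For context on the under-the-hood question: the batch size does not skip any tokens. The whole prompt is tokenized once, and the tokens are then evaluated n_batch at a time, so a batch size of 512 simply processes the prompt in 512-token chunks. Below is a minimal sketch of that loop, assuming an already-initialized context and the llama_decode / llama_batch_get_one API from late 2023 (older trees use llama_eval instead); exact signatures may differ in your version.

```cpp
// Sketch: evaluate an already-tokenized prompt in chunks of n_batch tokens.
// Assumes `ctx` is an initialized llama_context and `tokens` holds the full prompt.
// Uses the late-2023 llama_decode / llama_batch_get_one API; older versions use llama_eval.
#include <algorithm>
#include <cstdio>
#include <vector>
#include "llama.h"

static bool eval_prompt(llama_context * ctx, std::vector<llama_token> & tokens, int n_batch) {
    for (int i = 0; i < (int) tokens.size(); i += n_batch) {
        // Every token is still evaluated; n_batch only controls how many go per call.
        const int n_eval = std::min(n_batch, (int) tokens.size() - i);
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval, /*pos_0=*/i, /*seq_id=*/0);
        if (llama_decode(ctx, batch) != 0) {
            fprintf(stderr, "llama_decode failed at token %d\n", i);
            return false;
        }
    }
    return true; // logits for the last prompt token are now available for sampling
}
```

Note that n_batch only affects how much work is done per call; the context size (n_ctx) must still be large enough to hold the entire prompt.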
2 replies
-
I partially solved this issue by reimplementing the main.cpp example as a Swift implementation. This handles prompts of up to 4096 tokens; anything past that context limit causes the LLM to fail and return garbage data. But for the purposes of this library it has been solved.
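The hard stop at 4096 tokens is the context window (n_ctx) rather than anything batching-related: past it the model is asked to attend over positions it was not trained for, and the output degrades into garbage. A small guard before decoding can catch this early. This is a sketch that builds on the loop above and assumes the same llama.cpp C API (llama_n_ctx is part of it); the handling options are left as comments.

```cpp
// Sketch: check the prompt against the context window before decoding,
// since tokens past n_ctx are what produce the garbage output described above.
// Assumes `ctx` and `tokens` as in the earlier sketch.
#include <cstdio>
#include <vector>
#include "llama.h"

static bool prompt_fits(llama_context * ctx, const std::vector<llama_token> & tokens) {
    const int n_ctx = (int) llama_n_ctx(ctx);
    if ((int) tokens.size() > n_ctx) {
        // Options: raise n_ctx (with RoPE scaling for models trained on less),
        // truncate the prompt, or summarize/split it at a higher level.
        fprintf(stderr, "prompt has %zu tokens but n_ctx is only %d\n", tokens.size(), n_ctx);
        return false;
    }
    return true;
}
```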
2 replies
-
I am working on a project that requires running text generation on prompts of over 5000 characters. Currently, loading them into a single tokenization call results in the failure described in #1881. I believe this is because I am attempting to load significantly more tokens than are supposed to be loaded at once.
I have looked over the examples provided, but I have not found any that load really large prompts. I have done a little research into loading large prompts, but I am admittedly very new to this field and was wondering if I could get some guidance specific to this project.
Through my research I have found techniques like chunking my input into smaller fragments; however, I am not sure how I would implement this using the llama.cpp API. I already have the actual string fragmentation completed, but my question lies in how the fragments would be sent to the model. Is this the right step forward, or are there other resources / techniques that I should explore?
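One note on the fragmentation question: you generally do not need to split the string at all. The usual pattern is to tokenize the whole prompt once and then feed the resulting tokens to the model in n_batch-sized chunks, as in the batching sketch earlier in this thread. Below is a hedged sketch of the tokenization step using the raw C API; depending on your llama.cpp version, llama_tokenize may take a trailing `special` flag, so check llama.h in your checkout.

```cpp
// Sketch: tokenize the whole prompt once instead of fragmenting the string.
// The exact llama_tokenize signature varies between llama.cpp versions; verify against llama.h.
#include <algorithm>
#include <string>
#include <vector>
#include "llama.h"

static std::vector<llama_token> tokenize_prompt(const llama_model * model, const std::string & prompt) {
    // Generous upper bound: a token never encodes less than one byte, plus room for BOS.
    std::vector<llama_token> tokens(prompt.size() + 8);
    int n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                           tokens.data(), (int) tokens.size(), /*add_bos=*/true);
    if (n < 0) {
        // A negative return means the buffer was too small; -n is the required size.
        tokens.resize(-n);
        n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                           tokens.data(), (int) tokens.size(), /*add_bos=*/true);
    }
    tokens.resize(std::max(n, 0));
    return tokens; // feed these to llama_decode in n_batch-sized chunks
}
```

From there, the batching loop sketched earlier handles the actual evaluation.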