How can we tell a decoder-only model about a "token budget" when running the inference loop, so that it doesn't just stop mid-way once the limit is reached but somehow "plans ahead" to fit the response into the budget? Thanks for any tips.

Replies: 1 comment 1 reply

@vladfaust What type of decoder-only model are you working with? For general use you can do something like this:
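A minimal sketch, assuming a Hugging Face transformers causal LM (the model name, prompt, and budget value are placeholders): `max_new_tokens` enforces a hard cap, while the "plan ahead" part can only be encouraged by stating the budget in the prompt, since a decoder-only model generates one token at a time and has no built-in notion of a remaining budget.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute whatever decoder-only model you use
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

budget = 64  # token budget for the generated answer

# Soft constraint: state the budget in the prompt so the model can aim for a short answer.
prompt = (
    f"Answer in fewer than {budget} tokens.\n"
    "Question: What is a decoder-only model?\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")

# Hard constraint: generation stops once the budget is exhausted, even mid-sentence.
outputs = model.generate(
    **inputs,
    max_new_tokens=budget,
    pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token
)

answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],  # keep only the newly generated tokens
    skip_special_tokens=True,
)
print(answer)
```

If a mid-sentence cutoff is unacceptable, common workarounds are to phrase the budget in words or sentences (which models tend to track more reliably than tokens), or to generate with some headroom and truncate at the last sentence boundary that fits the budget.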