
CUDA not utilized for token generation but only for prompt processing #3027

Answered by KerfuffleV2
emadeck asked this question in Q&A

> Sorry, my fault.

Not a problem.

> When the prompt evaluation starts I see the GPU working, but when it ends and token generation starts I see the GPU at 0%, with small spikes to 1 or 2% now and then, and only the CPU active.

I don't think you did anything wrong here. For token generation, layers have to be evaluated sequentially, so however many layers are on the CPU get evaluated first, then the ones on the GPU. With 22 layers offloaded, the GPU will be sitting idle for around 3/4 of the time, with brief spikes of activity. Depending on the tool, if it just checks the GPU usage periodically, it's pretty likely to miss those brief periods of activity.
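To make the arithmetic concrete, here is a rough sketch of that reasoning in Python. Everything numeric in it is an assumption for illustration only: the 18 CPU layers, the per-layer timings, and the function names `gpu_busy_fraction` and `snapshot_hits` are hypothetical, not measurements or anything from the actual code. It also assumes a monitor that takes instantaneous snapshots; real tools may average over a window instead.

```python
def gpu_busy_fraction(n_cpu_layers, n_gpu_layers, cpu_ms_per_layer, gpu_ms_per_layer):
    """Fraction of per-token wall time the GPU is working, assuming layers
    run strictly one after another with no CPU/GPU overlap."""
    cpu_ms = n_cpu_layers * cpu_ms_per_layer
    gpu_ms = n_gpu_layers * gpu_ms_per_layer
    return gpu_ms / (cpu_ms + gpu_ms)


def snapshot_hits(cpu_ms, gpu_ms, sample_period_ms, n_samples):
    """Count instantaneous snapshots that land while the GPU is busy."""
    token_ms = cpu_ms + gpu_ms
    hits = 0
    for k in range(n_samples):
        t = (k * sample_period_ms) % token_ms  # position within the current token
        if t >= cpu_ms:  # GPU layers run after the CPU layers in this model
            hits += 1
    return hits


# Hypothetical numbers: 22 layers on the GPU, 18 left on the CPU,
# and a GPU layer assumed to be 4x faster than a CPU layer.
frac = gpu_busy_fraction(n_cpu_layers=18, n_gpu_layers=22,
                         cpu_ms_per_layer=4.0, gpu_ms_per_layer=1.0)
print(f"GPU busy for about {frac:.0%} of each token")

# How many of 20 once-per-second snapshots catch the GPU working?
hits = snapshot_hits(cpu_ms=72, gpu_ms=22, sample_period_ms=1000, n_samples=20)
print(f"{hits} of 20 snapshots saw GPU activity")
```

With these made-up timings the GPU is busy for roughly a quarter of each token, which is in line with the "idle for around 3/4 of the time" estimate. The faster the GPU is relative to the CPU, the smaller that busy fraction gets.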

Particularly when running on the…

Replies: 2 comments 4 replies

Answer selected by emadeck