llama : support qwen3 rerank and embeddings #14029
base: master
Changes shown from 1 commit (PR commits: 3f3b9a2, f8fd440, e0eb4b8, 030dc3b, f8facb3, 0777cd3, 8edd2cf, c02f53d, c2f4dc7, cbb6f20)
@@ -1577,10 +1577,15 @@ void llm_graph_context::build_pooling(
                     cur = ggml_add(ctx0, ggml_mul_mat(ctx0, cls_out, cur), cls_out_b);
                 }
             } else if (cls_out) {
-                // Single layer classification head (direct projection)
-                // https://github.com/huggingface/transformers/blob/f4fc42216cd56ab6b68270bf80d811614d8d59e4/src/transformers/models/bert/modeling_bert.py#L1476
-                GGML_ASSERT(cls_out_b != nullptr);
-                cur = ggml_add(ctx0, ggml_mul_mat(ctx0, cls_out, inp), cls_out_b);
+                if (arch == LLM_ARCH_QWEN3) {
+                    cur = ggml_mul_mat(ctx0, cls_out, inp);
+                    cur = ggml_soft_max(ctx0, cur); // qwen3 uses softmax on the output
Comment:
@ggerganov I think there is a bug with …

Comment:
I think you can make a quick fix for now like this, similar to …:

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index afef84870..8b11197df 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -7043,7 +7043,7 @@ struct llm_build_qwen3 : public llm_graph_context {
                         Qcur, Kcur, Vcur, nullptr, nullptr, 1.0f/sqrtf(float(n_embd_head)), il);
             }
-            if (il == n_layer - 1) {
+            if (il == n_layer - 1 && pooling_type == LLAMA_POOLING_TYPE_NONE) {
                 // skip computing output for unused tokens
                 ggml_tensor * inp_out_ids = build_inp_out_ids();
                 cur = ggml_get_rows(ctx0, cur, inp_out_ids);

Comment:
No, that still doesn't work as I expected. For example, if my sequence has only one output token, then I expect the … Maybe I misunderstood something here?

Comment:
Hmm, ok I think I got it. The main problem is that qwen's rerank model uses causal attention; it's simply a normal next-token generation model which outputs either a "yes" or a "no" token. I think the assumption in llama.cpp is that CLS and RANK are non-causal, hence only the first token is marked as output. Not sure what's the best way to support this, though.

Comment:
OK, found a hack around this: for Qwen3, I force the position to last (only the position, not the pooling) in 030dc3b. Probably we should separate the notion of "pooling" and "output position" in the future.
Comment:
The idea is that the … For Qwen3 rerank, what you seem to need is to pool using … And it seems we should remove …

Comment:
Hmm, ok I got it. The problem is that … I don't have much time for the rest of the day. Do you think we can clean this up in a follow-up PR? I think having the notion of …

Comment:
Think it's better to take the time and make it right, no need to merge it now.
+                } else {
+                    // Single layer classification head (direct projection)
+                    // https://github.com/huggingface/transformers/blob/f4fc42216cd56ab6b68270bf80d811614d8d59e4/src/transformers/models/bert/modeling_bert.py#L1476
+                    GGML_ASSERT(cls_out_b != nullptr);
+                    cur = ggml_add(ctx0, ggml_mul_mat(ctx0, cls_out, inp), cls_out_b);
+                }
             } else {
                 GGML_ABORT("RANK pooling requires either cls+cls_b or cls_out+cls_out_b");
             }
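To see what the new Qwen3 branch computes in isolation, here is a self-contained ggml sketch. It only mirrors the two added calls (projection by cls_out, then softmax), not the actual build_pooling graph; the two-class head shape, the tiny sizes and values, and the use of the CPU-only helper from ggml-cpu.h are assumptions for the sketch:

// project a pooled embedding with a 2-class head, then softmax over the classes
#include "ggml.h"
#include "ggml-cpu.h" // assumed location of ggml_graph_compute_with_ctx in recent ggml
#include <cstdio>

int main() {
    struct ggml_init_params params = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ false };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_embd = 4; // hypothetical tiny embedding size
    // cls_out: [n_embd, 2] classification head; inp: [n_embd] pooled embedding
    struct ggml_tensor * cls_out = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, 2);
    struct ggml_tensor * inp     = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd);

    // placeholder weights and activations
    float * w = (float *) cls_out->data;
    float * x = (float *) inp->data;
    for (int i = 0; i < 2*n_embd; ++i) w[i] = (i % 2 == 0) ? 0.1f : -0.1f;
    for (int i = 0; i <   n_embd; ++i) x[i] = 1.0f;

    // the two ops the PR adds to the Qwen3 RANK path
    struct ggml_tensor * cur = ggml_mul_mat(ctx, cls_out, inp); // -> 2 logits
    cur = ggml_soft_max(ctx, cur);                              // -> class probabilities

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cur);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    const float * p = (const float *) cur->data;
    printf("class probabilities: %.4f %.4f\n", p[0], p[1]);

    ggml_free(ctx);
    return 0;
}

Whether the score should instead be read from the causal model's "yes"/"no" token logits, rather than from a separate classification head, is exactly what the review thread above is debating.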