First working PoC for bge-m3 sparse embeddings #14526
base: main
Conversation
Here I'm loading the extra `sparse_linear.pt` file using the secondary_weights loading introduced in the ultravox model when I detect that the model name is `BAAI/bge-m3`. It's a bit ugly, but I don't know if there is a more generic way to do this.

Currently, since the only permissible pooling return type is torch.Tensor, I'm just returning the token weights tensor directly. If the user wants to match tokens to the weights, they have to call `tokenize` and remove the bos and eos tokens; the indices of both vectors should then match.

To request sparse vectors the user has to pass `"additional_data": {"sparse_embeddings": true}` in the request. This means that all sequences in that request will be treated as sparse. If the user wants to mix embedding types, separate calls have to be made for each type. The FlagEmbedding API allows returning more than one type of embedding at the same time, but currently, due to the limitation of the pooling return type, we can only return a single tensor per sequence.

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
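For illustration, a minimal client-side sketch of this workflow might look as follows, assuming an OpenAI-compatible vLLM server with this PoC loaded; the `/v1/embeddings` and `/tokenize` routes are the standard ones, while the top-level placement of `additional_data` and the example prompt are assumptions:

```python
# Minimal client-side sketch of the workflow described above (not part of
# this PR's diff). Assumes a vLLM OpenAI-compatible server on localhost:8000;
# the exact placement of "additional_data" in the payload is an assumption
# based on the description.
import requests

BASE_URL = "http://localhost:8000"
text = "What is BGE M3?"

# Request sparse embeddings; every sequence in this request is treated as sparse.
resp = requests.post(
    f"{BASE_URL}/v1/embeddings",
    json={
        "model": "BAAI/bge-m3",
        "input": [text],
        "additional_data": {"sparse_embeddings": True},
    },
)
token_weights = resp.json()["data"][0]["embedding"]  # one weight per token

# Tokenize separately and strip bos/eos so the indices line up with the weights.
token_ids = requests.post(
    f"{BASE_URL}/tokenize",
    json={"model": "BAAI/bge-m3", "prompt": text},
).json()["tokens"][1:-1]

for token_id, weight in zip(token_ids, token_weights):
    print(token_id, weight)
```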
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To support sparse+dense together, we need to actually implement #12249. I still don't have time to implement this, though.
This is cleaner and can be activated by the user by setting `--hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'`.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
I've changed the implementation so that now the user has to add `--hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'` to activate it.
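As a rough sketch (not code from this PR), the same override could presumably also be passed through the offline engine arguments; the `task` value and the `embed()` call are assumptions about the surrounding vLLM API:

```python
# Hypothetical offline activation of the BGE-M3 sparse-embedding model class,
# mirroring the --hf-overrides CLI flag mentioned above. The task name and
# the embed() call are assumptions, not part of this PR.
from vllm import LLM

llm = LLM(
    model="BAAI/bge-m3",
    task="embed",
    hf_overrides={"architectures": ["BgeM3EmbeddingModel"]},
)

# Dense embeddings work as usual; sparse token weights are requested per call
# (see the client sketch earlier in this thread for the server-side flow).
outputs = llm.embed(["What is BGE M3?"])
print(outputs[0].outputs.embedding[:8])
```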
This is great, looking forward to the launch of this feature. How long will it take for it to be available?
+1, waiting for this feature.
This pull request has merge conflicts that must be resolved before it can be merged.
+1 |
Any update?
The V1 embedding PR is already approved but is now blocked by other unrelated test failures: #16188. The next step will be to add support for encoder models, as they were left out of the embedding model PR to keep it simpler.
So this still isn't supported yet?
FIX #13609
FIX #15384
FIX #18469
To show that this PoC already returns the correct results, the token weights produced by the FlagEmbedding reference implementation can be compared with those returned by vLLM; the two outputs agree.
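As a hedged sketch of such a comparison, the reference side would use the public FlagEmbedding API roughly as follows; the prompt and settings are assumptions, and the resulting `lexical_weights` would be compared against the per-token weights returned by vLLM after stripping bos/eos:

```python
# Hedged sketch of the FlagEmbedding reference side of the comparison.
# BGEM3FlagModel and its encode() flags are the public FlagEmbedding API;
# the prompt is an assumption.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(
    ["What is BGE M3?"],
    return_dense=False,
    return_sparse=True,
)
# lexical_weights holds one {token_id: weight} mapping per input text; these
# values should match the token weights returned by vLLM for the same input.
print(out["lexical_weights"][0])
```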