
First working PoC for bge-m3 sparse embeddings #14526

Draft

maxdebayser wants to merge 4 commits into main

Conversation

maxdebayser
Contributor

@maxdebayser commented Mar 9, 2025

FIX #13609
FIX #15384
FIX #18469

Here I'm loading the extra sparse_linear.pt file using the secondary_weights loading mechanism introduced with the ultravox model, triggered when the model name is detected as BAAI/bge-m3. It's a bit ugly, but I don't know whether there is a more generic way to do this.
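
For context, here is a rough sketch of what the extra weights contain, assuming the usual BGE-M3 layout where sparse_linear.pt holds the state dict of a Linear(hidden_size, 1) head whose ReLU output gives the per-token weights. This is illustrative only; the PR wires the file into vLLM's secondary_weights loading rather than loading it ad hoc like this:

import torch
from huggingface_hub import hf_hub_download

# Illustrative only: fetch the extra sparse head shipped alongside BAAI/bge-m3.
path = hf_hub_download("BAAI/bge-m3", "sparse_linear.pt")
state_dict = torch.load(path, map_location="cpu")

# Assumption: the file is the state dict of a Linear(hidden_size=1024, 1) layer.
sparse_linear = torch.nn.Linear(in_features=1024, out_features=1)
sparse_linear.load_state_dict(state_dict)

# Given per-token hidden states from the backbone, the sparse token weights
# are a ReLU over this linear projection.
hidden_states = torch.randn(7, 1024)  # (num_tokens, hidden_size), dummy values
token_weights = torch.relu(sparse_linear(hidden_states)).squeeze(-1)
print(token_weights.shape)  # torch.Size([7])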

Currently, since the only permissible pooling return type is torch.Tensor, I'm just returning the token weights tensor directly. If the user wants to match tokens to the weights, they have to call tokenize and remove the BOS and EOS tokens; after that, the indices of both vectors match.
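
A small sketch of that alignment, assuming the standard BAAI/bge-m3 tokenizer from transformers and using (rounded) weights taken from the vLLM response further down:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

text = "What is BGE M3?"
# Rounded per-token weights as returned by the server for this sentence.
weights = [0.0836, 0.0815, 0.1295, 0.2517, 0.1700, 0.2698, 0.0409]

ids = tokenizer(text)["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)[1:-1]  # drop BOS and EOS

# After dropping the special tokens, the indices line up with the weights.
for token, weight in zip(tokens, weights):
    print(token, weight)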

To request sparse vectors, the user has to pass
"additional_data": {"sparse_embeddings": true} in the request. This means that all sequences in that request will be treated as sparse. If the user wants to mix embedding types, separate calls have to be made for each type.

The FlagEmbedding API allows returning more than one type of embedding at the same time, but currently, due to the limitation of the pooling return type, we can only return a single tensor per sequence.

To show that this PoC is already returning the correct results, consider the code below:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
print(model.convert_id_to_token(output_1['lexical_weights']))

This code prints

[{'What': 0.08344, 'is': 0.08136, 'B': 0.1295, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04086}, {'De': 0.05023, 'fin': 0.1368, 'ation': 0.0452, 'of': 0.0635, 'BM': 0.2515, '25': 0.3337}]

With vLLM we get the following:

$ curl -s http://localhost:8000/v1/embeddings    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "input": ["What is BGE M3?", "Defination of BM25"],
     "additional_data": {"sparse_embeddings": true}
}' | jq
{
  "id": "embd-38ce076880b94d41b206ae99caae7b19",
  "object": "list",
  "created": 1741555561,
  "model": "BAAI/bge-m3",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0836181640625,
        0.08148193359375,
        0.1295166015625,
        0.251708984375,
        0.1700439453125,
        0.269775390625,
        0.040924072265625
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        0.050201416015625,
        0.136962890625,
        0.04510498046875,
        0.0633544921875,
        0.25146484375,
        0.333740234375
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 17,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}


github-actions bot commented Mar 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337
Member

DarkLight1337 commented Mar 10, 2025

To support sparse+dense together, we need to actually implement #12249. I still don't have time to implement this though.

This is cleaner and can be activated by the user by setting
`--hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'`

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@maxdebayser
Contributor Author

I've changed the implementation so that now the user has to add --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}' to the command line to activate this mode. But I agree that we need to implement #12249 to properly support this and other models like ibm-granite/granite-embedding-30m-sparse. Let's keep this PR in draft state for now.
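
For reference, a minimal sketch of what the equivalent override could look like through the offline vllm.LLM entrypoint, which accepts an hf_overrides argument. This is an assumption for illustration; the sparse output in this PoC is requested through the server's additional_data field, and whether it is reachable offline is not shown here:

from vllm import LLM

# Assumption: the same architecture override used with the server's
# --hf-overrides flag can be passed to the offline entrypoint.
llm = LLM(
    model="BAAI/bge-m3",
    task="embed",
    hf_overrides={"architectures": ["BgeM3EmbeddingModel"]},
)

outputs = llm.embed(["What is BGE M3?", "Defination of BM25"])
for output in outputs:
    print(len(output.outputs.embedding))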

@243006306

This is great, and I'm looking forward to the launch of this feature. How long will it take for it to become available?

@IllyaPysarchuk

+1, waiting for this feature.


mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Apr 1, 2025
@arjunasuresh300

+1

@Sam120204

any update?

@maxdebayser
Contributor Author

The V1 embedding PR is already approved but is now blocked by other unrelated test failures: #16188. The next step will be to add support for encoder models, as they were left out of the embedding model PR to keep it simpler.

@fufenghua

So this still isn't supported yet?

@mergify bot added the new-model (Requests to new models) label Jul 11, 2025
Labels: needs-rebase, new-model (Requests to new models)
Projects: None yet
7 participants