feat: Add support for SparseEncoder and sparse embedding models in Sentence Transformers #9588

Open, wants to merge 2 commits into base: main

Ryzhtus
Contributor

@Ryzhtus Ryzhtus commented Jul 3, 2025

Related Issues

Sentence Transformers introduced support for sparse embedding models via the SparseEncoder class in v5.0.0. I thought it would be cool to support these in Haystack as well, since sparse models were previously available only through the FastEmbed integration (e.g. FastembedSparseTextEmbedder).

Proposed Changes:

Introduced two new embedder classes, plus a backend class that supports them:

  • SentenceTransformersSparseTextEmbedder
  • SentenceTransformersSparseDocumentEmbedder
  • SentenceTransformersSparseEncoderEmbeddingBackend
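For context on what these embedders would hand back, here is a minimal sketch of the sparse representation itself. It assumes the indices/values layout used by Haystack's SparseEmbedding dataclass; the `SparseEmbedding` class below is a local stand-in so the snippet runs without Haystack installed, and `to_sparse_embedding` is a hypothetical helper, not code from this PR. Whatever tensor type the model returns would first need converting to a plain list of floats.

```python
from dataclasses import dataclass


@dataclass
class SparseEmbedding:
    """Local stand-in mirroring the indices/values layout (assumption, not the Haystack class)."""

    indices: list[int]
    values: list[float]


def to_sparse_embedding(vector: list[float]) -> SparseEmbedding:
    """Hypothetical helper: keep only the non-zero positions of a vocabulary-sized vector."""
    indices = [i for i, v in enumerate(vector) if v != 0.0]
    values = [vector[i] for i in indices]
    return SparseEmbedding(indices=indices, values=values)


# Toy 8-dimensional vector; a real sparse model emits one weight per vocabulary token.
toy_vector = [0.0, 1.2, 0.0, 0.0, 0.7, 0.0, 0.0, 0.3]
sparse = to_sparse_embedding(toy_vector)
print(sparse)
```

Only the positions with non-zero weight are stored, which is what keeps vocabulary-sized vectors cheap to pass around and index.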

How did you test it?

I added unit tests for both embedders.

Notes for the reviewer

Some tests are currently failing; I'd appreciate your support in resolving them.
We'll likely need to add documentation as well.

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@Ryzhtus Ryzhtus requested a review from a team as a code owner July 3, 2025 17:52
@Ryzhtus Ryzhtus requested review from vblagoje and removed request for a team July 3, 2025 17:52
@anakin87 anakin87 self-requested a review July 4, 2025 05:55
@anakin87
Member

anakin87 commented Jul 4, 2025

Hello and thanks for this idea!

I think it's a big topic and will probably require some work.

Some high-level notes:

  1. I would create a completely separate _SentenceTransformersSparseEmbeddingBackendFactory as we do for FastEmbed.
  2. Let's try to fit the returned sparse embedding into the existing Haystack SparseEmbedding dataclass.
  3. Let's add tests for the backend and integration tests for the two embedders.
  4. If you share a script or a raw Colab notebook with an end-to-end example, it would help with validating and reviewing the implementation.
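As an illustration of point 1, a separate backend factory could cache one backend per model and device so that several embedder components reuse the same loaded model. This is only a sketch of the general pattern: `_ToySparseBackend`, `_ToySparseBackendFactory`, and the cache key format are hypothetical names chosen for this example, and the toy backend stores its arguments instead of loading a real model.

```python
from typing import Dict, Optional


class _ToySparseBackend:
    """Hypothetical stand-in for a backend that would wrap a loaded sparse model."""

    def __init__(self, model: str, device: Optional[str] = None):
        self.model = model
        self.device = device


class _ToySparseBackendFactory:
    """Hypothetical factory that returns one shared backend per (model, device) pair."""

    _instances: Dict[str, _ToySparseBackend] = {}

    @staticmethod
    def get_embedding_backend(model: str, device: Optional[str] = None) -> _ToySparseBackend:
        # Build a cache key from the arguments that determine the loaded model.
        key = f"{model}|{device}"
        if key not in _ToySparseBackendFactory._instances:
            _ToySparseBackendFactory._instances[key] = _ToySparseBackend(model, device)
        return _ToySparseBackendFactory._instances[key]


first = _ToySparseBackendFactory.get_embedding_backend("some-sparse-model", device="cpu")
second = _ToySparseBackendFactory.get_embedding_backend("some-sparse-model", device="cpu")
other = _ToySparseBackendFactory.get_embedding_backend("some-sparse-model", device="cuda")
print(first is second, first is other)
```

Keeping this factory separate from the dense one means the sparse backend can change its constructor arguments without touching the existing dense embedders.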
