Error in embedding document using BM25EmbeddingFunction #40954

shalini0311 · 2025-03-27T08:44:32Z

shalini0311
Mar 27, 2025

Hello team, I implemented hybrid search and it was working fine until recently, but now it throws an error.
After some digging, it looks like the sparse embedding implementation is what raises the error.
Any help resolving this issue is much appreciated!

Code throwing error:

from langchain_milvus.utils.sparse import BM25SparseEmbedding
from typing import List

from milvus_model.sparse import BM25EmbeddingFunction


class MyBM25SparseEmbedding(BM25SparseEmbedding):
    def __init__(self, corpus: List[str] = None):
        if corpus is not None:
            super().__init__(corpus=corpus)

    def load(self, save_path: str):
        self.bm25_ef = BM25EmbeddingFunction()
        self.bm25_ef.load(save_path)

    def save(self, save_path: str):
        self.bm25_ef.save(save_path)


bm25_store_file = 'bm25.json'  # Name of the file to save the sparse embeddings

docs = ["hello world", "welcome to my world"]

sparse_embedding_func = MyBM25SparseEmbedding(docs)
sparse_embedding_func.save(bm25_store_file)
print(sparse_embedding_func.bm25_ef.idf)

print(sparse_embedding_func.embed_documents(docs)) # Error here

Error message:

ValueError: not enough values to unpack (expected 2, got 1)


yhmo · 2025-03-27T09:26:22Z

yhmo
Mar 27, 2025
Collaborator

I tested with your script and didn't get the error. My pymilvus_model version is 0.3.0.

0 replies

shalini0311 · 2025-03-27T09:53:41Z

shalini0311
Mar 27, 2025
Author

Mine is version 0.3.1. I downgraded it to 0.3.0 and I'm still getting the same error.

May I know the versions of the other relevant libraries that work for you?

1 reply

yhmo Mar 27, 2025
Collaborator

Do you have the call stack of the error? Looks like the error is thrown from pandas.
My pandas version is 2.2.2

shalini0311 · 2025-03-27T10:24:04Z

shalini0311
Mar 27, 2025
Author

I'm not using pandas in this script.

This is the error stack:

Traceback (most recent call last):
  File "/Users/test.py", line 28, in <module>
    print(sparse_embedding_func.embed_documents(docs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/test_venv/lib/python3.12/site-packages/langchain_milvus/utils/sparse.py", line 45, in embed_documents
    return [self._sparse_to_dict(sparse_array) for sparse_array in sparse_arrays]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/test_venv/lib/python3.12/site-packages/langchain_milvus/utils/sparse.py", line 48, in _sparse_to_dict
    row_indices, col_indices = sparse_array.nonzero()
    ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)
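
The last frame shows that the failure is a plain tuple-unpacking error: `sparse_array.nonzero()` returned one index array where the code expected two (rows and columns). The sketch below reproduces only that unpacking step with ordinary tuples. `unpack_nonzero` is a hypothetical stand-in for illustration, not the langchain-milvus code, and the reading that the row object had become one-dimensional is an assumption.

```python
def unpack_nonzero(nonzero_result):
    """Mimic the unpacking applied to the result of nonzero()."""
    row_indices, col_indices = nonzero_result
    return row_indices, col_indices


# Two index arrays (rows, cols): the shape the failing line expects.
two_index_result = ([0, 0], [0, 1])
print(unpack_nonzero(two_index_result))

# One index array: what a one-dimensional object would hand back (assumption).
one_index_result = ([0, 1],)
try:
    unpack_nonzero(one_index_result)
except ValueError as exc:
    message = str(exc)
    print(message)
```

If the message printed here matches the one in the traceback, the thing to investigate is why each row yields a single index array rather than a (rows, cols) pair.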

9 replies

xiaofan-luan Mar 29, 2025
Maintainer

Which langchain-milvus version are you using?

If you use Milvus 2.5.7 and langchain-milvus v0.1.8, BM25 embedding generation is done on the server side, so it no longer has to be computed on the client.

yhmo Mar 31, 2025
Collaborator

Name: langchain-milvus
Version: 0.1.4

Name: pymilvus.model
Version: 0.3.0

Name: milvus-model
Version: 0.2.12

shalini0311 Mar 31, 2025
Author

Hi @yhmo, @xiaofan-luan, I'm using the following versions:

pymilvus: 2.5.6
pymilvus.model: 0.3.0
langchain-milvus: 0.1.4
milvus-model: 0.2.12

Any idea what could be causing this error? Any suggestions are really appreciated!

yhmo Apr 1, 2025
Collaborator

@shalini0311
Downgrade your scipy version to 1.14.1.

I believe there was a behavior change in scipy.sparse.vstack in scipy 1.15.0: the output of scipy.sparse.vstack(...).tocsr() changed.
milvus_model.sparse.BM25EmbeddingFunction calls scipy.sparse.vstack(...).tocsr() to generate a sparse array.

Use this script to test.
With scipy 1.14.1, it works fine.
With scipy 1.15.0, it throws "not enough values to unpack" error.

import numpy as np
from scipy.sparse import csr_array, vstack

sparse_embs = []

values = [1.0687022900763359, 1.4973262032085561]
rows = [0, 0]
cols = [1, 0]
sparse = csr_array((values, (rows, cols)), shape=(1, 4)).astype(np.float32)
sparse_embs.append(sparse)

values = [1.3658536585365855, 0.9395973154362417, 0.9395973154362417]
rows = [0, 0, 0]
cols = [3, 1, 2]
sparse = csr_array((values, (rows, cols)), shape=(1, 4)).astype(np.float32)
sparse_embs.append(sparse)

kk = vstack(sparse_embs).tocsr()
print(kk)
print("==========================")
for s in kk:
    row_indices, col_indices = s.nonzero()
    print(row_indices, col_indices)

I don't know why scipy made this change. Now the workaround is to downgrade the scipy version to 1.14.
I have created an issue to trace: milvus-io/milvus-model#76
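
For code that cannot pin scipy, one possible alternative is to read each row straight from the CSR buffers instead of iterating over rows and calling nonzero(). This is only a sketch under that assumption, not the fix adopted in milvus-model or langchain-milvus. The helper name `csr_rows_to_dicts` is hypothetical, and the sample values are the two rows from the test script above, written out by hand in CSR form.

```python
def csr_rows_to_dicts(indptr, indices, data):
    """Convert CSR buffers into one {column: value} dict per row."""
    rows = []
    for i in range(len(indptr) - 1):
        # indptr[i]:indptr[i + 1] is the slice of entries belonging to row i.
        start, end = indptr[i], indptr[i + 1]
        rows.append(
            {int(c): float(v) for c, v in zip(indices[start:end], data[start:end])}
        )
    return rows


# The two 1x4 rows from the test script, stacked and expressed as CSR buffers.
indptr = [0, 2, 5]
indices = [0, 1, 1, 2, 3]
data = [
    1.4973262032085561,
    1.0687022900763359,
    0.9395973154362417,
    0.9395973154362417,
    1.3658536585365855,
]

row_dicts = csr_rows_to_dicts(indptr, indices, data)
print(row_dicts)
```

With a SciPy CSR result such as `kk` from the script, the three inputs would be `kk.indptr`, `kk.indices`, and `kk.data`. Because the helper never unpacks the result of nonzero(), it does not depend on how many index arrays that call returns.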

Answer selected by shalini0311

shalini0311 Apr 1, 2025
Author

@yhmo , Thanks a lot for your help! Your solution works perfectly. I really appreciate you taking the time to identify the root cause—I was stuck on this issue for a long time and was wondering how you managed to pinpoint it so accurately.

Thanks again for your support!

codingjaguar · 2025-04-02T01:21:36Z

codingjaguar
Apr 2, 2025

Hi @shalini0311, unrelated to this issue: the recommendation is to use Milvus' native support for full text search instead of using from milvus_model.sparse import BM25EmbeddingFunction to generate sparse vectors externally. This support is also available in the langchain-milvus package. https://milvus.io/docs/full_text_search_with_langchain.md

This has several benefits: you don't need to worry about the vocabulary changing as you add more documents to the corpus, and it's more efficient than computing the BM25 score externally.

0 replies
