bm25 recall case #41014

zhu3359 · 2025-03-31T09:36:58Z

zhu3359
Mar 31, 2025

Following the Full Text Search (BM25) guide at https://milvus.io/docs/full-text-search.md#Full-Text-Search-BM25, I created a collection from a list of books. The book names are listed below:

穿越女尊后，我带着夫郎们暴富了
嘿！穿越仙界偷偷成大佬
穿越75，从召唤金雕开始带着全家吃肉
我自曝穿越者后，大明打爆了全球
还没穿越，我就有神级资质了？
穿越后我带全家摆摊种田
穿越成老娘后，把不孝子打服了
全校穿越：我打造了龙族序列
穿越六零，我靠代购发家致富

I used the standard analyzer and created a SPARSE_FLOAT_VECTOR field for the book name, called "book_name_bm25". The schema looks like this:
{'auto_id': False, 'description': '', 'fields': [{'name': 'book_id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'book_name', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512, 'enable_analyzer': True}}, {'name': 'author_id', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'author_name', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512, 'enable_analyzer': True}}, {'name': 'book_name_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'author_name_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'book_name_bm25', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}, {'name': 'author_name_bm25', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'is_function_output': True}], 'enable_dynamic_field': False, 'functions': [{'name': 'book_name_bm25_emb', 'description': '', 'type': <FunctionType.BM25: 1>, 'input_field_names': ['book_name'], 'output_field_names': ['book_name_bm25'], 'params': {}}, {'name': 'author_name_bm25_emb', 'description': '', 'type': <FunctionType.BM25: 1>, 'input_field_names': ['author_name'], 'output_field_names': ['author_name_bm25'], 'params': {}}]}

The search code is:

search_params = {
    'params': {'drop_ratio_search': 0.2},
}
res = client.search(
    collection_name="bm25_test",
    data=['穿越'],
    anns_field='book_name_bm25',
    limit=3,
    search_params=search_params
)
print(res)

The result list is empty, even though every book name contains the term "穿越". Is this expected? Is there a way to fix it, for example by changing the analyzer?

Answered by yhmo

yhmo · 2025-03-31T11:02:50Z

yhmo
Mar 31, 2025
Collaborator

Since the text is Chinese, you need to set the analyzer type to "chinese" so that the titles are segmented into Chinese words. See https://milvus.io/docs/analyzer-overview.md
Try this script:


from pymilvus import (
    MilvusClient, DataType, Function, FunctionType,
)

import random

client = MilvusClient(
    uri="http://localhost:19530",
    token="root:Milvus"
)
print(client.get_server_version())

collection_name = "BBB"

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text",
                 datatype=DataType.VARCHAR,
                 max_length=1000,
                 enable_analyzer=True,
                 analyzer_params={"type": "chinese",},
                 )
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)


bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25, # Set to BM25
)

schema.add_function(bm25_function)


index_params = client.prepare_index_params()

index_params.add_index(
    field_name="sparse",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX", # Inverted index type for sparse vectors
    metric_type="BM25",
)

client.drop_collection(collection_name=collection_name)
client.create_collection(
    collection_name=collection_name,
    schema=schema,
    index_params=index_params
)

client.insert(collection_name, [
    {'text': '穿越女尊后，我带着夫郎们暴富了'},
    {'text': '嘿！穿越仙界偷偷成大佬'},
    {'text': '穿越75，从召唤金雕开始带着全家吃肉'},
    {'text': '我自曝穿越者后，大明打爆了全球'},
    {'text': '还没穿越，我就有神级资质了？'},
    {'text': '穿越后我带全家摆摊种田'},
    {'text': '穿越成老娘后，把不孝子打服了'},
    {'text': '全校穿越：我打造了龙族序列'},
    {'text': '穿越六零，我靠代购发家致富'},

])

client.flush(collection_name=collection_name)
print(client.query(collection_name=collection_name, output_fields=["count(*)"]))

results = client.search(collection_name=collection_name,
                data=["种田穿越，我是大明"],
                anns_field="sparse",
                limit=3,
                output_fields=["text"],
                search_params={})
print("search result of text embedding")
for res in results:
    print("============================")
    for r in res:
        print(r)
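
To illustrate why the analyzer choice decides whether "穿越" can match at all, here is a minimal pure-Python sketch of BM25 over the same titles. It does not call Milvus. Both tokenizers are hypothetical stand-ins: `whole_run_tokens` models an analyzer that keeps each unbroken run of characters as one token (whether the Milvus standard analyzer behaves exactly this way is an assumption, not something verified here), and `bigram_tokens` is a crude substitute for real Chinese word segmentation.

```python
import math
import re
from collections import Counter

TITLES = [
    "穿越女尊后，我带着夫郎们暴富了",
    "嘿！穿越仙界偷偷成大佬",
    "穿越75，从召唤金雕开始带着全家吃肉",
    "我自曝穿越者后，大明打爆了全球",
    "还没穿越，我就有神级资质了？",
    "穿越后我带全家摆摊种田",
    "穿越成老娘后，把不孝子打服了",
    "全校穿越：我打造了龙族序列",
    "穿越六零，我靠代购发家致富",
]


def whole_run_tokens(text):
    # Hypothetical mismatched tokenizer: every run of word characters
    # between punctuation marks becomes a single token.
    return re.findall(r"\w+", text)


def bigram_tokens(text):
    # Hypothetical stand-in for Chinese segmentation: overlapping
    # two-character tokens inside each run of word characters.
    tokens = []
    for run in re.findall(r"\w+", text):
        if len(run) < 2:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def bm25_search(query, docs, tokenize, k1=1.2, b=0.75):
    # Tokenize the corpus and collect document frequencies per term.
    doc_tokens = [tokenize(doc) for doc in docs]
    n = len(docs)
    avgdl = sum(len(toks) for toks in doc_tokens) / n
    df = Counter()
    for toks in doc_tokens:
        df.update(set(toks))

    hits = []
    for doc, toks in zip(docs, doc_tokens):
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue  # the term never appears as a token in this document
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            hits.append((score, doc))
    # Only documents sharing at least one token with the query are returned.
    return sorted(hits, reverse=True)


print(len(bm25_search("穿越", TITLES, whole_run_tokens)))
print(len(bm25_search("穿越", TITLES, bigram_tokens)))
```

With the whole-run tokenizer, no title produces the exact token "穿越", so BM25 has nothing to score and the result is empty. Once the tokenizer emits "穿越" as its own token, every title becomes a candidate.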

0 replies

zhu3359 · 2025-03-31T12:29:59Z

zhu3359
Mar 31, 2025
Author

@yhmo The "chinese" analyzer type works. I then wrote a script to measure the BM25 search latency:

import time

search_params = {
    'params': {'drop_ratio_search': 0.2},
}
client.load_collection(
    collection_name="bm25_test"
)

# First search, on the author_name_bm25 field
start = time.time()
res = client.search(
    collection_name="bm25_test",
    data=['南瓜'],
    anns_field='author_name_bm25',
    limit=9,
    search_params=search_params
)
cost = time.time() - start
print(f"{res}")
print(f"cost: {cost}")

# Second search, on the book_name_bm25 field
start = time.time()
res = client.search(
    collection_name="bm25_test",
    data=['穿越'],
    anns_field='book_name_bm25',
    limit=9,
    search_params=search_params
)
cost = time.time() - start
print(f"{res}")
print(f"cost: {cost}")

The first search takes almost 300 ms, but the second takes less than 3 ms. What causes this difference?

2 replies

zhu3359 Mar 31, 2025
Author

I changed the search order and got a similar result. It seems that whichever search runs first takes much longer.

xiaofan-luan Mar 31, 2025
Maintainer

I changed the search order and got a similar result. It seems that whichever search runs first takes much longer.

You are right. Because of the mmap feature, the first search needs more time to warm the cache.
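
To keep the warm-up cost from distorting a benchmark, the first call can be measured separately from the rest. The sketch below is a minimal example under stated assumptions: `time_calls`, `summarize`, and `FakeSearch` are hypothetical helpers written for illustration, and `FakeSearch` only simulates a slow first call with `time.sleep`. It does not touch Milvus.

```python
import time


def time_calls(fn, repeats=5):
    # Call fn() `repeats` times and return each latency in milliseconds.
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(latencies):
    # Report the first (cold) call apart from the later (warm) calls.
    cold, warm = latencies[0], latencies[1:]
    warm_avg = sum(warm) / len(warm) if warm else None
    return {"cold_ms": cold, "warm_avg_ms": warm_avg}


class FakeSearch:
    # Hypothetical stand-in for a search call: slow once, fast afterwards.
    def __init__(self):
        self.warmed = False

    def __call__(self):
        if not self.warmed:
            time.sleep(0.05)  # simulated one-time warm-up cost
            self.warmed = True


stats = summarize(time_calls(FakeSearch(), repeats=5))
print(stats)
```

To time a real query, pass a callable that wraps the search instead of `FakeSearch()`, for example `lambda: client.search(collection_name="bm25_test", data=['穿越'], anns_field='book_name_bm25', limit=9, search_params=search_params)`, and compare `cold_ms` with `warm_avg_ms`.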
