Custom analyzer issue #41262

Qiang-HU · 2025-04-12T07:55:23Z


analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["my_vocab"],
        "mode": "search",
        "hmm": False
    }
}
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    analyzer_params=analyzer_params,
    max_length=2048,
    enable_analyzer=True,
    enable_match=True
)
When I customize the jieba tokenizer and enable text match with enable_match=True, creating the collection fails with:
pymilvus.exceptions.MilvusException: <MilvusException: (code=2000, message=failed to validate text schema, C Runtime Exception: Assert "res.result_->success" => Tokenizer creation failed: create tokenizer failed with error: InternalError: tokenizer name should be string param: {"tokenizer":{"type":"jieba","dict":["default"],"mode":"search","hmm":false}} at /workspace/source/internal/core/thirdparty/tantivy/tokenizer.h:19

Creation only succeeds with analyzer_params = {"tokenizer": "jieba"}.

Does this mean a custom jieba tokenizer cannot be used when text match is enabled?

Also, calling client.run_analyzer() raises AttributeError: 'MilvusClient' object has no attribute 'run_analyzer'.
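
The AttributeError above suggests the installed pymilvus client does not expose run_analyzer. A minimal sketch of guarding the call with feature detection follows. The helper name run_analyzer_if_available and the stand-in class _OldClient are hypothetical and exist only for illustration; arguments are passed through unchanged, so nothing is assumed here about the real method's signature.

```python
def run_analyzer_if_available(client, *args, **kwargs):
    """Call client.run_analyzer(...) if the method exists, else return None.

    Hypothetical helper (not a pymilvus API).
    """
    run = getattr(client, "run_analyzer", None)
    if run is None:
        # The installed client predates run_analyzer; let the caller decide.
        return None
    return run(*args, **kwargs)


class _OldClient:
    """Stand-in for a client without run_analyzer (illustration only)."""


result = run_analyzer_if_available(_OldClient(), "苹果电脑")
print(result)  # None, because _OldClient has no run_analyzer method
```

With a real MilvusClient instance in place of _OldClient, the helper would forward the call only when the method is present.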

xiaofan-luan · 2025-04-12T23:59:31Z

Maintainer

This is not released yet?

Right now Milvus doesn't support a customized dict.

1 reply

Qiang-HU Apr 13, 2025
Author

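Thank you very much for taking the time to answer my question!
I'm sorry, though; I may not have phrased my question well, so let me restate it. Here is my code: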
from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri='http://localhost:19530')
schema = client.create_schema(auto_id=True, enable_dynamic_field=False)

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["default", "苹果电脑", "电子宠物"],
        "mode": "search",
        "hmm": False
    }
}

schema.add_field(field_name="vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR,
                 analyzer_params=analyzer_params,
                 max_length=2048, enable_analyzer=True, enable_match=True)

bm25_function = Function(
    name="text_bm25_emb",
    input_field_names=["text"],
    output_field_names=["vector"],
    function_type=FunctionType.BM25,
)
schema.add_function(bm25_function)
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="BM25",
    params={"inverted_index_algo": "DAAT_MAXSCORE", "bm25_k1": 1.2, "bm25_b": 0.75},
)
client.create_collection(collection_name='test_bm25', schema=schema, index_params=index_params)

I want to do full-text search or text match based on BM25 and add some of my own words to the tokenizer, so I followed the docs:
https://milvus.io/docs/chinese-analyzer.md
https://milvus.io/docs/jieba-tokenizer.md
I found the approach shown in my code (from the "Custom configuration" section), but creating the collection fails with:
pymilvus.exceptions.MilvusException: <MilvusException: (code=2000, message=failed to validate text schema, C Runtime Exception: Assert "res.result_->success" => Tokenizer creation failed: create tokenizer failed with error: InternalError: tokenizer name should be string param: {"tokenizer":{"type":"jieba","dict":["default","\u82f9\u679c\u7535\u8111","\u7535\u5b50\u5ba0\u7269"],"mode":"search","hmm":false}} at /workspace/source/internal/core/thirdparty/tantivy/tokenizer.h:19
Judging from the error, the Tantivy library in use does not accept a dictionary for the tokenizer parameter of analyzer_params; it must be a string.
This seems to conflict with the documentation, or am I using it incorrectly?
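
As a possible stopgap for the error above, here is a minimal sketch of a helper that downgrades the dictionary-form tokenizer configuration to the plain string name, which is the only form reported to work in this thread. The helper name to_string_tokenizer is hypothetical and not part of pymilvus. Note that the downgrade drops the custom dict, mode, and hmm settings, so the custom words are not applied.

```python
import copy


def to_string_tokenizer(analyzer_params):
    """Return a copy of analyzer_params whose tokenizer is a plain string.

    Hypothetical helper (not a pymilvus API). Custom tokenizer options
    such as "dict", "mode", and "hmm" are discarded.
    """
    params = copy.deepcopy(analyzer_params)
    tokenizer = params.get("tokenizer")
    if isinstance(tokenizer, dict):
        # Keep only the tokenizer type, e.g. "jieba".
        params["tokenizer"] = tokenizer["type"]
    return params


custom = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["default", "苹果电脑", "电子宠物"],
        "mode": "search",
        "hmm": False,
    }
}
fallback = to_string_tokenizer(custom)
print(fallback)  # {'tokenizer': 'jieba'}
```

The result could then be passed as analyzer_params to schema.add_field on servers that reject the dictionary form, at the cost of losing the custom vocabulary.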

xiaofan-luan · 2025-04-13T17:38:59Z

Maintainer

I'm assuming this is not supported until 2.6.

I will double-check with the documentation team.
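
If the 2.6 threshold mentioned above turns out to be right (the maintainer states it only as an assumption), a client could check the server version before sending the dictionary-form tokenizer config. A minimal sketch follows: the function names parse_version and may_support_custom_jieba are hypothetical, the version string is assumed to have been obtained elsewhere, and the (2, 6) default simply mirrors the assumption in this reply.

```python
import re


def parse_version(version):
    """Extract (major, minor) from strings such as 'v2.5.8' or '2.6.0'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def may_support_custom_jieba(version, minimum=(2, 6)):
    """Return True if version is at least `minimum`.

    The (2, 6) default is taken from the unconfirmed assumption in this
    thread, not from verified release notes.
    """
    return parse_version(version) >= minimum


print(may_support_custom_jieba("v2.5.8"))  # False
print(may_support_custom_jieba("2.6.0"))   # True
```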

1 reply

Qiang-HU Apr 14, 2025
Author

OK, thank you very much.
