Inconsistent search results #42717

suckseed5 · 2025-06-13T06:37:51Z

suckseed5
Jun 13, 2025

After inserting the data, the search can find the correct data, but after a while, the retrieved data is incorrect.
When inserting the data again, it can be retrieved again.
Why is this the case?

    def _search(self, query_text, top_k=3, mode="vector", score_threshold=0.0, partition_name=None, ranker_params=[0.8, 0.2]):
        self.client.load_collection(collection_name=self.collection_name)
        print("Running hybrid search...")
        query_vector = self._embed_text(query_text)

        dense_req = AnnSearchRequest(
            data=[query_vector],
            anns_field="dense_vector",
            param={"metric_type": "IP", "index_type": "HNSW", "params": {"M": 8, "efConstruction": 64}},
            limit=top_k
        )
        sparse_req = AnnSearchRequest(
            data=[query_text],
            anns_field="sparse_vector",
            param={"metric_type": "BM25", "params": {"drop_ratio_build": 0.2}},
            limit=top_k
        )

        # Use the weighted ranking strategy
        ranker_weight = WeightedRanker(ranker_params[0], ranker_params[1])
        # Use RRFRanker
        ranker_rrf = RRFRanker(100)

        search_kwargs = {
            "collection_name": self.collection_name,
            "reqs": [dense_req, sparse_req],
            "ranker": ranker_weight,
            "limit": top_k,
            "output_fields": ["pk", "text", "metadata"],
            "consistency_level": "Strong",
            "partition_names": ["partition_a", "partition_b"]
        }
        results = self.client.hybrid_search(**search_kwargs)
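
As a side note on the request parameters above: `M` and `efConstruction` are HNSW index-build parameters and `drop_ratio_build` is a sparse index-build parameter, whereas search-time tuning normally uses `ef` for HNSW and `drop_ratio_search` for the sparse index. This is not necessarily related to the inconsistency being asked about. A minimal sketch of search-time parameter dicts follows; the helper name and the default values are assumptions for illustration only.

```python
def build_search_params(top_k, ef=64, drop_ratio_search=0.2):
    """Return (dense_param, sparse_param) dicts intended for search-time use.

    Hypothetical helper: the name and defaults are illustrative, not from the thread.
    """
    dense_param = {
        "metric_type": "IP",
        # `ef` is the HNSW search-time knob; keep it at least as large as top_k.
        "params": {"ef": max(ef, top_k)},
    }
    sparse_param = {
        "metric_type": "BM25",
        # Search-time counterpart of the build-time drop ratio.
        "params": {"drop_ratio_search": drop_ratio_search},
    }
    return dense_param, sparse_param
```

The two dicts would be passed as the `param` argument of the dense and sparse `AnnSearchRequest` objects respectively.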

yhmo · 2025-06-13T06:44:47Z

yhmo
Jun 13, 2025
Collaborator

Possible reason: there are duplicate primary keys in the collection.
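
One way to check for this is to pull the primary keys back with a query and count repeats. A minimal sketch, assuming the primary key field is named `pk` (as in the search code above) and that `rows` is a list of dicts such as the one returned by `client.query(..., output_fields=["pk"])`; the helper name is made up for illustration.

```python
from collections import Counter


def find_duplicate_pks(rows, pk_field="pk"):
    """Return {pk: occurrences} for every primary key that appears more than once."""
    counts = Counter(row[pk_field] for row in rows)
    return {pk: n for pk, n in counts.items() if n > 1}
```

An empty result means no duplicates were found among the rows that were fetched.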

2 replies

suckseed5 Jun 13, 2025
Author

Hi, is there a problem with the way I create the collection, shown below?

    def _create_collection(self):
        schema = self.client.create_schema(
            auto_id=True,
            enable_dynamic_fields=False,
        )
        dim = len(self._embed_text("dimensions"))
        schema.add_field("pk", DataType.INT64, is_primary=True, max_length=100)
        schema.add_field("text", DataType.VARCHAR, max_length=65535, enable_analyzer=True,enable_match=True,analyzer_params =self.analyzer_params )
        schema.add_field("dense_vector", DataType.FLOAT_VECTOR, dim=dim)
        schema.add_field("sparse_vector", DataType.SPARSE_FLOAT_VECTOR)
        schema.add_field("metadata",DataType.JSON, max_length=65535)

        bm25_function = Function(
            name="text_bm25_emb",
            input_field_names=["text"],
            output_field_names=["sparse_vector"],
            function_type=FunctionType.BM25,
        )
        schema.add_function(bm25_function)

        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema
        )
        print(f"Collection `{self.collection_name}` created.")

    def _create_index(self):
        index_params = self.client.prepare_index_params()

        index_params.add_index(
            field_name="dense_vector",
            index_type="HNSW", 
            metric_type="IP",  
            params={"M": 8, "efConstruction": 64}
        )

        # Sparse vector index (keyword search)
        index_params.add_index(
            field_name="sparse_vector",
            index_name="sparse_inverted_index",
            index_type="SPARSE_INVERTED_INDEX",
            metric_type="BM25",
        )

        self.client.create_index(
            collection_name=self.collection_name,
            index_params=index_params
        )

        print("Dense and sparse indexes created.")

    def insert_documents(self, documents=None, partition_name=None):
        data = []
        start_time = time.time()
        for idx, doc in enumerate(documents):
            embedding = self._embed_text(doc.page_content)
            document = {
                    "text": doc.page_content,
                    "dense_vector": embedding,
                    "metadata": doc.metadata,
                    }
            data.append(document)
        end_time = time.time()
        time_need = end_time-start_time
        texts_len = len(documents)
        res = self.client.insert(
            collection_name=self.collection_name,
            data=data,
            partition_name=partition_name
        )
        insert_end_time = time.time()
        insert_need_time = insert_end_time - end_time
        if partition_name:
            self.client.load_partitions(
                collection_name=self.collection_name,
                partition_names=[partition_name],
            )
        else:
            self.client.load_collection(collection_name=self.collection_name)
        
        self.print_all_partition_row_counts(partition_name)
        part_info = f"in partition `{partition_name}`" if partition_name else ""
        print(f"Inserted {len(documents)} vectors into `{self.collection_name}` {part_info}.")
        return res

yhmo Jun 16, 2025
Collaborator

The PK is auto-generated by Milvus, and Milvus ensures each PK is unique. This script seems to have no problem.

a. "After inserting the data, the search can find the correct data"
b. "but after a while, the retrieved data is incorrect."
c. "When inserting the data again, it can be retrieved again."

What is the "query_text" and what result is returned for a, b, and c?
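
To compare stages a, b, and c, it may also help to record the entity count next to the search results at each stage. A minimal sketch, assuming `client` is a `MilvusClient`-like object; the helper name is hypothetical.

```python
def count_entities(client, collection_name):
    """Return the number of entities visible to a strongly consistent query."""
    rows = client.query(
        collection_name=collection_name,
        output_fields=["count(*)"],
        consistency_level="Strong",
    )
    # A count(*) query returns a single row keyed by "count(*)".
    return rows[0]["count(*)"]
```

If the count drops between stage a and stage b, the data itself has gone missing from the loaded segments; if it stays the same, the difference is more likely in how the search ranks or filters results.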

xiaofan-luan · 2025-06-13T17:57:24Z

xiaofan-luan
Jun 13, 2025
Maintainer

Why doesn't the data have a primary key?

2 replies

suckseed5 Jun 16, 2025
Author

schema.add_field("pk", DataType.INT64, is_primary=True, max_length=100)
Isn't it correct to write it like this?

xiaofan-luan Jun 16, 2025
Maintainer

        document = {
                "text": doc.page_content,
                "dense_vector": embedding,
                "metadata": doc.metadata,
                }
        data.append(document)

Your data doesn't seem to have a PK.
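
For context, the schema earlier in the thread sets `auto_id=True`, in which case the rows passed to `insert()` leave out the `pk` field and Milvus generates it, as noted in the other reply. A minimal sketch of that pattern, assuming `client` is a `MilvusClient`-like object whose `insert()` result exposes the generated keys under `"ids"`; the helper name is hypothetical.

```python
def insert_with_auto_id(client, collection_name, rows, pk_field="pk"):
    """Insert rows that carry no primary key and return the generated keys."""
    for row in rows:
        if pk_field in row:
            # With auto_id=True the server assigns the key, so rows should omit it.
            raise ValueError(f"row should not set {pk_field!r} when auto_id=True")
    result = client.insert(collection_name=collection_name, data=rows)
    return list(result["ids"])
```

Keeping the returned keys makes it possible to look the same entities up again later when comparing search results.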

suckseed5 · 2025-06-16T00:51:48Z

suckseed5
Jun 16, 2025
Author

I found something new. The two servers differ only in memory size, and the problem occurs on the one with less memory, even though the data volume is only 3,000 records.

1 reply

xiaofan-luan Jun 16, 2025
Maintainer

Why would that matter for 3,000 records?

Splitting data into multiple segments takes at least several GB of data. If all the data is in a single segment, you need to load with 2 replicas; otherwise the segment can be loaded on only one node.
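
Loading with two replicas might look like the sketch below. It assumes `client` is a `MilvusClient`-like object and that the cluster has enough query nodes to host two replicas; the helper name is made up for illustration.

```python
def load_with_replicas(client, collection_name, replica_number=2):
    """Load a collection with the requested number of in-memory replicas."""
    client.load_collection(
        collection_name=collection_name,
        replica_number=replica_number,
    )
```

In the code from this thread, that would replace the plain `self.client.load_collection(collection_name=self.collection_name)` call.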

suckseed5 · 2025-06-16T04:50:32Z

suckseed5
Jun 16, 2025
Author

Thank you, everyone. I found the problem: I need to flush after inserting.

self.client.flush(collection_name=self.collection_name)
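
In context, the change amounts to something like the following sketch, which assumes `client` is a `MilvusClient`-like object; the helper name is hypothetical.

```python
def insert_and_flush(client, collection_name, rows, partition_name=None):
    """Insert rows, then flush the collection so the new data is sealed."""
    kwargs = {"collection_name": collection_name, "data": rows}
    if partition_name:
        # Only pass a partition when one was actually requested.
        kwargs["partition_name"] = partition_name
    result = client.insert(**kwargs)
    client.flush(collection_name=collection_name)
    return result
```

Inside `insert_documents`, the call would be along the lines of `insert_and_flush(self.client, self.collection_name, data, partition_name)`.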

2 replies

yhmo Jun 16, 2025
Collaborator

But you have set "consistency_level" to "Strong", so there should be no need to call flush().

suckseed5 Jun 18, 2025
Author

It's strange, but it works after changing it. It seems to be related to the server memory configuration.
