FLAT Index in Milvus Producing Low Similarity Scores and Unrelated Top-K Results #40743
Unanswered
Bhagyashreet20 asked this question in Q&A and General discussion
Replies: 3 comments · 7 replies
-
Thanks for the details you offered. I quickly went through your code; it seems well organized and I didn't find any bugs. To debug:
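One quick sanity check (a sketch, assuming pymilvus, a collection named wiki_chunks, and the field names from the question) is a self-search: fetch a vector that is already stored and search with it. With a COSINE index the top hit should be that same row with a score close to 1.0; if it is not, the problem is more likely in the index or metric configuration than in the embeddings themselves.

```python
from pymilvus import connections, Collection

connections.connect(alias="default", uri="http://localhost:19530")  # assumed address
collection = Collection("wiki_chunks")  # assumed collection name
collection.load()

# Fetch one stored row, including its embedding
# (returning vector fields from query needs Milvus >= 2.3).
row = collection.query(expr="", output_fields=["url_id", "embedding"], limit=1)[0]

# Search with that exact vector; the stored row itself should come back first
# with a COSINE score very close to 1.0.
results = collection.search(
    data=[row["embedding"]],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {}},
    limit=5,
    output_fields=["url_id"],
)
for hit in results[0]:
    print(hit.entity.get("url_id"), hit.distance)
```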
-
Possible reason: the primary field "url_id" has duplicate ids.
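If the ingest-side id list is no longer at hand, one way to look for repeated url_id values is to stream the primary keys back out of the collection and count them client-side. This is only a sketch (collection name and connection address are assumed), and it relies on duplicate rows being returned individually by the query iterator.

```python
from collections import Counter

from pymilvus import connections, Collection

connections.connect(alias="default", uri="http://localhost:19530")  # assumed address
collection = Collection("wiki_chunks")  # assumed collection name
collection.load()

# Stream every primary key out of the collection and count repeats.
counts = Counter()
iterator = collection.query_iterator(batch_size=10000, output_fields=["url_id"])
while True:
    batch = iterator.next()
    if not batch:
        iterator.close()
        break
    counts.update(row["url_id"] for row in batch)

duplicated = {pk: n for pk, n in counts.items() if n > 1}
print(f"rows seen: {sum(counts.values())}, unique url_id: {len(counts)}, duplicated: {len(duplicated)}")
```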
-
Verify the embedding data by the following steps:
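For instance, one way to spot-check the stored vectors (a sketch; collection name and connection address are assumed) is to re-embed a stored chunk's text with the same model and confirm the fresh vector matches the stored embedding with cosine similarity close to 1.0.

```python
import numpy as np
from openai import OpenAI
from pymilvus import connections, Collection

connections.connect(alias="default", uri="http://localhost:19530")  # assumed address
collection = Collection("wiki_chunks")  # assumed collection name
collection.load()

# Pick one stored chunk together with its text and embedding.
row = collection.query(expr="", output_fields=["url_id", "text", "embedding"], limit=1)[0]

# Re-embed the stored text with the same model used at ingestion time.
client = OpenAI()
fresh = client.embeddings.create(
    model="text-embedding-3-small", input=row["text"]
).data[0].embedding

stored = np.asarray(row["embedding"], dtype=np.float32)
fresh = np.asarray(fresh, dtype=np.float32)
cosine = float(stored @ fresh / (np.linalg.norm(stored) * np.linalg.norm(fresh)))
print(f"url_id={row['url_id']}  cosine(stored, re-embedded) = {cosine:.4f}")  # expect ~1.0
```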
-
I'm using Milvus to store 2.4M Wikipedia document chunks with OpenAI embeddings (text-embedding-3-small) to power a Retrieval-Augmented Generation (RAG) system. However, I am encountering unexpected retrieval issues.

Setup Details
Milvus Version: 2.5.4
Embedding Model: text-embedding-3-small
Embedding Dimension: 1536
Chunk Sizes Tried: 256, 1024
Chunk Overlap: 50
Total Documents: 2462190
Milvus Indexes Used: FLAT, HNSW
Metric Types Tried: COSINE and L2
TopK: 5, 10, 20, 100, 500
Database Fields (a schema sketch follows this list):
url_id (Primary Key)
embedding (FLOAT_VECTOR, dim=1536)
text (VARCHAR, max_length=5000)
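For reference, the schema and the FLAT/COSINE index described above can be declared along these lines with pymilvus. This is a sketch only: the collection name wiki_chunks, the VARCHAR type for url_id, and the connection address are assumptions, not taken from the attached scripts.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(alias="default", uri="http://localhost:19530")  # assumed address

# Assumption: url_id is a VARCHAR primary key; adjust if your script uses INT64.
fields = [
    FieldSchema(name="url_id", dtype=DataType.VARCHAR, max_length=512, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000),
]
schema = CollectionSchema(fields, description="Wikipedia chunks with OpenAI embeddings")
collection = Collection(name="wiki_chunks", schema=schema)  # assumed collection name

# FLAT is brute-force search, so the only parameter that matters is the metric type;
# it has to match the metric_type passed at search time (COSINE here).
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "FLAT", "metric_type": "COSINE", "params": {}},
)
collection.load()
```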
Example
I ran multiple queries extracted from stored Wikipedia chunks, expecting at least top_k=5 relevant documents to match the query source. However, the retrieved results are completely unrelated.

Example 1
Query:
"What television show did Rauch regularly contribute to?"
Expected: The top results should reference Melissa Rauch, who contributed to Best Week Ever.
Observed Results: None of the top 10 retrieved documents are relevant, and the similarity scores are very low (0.09 - 0.1). (Full JSON results attached below for reference.)
Attached scripts:
create_doc_store.txt - this file operates in 3 modes
RAG.txt - this file reads the queries, generates embeddings, and retrieves the top-k documents from the database to generate the LLM answer (a rough sketch of this retrieval step follows below)
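The retrieval step is roughly of the following shape (a sketch, not the attached script itself; the collection name and connection address are assumed): embed the query with text-embedding-3-small and search Milvus with the same metric the index was built with.

```python
from openai import OpenAI
from pymilvus import connections, Collection

connections.connect(alias="default", uri="http://localhost:19530")  # assumed address
collection = Collection("wiki_chunks")  # assumed collection name
collection.load()

query = "What television show did Rauch regularly contribute to?"

# Embed the query with the same model used for the stored chunks.
client = OpenAI()
query_vec = client.embeddings.create(
    model="text-embedding-3-small", input=query
).data[0].embedding

results = collection.search(
    data=[query_vec],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {}},  # must match the index's metric
    limit=10,
    output_fields=["url_id", "text"],
)
for hit in results[0]:
    print(f"{hit.distance:.3f}  {hit.entity.get('url_id')}  {hit.entity.get('text')[:80]}")
```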
Questions for the Milvus Team
Why are similarity scores so low (~0.05) even after trying many configuration options? Shouldn't similar vectors have scores close to 1?
Why does FLAT fail to return exact matches? Could this be an indexing/storage issue or some other issue?
Could this behavior be related to the size of the dataset (2.4M documents)?
If a query has approximately 100 related document chunks (extracted from the same source document as the query) within a pool of 2.4M documents, would you expect these 100 related chunks to appear in the top-k results when setting top_k=10 or top_k=20? If not, what factors could be causing their exclusion from the top-k results?
results.json
Any guidance on why this issue is happening and possible solutions would be greatly appreciated!