-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Enhancing Elasticsearch vector store implementation #592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @l-trotta, Thanks for the update. I liked how script_score was quick to implement due to available examples, but I'm pleased we're switching to KNN for better performance. I find the automatic creation of index mappings convenient; however, for my projects, I typically need to define these mappings in advance. Could we consider adding an option to provide index mappings directly in the constructor? Additionally, could you check if the code at ElasticsearchAiSearchFilterExpressionConverter.java needs any improvements? I'm also looking forward to the official documentation for the Elasticsearch vector store of the Spring AI project, which I think @tzolov will address. |
Hi @JM-Lab, Since the index mapping can be added with The filter converter looked fine to me! |
Updated! I'd like to explain why I removed the
And since this implementation uses kNN search, setting this to false would make the data unsearchable. |
@tzolov what's the status of this PR? Thanks. |
@l-trotta thanks for your contribution and thank you for your patience. While reviewing the PR i've noticed that you have removed the capability to override the similarityFunction at runtime? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @l-trotta ,
It looks good but there few things that needs clarification/fixing. Please check my comments.
Looking forward for merging this improvement.
@@ -281,6 +282,12 @@ | |||
<optional>true</optional> | |||
</dependency> | |||
|
|||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to add the elasticsearch-java explicitly here? As it is already defined in the vector-store dependenies?
this.similarityFunction = COSINE_SIMILARITY_FUNCTION; | ||
} | ||
|
||
public ElasticsearchVectorStore withSimilarityFunction(String similarityFunction) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it required to remove the ability to change the similarity function at runtime?
This by the way breaks the ElasticsearchVectorStoreAutoConfigurationIT
tests.
@tzolov Thank you for the review, I'll explain my changes:
I didn't notice |
Thanks for the clarification @l-trotta ! It seems latest Spring Boot 3.3.x (the one we use for our auto-configs) is already at Elasticsearch Client 8.13.4. Perhaps you can mention (in a separate PR) this requirement in the docs. E.g. if someone is using pre 3.3. Boot they have to add the elastic search |
rebased, squashed and merged at 78f3797 |
@l-trotta thank you for the great work! |
@tzolov thank you! I'll test with an older boot version, if there could be version problems I'll open another PR for the documentation. |
is there any estimation as for when these changes will be released? |
@tzolov do you know when this PR will be public released? We would like to prepare some materials to promote this integration for Elasticsearch. Thanks! |
This PR provides a more performant search function for the Elasticsearch vector store and removes the bean autoconfiguration.
In depth
Autoconfiguration
The current implementation of
afterPropertiesSet()
automatically creates a new index with a set of properties that only works with OpenAI, or any other model that works with vectors with a dimension of 1536; users adopting other models would currently have to manually delete the index. By default, Elasticsearch automatically creates the correct index settings when it receives the first PUT request for vectors, so in our opinion there's no need for such autoconfiguration.Search
The function used now is script_score, which can be slow for large data samples since it does a brute force comparison with all vectors using the similarity function. This PR replaces it with the approximate knn search, more performant because it only scans the closest neighbours. The similarity functions available for
knn
can be easily configured in the index mapping by setting the correct name.We would also like to contribute to the documentation, should that be done in a different PR?
Thank you @JM-Lab for the original implementation, would you like to review these changes?
(Disclosure: I work for Elastic)