Skip to content

Enhancing Elasticsearch vector store implementation #592

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

l-trotta
Copy link
Contributor

This PR provides a more performant search function for the Elasticsearch vector store and removes the bean autoconfiguration.

In depth

Autoconfiguration

The current implementation of afterPropertiesSet() automatically creates a new index with a set of properties that only works with OpenAI, or any other model that works with vectors with a dimension of 1536; users adopting other models would currently have to manually delete the index. By default, Elasticsearch automatically creates the correct index settings when it receives the first PUT request for vectors, so in our opinion there's no need for such autoconfiguration.

Search

The function used now is script_score, which can be slow for large data samples since it does a brute force comparison with all vectors using the similarity function. This PR replaces it with the approximate knn search, more performant because it only scans the closest neighbours. The similarity functions available for knn can be easily configured in the index mapping by setting the correct name.

We would also like to contribute to the documentation, should that be done in a different PR?

Thank you @JM-Lab for the original implementation, would you like to review these changes?

(Disclosure: I work for Elastic)

@JM-Lab
Copy link
Contributor

JM-Lab commented Apr 17, 2024

Hi @l-trotta,

Thanks for the update. I liked how script_score was quick to implement due to available examples, but I'm pleased we're switching to KNN for better performance.

I find the automatic creation of index mappings convenient; however, for my projects, I typically need to define these mappings in advance. Could we consider adding an option to provide index mappings directly in the constructor?

Additionally, could you check if the code at ElasticsearchAiSearchFilterExpressionConverter.java needs any improvements?

I'm also looking forward to the official documentation for the Elasticsearch vector store of the Spring AI project, which I think @tzolov will address.

@tzolov tzolov added enhancement New feature or request vector store labels Apr 20, 2024
@l-trotta
Copy link
Contributor Author

Hi @JM-Lab,

Since the index mapping can be added with createIndexMapping(), and in most cases the autoconfiguration is enough, we chose not to add the mapping to the constructor to make it easier to configure it at the start.

The filter converter looked fine to me!
Thanks again for your work.

@tzolov tzolov self-assigned this Apr 26, 2024
@tzolov tzolov added this to the 1.0.0-M1 milestone Apr 26, 2024
@tzolov
Copy link
Contributor

tzolov commented Apr 30, 2024

@l-trotta , thank you for the improvements.
Could you please rebase and update your PR after the #633 merge.

@l-trotta
Copy link
Contributor Author

l-trotta commented May 6, 2024

Updated! I'd like to explain why I removed the dense-vector-indexing property: quoting the documentation,

If true, you can search this field using the kNN search API. Defaults to true.

And since this implementation uses kNN search, setting this to false would make the data unsearchable.

@tzolov tzolov modified the milestones: 1.0.0-M1, 1.0.0-M2 May 28, 2024
@ezimuel
Copy link

ezimuel commented Jun 10, 2024

@tzolov what's the status of this PR? Thanks.

@tzolov
Copy link
Contributor

tzolov commented Jun 16, 2024

@l-trotta thanks for your contribution and thank you for your patience.

While reviewing the PR i've noticed that you have removed the capability to override the similarityFunction at runtime?
Is there any particular reason to reduce this flexibility? Either way it renders this ElasticsearchVectorStoreAutoConfigurationIT almost useless?
If there are not strong reasons to have this immutable perhaps we should re-enable the withSimilarityFunction setter?

Copy link
Contributor

@tzolov tzolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @l-trotta ,
It looks good but there few things that needs clarification/fixing. Please check my comments.
Looking forward for merging this improvement.

@@ -281,6 +282,12 @@
<optional>true</optional>
</dependency>

<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to add the elasticsearch-java explicitly here? As it is already defined in the vector-store dependenies?

this.similarityFunction = COSINE_SIMILARITY_FUNCTION;
}

public ElasticsearchVectorStore withSimilarityFunction(String similarityFunction) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it required to remove the ability to change the similarity function at runtime?
This by the way breaks the ElasticsearchVectorStoreAutoConfigurationIT tests.

@l-trotta
Copy link
Contributor Author

l-trotta commented Jun 17, 2024

@tzolov Thank you for the review, I'll explain my changes:

  • Removal of the withSimilarityFunction setter: in elasticsearch the similarity function is defined as a mapping property of the index and cannot be changed, trying to update the property in the mapping will result in the following error:
    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": """Mapper for [embedding] conflicts with existing mapper:
        Cannot update parameter [similarity] from [dot_product] to [cosine]"""
          }
        ],
        "type": "illegal_argument_exception",
        "reason": """Mapper for [embedding] conflicts with existing mapper:
        Cannot update parameter [similarity] from [dot_product] to [cosine]"""
      },
      "status": 400
    }
    
    so the only way to change mapping is to create another index, and in this case a new vector store instance.
  • Having elasticsearch-java explicitly defined in autoconfiguration: while testing the autoconfiguration I noticed it was importing an older version of the client (8.10.4), which breaks the current implementation of the vector store since the knn query was recently fixed with version 8.13.3, so I wanted to make sure that this would be the version used, but maybe it's an issue with my local environment? If everything works without that it can be removed.

I didn't notice ElasticsearchVectorStoreAutoConfigurationIT being broken, sorry for that, I updated it.

@tzolov
Copy link
Contributor

tzolov commented Jun 17, 2024

Thanks for the clarification @l-trotta !

It seems latest Spring Boot 3.3.x (the one we use for our auto-configs) is already at Elasticsearch Client 8.13.4.
So I guess we are fine without the forced dependencies?

Perhaps you can mention (in a separate PR) this requirement in the docs. E.g. if someone is using pre 3.3. Boot they have to add the elastic search 8.13.4 dependency to their poms?

@tzolov
Copy link
Contributor

tzolov commented Jun 17, 2024

rebased, squashed and merged at 78f3797

@tzolov tzolov closed this Jun 17, 2024
@tzolov
Copy link
Contributor

tzolov commented Jun 17, 2024

@l-trotta thank you for the great work!
Just merged the PR.
Please check if the dependencies work as expected and consider adding a NOTE in the documentation as suggested above.

@l-trotta
Copy link
Contributor Author

@tzolov thank you! I'll test with an older boot version, if there could be version problems I'll open another PR for the documentation.

@l-trotta
Copy link
Contributor Author

is there any estimation as for when these changes will be released?

@ezimuel
Copy link

ezimuel commented Aug 7, 2024

@tzolov do you know when this PR will be public released? We would like to prepare some materials to promote this integration for Elasticsearch. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request vector store
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants