[MEVD] Integrate with MEAI (Embedding and IEmbeddingGenerator) #10492

@roji

Background / Current status

MEVD is currently a separate package with no relation to Microsoft.Extensions.AI, which is where embedding-related types such as IEmbedding and IEmbeddingGenerator live. Because of this, MEVD doesn't currently expose APIs that make use of embedding types; the abstraction requires users to provide e.g. a ReadOnlyMemory<float>, so they have to handle embedding generation manually outside MEVD. This makes MEVD a very low-level abstraction that's not easy to work with; a typical user of a vector database wants to set things up, and then simply run a vector search over e.g. some text, with embedding generation happening under the hood as an implementation detail. In other words, typical users should not need to deal directly with embeddings.

MEVD does contain IVectorizableTextSearch, an interface that allows performing a similarity search given a string; the string is vectorized automatically under the hood and the resulting embedding is used for the search. There are two reasons for IVectorizableTextSearch. The first (a) is that some databases can take care of embedding generation themselves; connectors for such databases implement the interface directly. The second (b) is to allow users to combine a vector database and a .NET embedding generator, exposing both via an IVectorizableTextSearch implementation that runs the generator and passes the result to the database. The problem is that no such implementation for (b) currently exists - users have to provide an implementation and wire things up themselves in their application.
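To make option (b) concrete, here is a minimal sketch of the kind of glue code users currently have to write themselves. Every type in it (ITextEmbeddingGenerator, IVectorSearch, GeneratorBackedTextSearch and the toy implementations) is a hypothetical stand-in invented for illustration, not the actual MEAI or MEVD API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Usage: combine a toy generator and a toy store, then search by text.
var search = new GeneratorBackedTextSearch(new LengthEmbeddingGenerator(), new InMemoryVectorSearch());
IReadOnlyList<string> hits = await search.SearchTextAsync("hello", top: 1);
Console.WriteLine(hits[0]);

// Hypothetical stand-in for an embedding generator that accepts text.
public interface ITextEmbeddingGenerator
{
    Task<ReadOnlyMemory<float>> GenerateAsync(string text);
}

// Hypothetical stand-in for the low-level, embedding-based search API.
public interface IVectorSearch
{
    Task<IReadOnlyList<string>> SearchAsync(ReadOnlyMemory<float> vector, int top);
}

// The wrapper for (b): run the generator, then pass the embedding to the database.
public sealed class GeneratorBackedTextSearch
{
    private readonly ITextEmbeddingGenerator _generator;
    private readonly IVectorSearch _store;

    public GeneratorBackedTextSearch(ITextEmbeddingGenerator generator, IVectorSearch store)
    {
        _generator = generator;
        _store = store;
    }

    public async Task<IReadOnlyList<string>> SearchTextAsync(string text, int top)
    {
        ReadOnlyMemory<float> embedding = await _generator.GenerateAsync(text);
        return await _store.SearchAsync(embedding, top);
    }
}

// Toy generator: a one-dimensional "embedding" holding the text length.
public sealed class LengthEmbeddingGenerator : ITextEmbeddingGenerator
{
    public Task<ReadOnlyMemory<float>> GenerateAsync(string text) =>
        Task.FromResult<ReadOnlyMemory<float>>(new float[] { text.Length });
}

// Toy store: ranks records by distance on the single dimension.
public sealed class InMemoryVectorSearch : IVectorSearch
{
    private readonly (string Text, float[] Vector)[] _records =
    {
        ("hi", new float[] { 2 }),
        ("world", new float[] { 5 }),
    };

    public Task<IReadOnlyList<string>> SearchAsync(ReadOnlyMemory<float> vector, int top)
    {
        float query = vector.Span[0];
        IReadOnlyList<string> result = _records
            .OrderBy(r => Math.Abs(r.Vector[0] - query))
            .Select(r => r.Text)
            .Take(top)
            .ToList();
        return Task.FromResult(result);
    }
}
```

The wrapper itself is only a few lines; the point of the proposal is that users should not have to write and register it in every application.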

High-level proposal

First, in order for MEVD to provide a better, more high-level user experience, it must have access to IEmbedding/IEmbeddingGenerator. The proposal here is for MEVD to take a reference on MEAI in order to access these types; the alternatives discussed were to move IEmbedding/IEmbeddingGenerator to a third place (M.E.Embeddings, or just System), and to merge MEVD and MEAI. I believe referencing MEAI from MEVD is non-controversial, so I won't go into the pros and cons of each option here (but can if needed).

DISCLAIMER: The below is my current view of the situation; I know @westey-m has a very different view (the point of this issue is to help hash things out).

  • IVectorStoreRecordCollection should support a search method that accepts any .NET type, representing unvectorized user content (this is similar to IVectorizableTextSearch.VectorizableTextSearchAsync, but not text-only, in order to support multi-modal vectorization). This would be the primary way of performing vector search via the abstraction (as opposed to the more low-level API accepting an embedding).
    • Naming TBD, but my preference would be something like SearchAsync, which the uneducated user would choose as the first/default option - we should guide users to this API.
  • When setting up an IVectorStoreRecordCollection, the user should be able to optionally provide an IEmbeddingGenerator. For example, in DI configuration, an IEmbeddingGenerator service would be automatically picked up by the connectors' Add* methods. Then, when SearchAsync is called, that IEmbeddingGenerator would implicitly be used.
    • Note that an IVectorStoreRecordCollection can have multiple vector properties; as a result, the collection type is not generic over the vector embedding type, and can't be generic over the prevectorized input search type (e.g. string).
    • Since the SearchAsync method should accept any parameter type, and the input type to IEmbeddingGenerator can be anything, .NET typing does not enforce that the SearchAsync argument and the IEmbeddingGenerator input type match: a user can configure an IEmbeddingGenerator accepting a byte[], but call SearchAsync() passing a string; that's considered a user error, and this would cause a runtime exception.
    • Even when the .NET types align, IEmbeddingGenerators are implemented to support specific content types; if a user configures an IEmbeddingGenerator that expects a string base64 encoding of an image, but then passes text to SearchAsync, the resulting embeddings will be unusable (but no exception will be thrown - this is far worse).
    • A CLR type mismatch can also occur on the output side of the IEmbeddingGenerator: if an IEmbeddingGenerator is configured which outputs an embedding type that is unsupported by the vector database, an exception will be thrown. The same is true today when the user passes an embedding directly to MEVD: the user can provide whatever they want, and must know in advance what the database supports.
    • This all places the onus on the user to set things up correctly. It also means that swapping embedding generators isn't something that can be done lightly: if the new generator has a different input type, the user code will break. However, the same is true even if the CLR type stays the same but the expected content changes (different image type, or image vs. sound).
    • If no IEmbeddingGenerator was configured, calling SearchAsync will throw, and only the low-level API accepting embeddings can be used.
  • A collection can have multiple vector properties; one could represent text, another an image - this requires the ability to configure multiple embedding generators.
    • MEVD has VectorStoreRecordDefinition, which represents schema/modeling information on a collection. It specifically contains VectorStoreRecordVectorProperty for defining metadata on vector properties, like the similarity function and index type to be used.
    • We can allow the user to set an IEmbeddingGenerator on VectorStoreRecordVectorProperty, to define the generator used when SearchAsync is invoked over that property.
    • If the generator on a property isn't set, we can fall back to the one configured globally on the collection. In that way, e.g. a DI-registered IEmbeddingGenerator can act as the default for all vector properties, unless overridden. This will likely cover the 99% use case, where a single text embedding model is used for all properties across all collections, while still allowing for overriding on a property-by-property basis.
    • MEVD also allows users to provide metadata (similarity function, index type) via attributes on the .NET properties; it will not be possible to configure a property-specific embedding generator via this mechanism. Again, this is expected to be an advanced/rare scenario. We will also have other things which aren't configurable via the attributes (e.g. anything provider-specific, #10359).
  • The above scheme would require some changes/additions on the IEmbeddingGenerator side; to be usable from SearchAsync, we'd need the generic input type parameter to be on the GenerateAsync method (not on the type as it is currently). We'd also need the call to return a non-generic IEmbedding, since SearchAsync isn't generic over the specific embedding type (the non-generic IEmbedding would then be passed to the vector database).
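The pieces of the proposal above can be sketched together as follows. This is only an illustration under stated assumptions: ISketchEmbeddingGenerator, SketchEmbedding, FloatEmbedding, SketchCollection and its members are hypothetical names invented here, not existing or planned MEAI/MEVD APIs. The sketch shows the input type parameter on the generate method, a non-generic embedding return type, per-property generators falling back to a collection-level default, and the runtime exceptions for missing or mismatched configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Usage: "Description" falls back to the collection default; "Photo" overrides it.
var collection = new SketchCollection(defaultGenerator: new TextGenerator());
collection.PropertyGenerators["Photo"] = new ImageGenerator();

float[] textVector = await collection.SearchAsync("Description", "a red bicycle");
float[] imageVector = await collection.SearchAsync("Photo", new byte[] { 1, 2, 3 });
Console.WriteLine($"{textVector[0]}, {imageVector[0]}");

string? mismatch = null;
try
{
    // CLR type mismatch: the image generator does not accept a string.
    await collection.SearchAsync("Photo", "a red bicycle");
}
catch (NotSupportedException e)
{
    mismatch = e.Message;
}
Console.WriteLine(mismatch);

// Hypothetical non-generic embedding base type.
public abstract record SketchEmbedding;
public sealed record FloatEmbedding(float[] Vector) : SketchEmbedding;

// Hypothetical generator shape: the input type parameter is on the method, not the type.
public interface ISketchEmbeddingGenerator
{
    Task<SketchEmbedding> GenerateAsync<TInput>(TInput input);
}

public sealed class TextGenerator : ISketchEmbeddingGenerator
{
    public Task<SketchEmbedding> GenerateAsync<TInput>(TInput input) =>
        input is string text
            ? Task.FromResult<SketchEmbedding>(new FloatEmbedding(new float[] { text.Length }))
            : throw new NotSupportedException($"TextGenerator cannot vectorize {typeof(TInput).Name}.");
}

public sealed class ImageGenerator : ISketchEmbeddingGenerator
{
    public Task<SketchEmbedding> GenerateAsync<TInput>(TInput input) =>
        input is byte[] bytes
            ? Task.FromResult<SketchEmbedding>(new FloatEmbedding(new float[] { bytes.Length }))
            : throw new NotSupportedException($"ImageGenerator cannot vectorize {typeof(TInput).Name}.");
}

public sealed class SketchCollection
{
    private readonly ISketchEmbeddingGenerator? _defaultGenerator;

    // Per-property overrides, keyed by vector property name.
    public Dictionary<string, ISketchEmbeddingGenerator> PropertyGenerators { get; } = new();

    public SketchCollection(ISketchEmbeddingGenerator? defaultGenerator = null) =>
        _defaultGenerator = defaultGenerator;

    // A real implementation would send the vector to the database; the sketch just returns it.
    public async Task<float[]> SearchAsync<TInput>(string vectorProperty, TInput input)
    {
        ISketchEmbeddingGenerator generator =
            PropertyGenerators.TryGetValue(vectorProperty, out var specific)
                ? specific
                : (_defaultGenerator ?? throw new InvalidOperationException(
                    "No embedding generator configured; use the embedding-based search API."));

        SketchEmbedding embedding = await generator.GenerateAsync(input);

        // Output-side check: this toy "database" only supports float vectors.
        return embedding is FloatEmbedding f
            ? f.Vector
            : throw new NotSupportedException($"Embedding type {embedding.GetType().Name} is not supported.");
    }
}
```

Note that the CLR type check only catches the fail-fast case; a string that holds the wrong kind of content would still pass through TextGenerator without any exception, which is the silent failure mode described above.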

Notes

  • To the best of my understanding, the main objection to this scheme is that since there's no standard representation for e.g. an image, it's impossible to swap an IEmbeddingGenerator and be sure that things still work; one might require the image to be provided as a base64 string, the other as a byte[].
    • MEVD is already - very much by design - a leaky abstraction that does not allow e.g. swapping databases. For example, different databases support different key types (string, long, Guid...); similarly, different databases support different embeddings (float32, bfloat, binary...).
    • Even if we dictate that images are e.g. always provided as a base64 string, this makes no guarantees about e.g. the image format (PNG vs. JPG), or even that the base64 string represents an image as opposed to sound. The CLR type matching doesn't guarantee that the content is what's expected, and garbage output embeddings seem like a much worse outcome than a clear, fail-fast exception about a CLR type mismatch.
    • I'm assuming that some embedding generators might support multiple input formats (PNG, JPG), but that when they do, they probably require some sort of additional metadata (e.g. mime type) to tell them what the image format is. The way in which metadata is provided will be generator-dependent, and so probably cannot be part of the abstraction.
  • As a fundamental and relatively low-level abstraction, I don't believe MEVD should be making distinctions between content types; for example, it should not have APIs that deal specifically with text, images or sound, have a sort of enum identifying content as images or sound, or be aware of MIME types. If we go down this route, we'll end up having to update the abstraction as new content types become relevant for vectorization. IMHO we should keep MEVD content-agnostic.
  • Similarly, I don't believe MEVD should dictate what .NET type is used to encode e.g. an image. There are many things out there that may need to get vectorized (image, video, sound... who knows what else), and the abstraction shouldn't be in the business of knowing them or determining what representation is best. In addition, an encoding that is standard today (e.g. base64 image encoding) may go out of fashion tomorrow.
    • This also affects API usability: I wouldn't want to force users to wrap e.g. a byte[] image in some wrapper we impose.
    • I can also see IEmbeddingGenerators evolving in terms of compositionality; a PDFEmbeddingGenerator implementation could accept a PDF as input, internally extracting text out of the PDF and passing it to a second, wrapped IEmbeddingGenerator. Configuring such an IEmbeddingGenerator would allow users to directly provide a PDF document to SearchAsync (in whatever representation is standard in .NET), and have it be transparently vectorized etc.
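The composition idea in the last point might look roughly like this. It is a sketch only: ISketchGenerator, PdfEmbeddingGenerator, CharCountGenerator and the extractText delegate are hypothetical names, and the "PDF" is faked as UTF-8 bytes so the example stays self-contained (a real implementation would need an actual PDF text extractor).

```csharp
using System;
using System.Text;
using System.Threading.Tasks;

// Usage: the "PDF" is just UTF-8 bytes, and "extraction" is a plain decode.
var generator = new PdfEmbeddingGenerator(
    extractText: bytes => Encoding.UTF8.GetString(bytes),
    inner: new CharCountGenerator());
float[] vector = await generator.GenerateAsync(Encoding.UTF8.GetBytes("hello pdf"));
Console.WriteLine(vector[0]);

// Hypothetical generator shape, simplified to keep the sketch short.
public interface ISketchGenerator<TInput>
{
    Task<float[]> GenerateAsync(TInput input);
}

// Accepts a PDF, extracts its text, and delegates to a wrapped text generator.
public sealed class PdfEmbeddingGenerator : ISketchGenerator<byte[]>
{
    private readonly Func<byte[], string> _extractText;
    private readonly ISketchGenerator<string> _inner;

    public PdfEmbeddingGenerator(Func<byte[], string> extractText, ISketchGenerator<string> inner)
    {
        _extractText = extractText;
        _inner = inner;
    }

    public Task<float[]> GenerateAsync(byte[] pdf) => _inner.GenerateAsync(_extractText(pdf));
}

// Toy text generator: a one-dimensional "embedding" holding the character count.
public sealed class CharCountGenerator : ISketchGenerator<string>
{
    public Task<float[]> GenerateAsync(string text) =>
        Task.FromResult(new float[] { text.Length });
}
```

Because the wrapper exposes its own input type, the vector store abstraction never needs to know that PDFs exist; that knowledge stays entirely inside the generator.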

/cc @markwallace-microsoft @dmytrostruk @stephentoub @adamsitnik

Metadata

Labels

  • .NET (Issue or Pull requests regarding .NET code)
  • Build (Features planned for next Build conference)
  • memory
  • memory connector
  • msft.ext.vectordata (Related to Microsoft.Extensions.VectorData)
  • sk team issue (A tag to denote issues that were created by the Semantic Kernel team, i.e., not the community)

Projects

Status

Sprint: Done
