-
Notifications
You must be signed in to change notification settings - Fork 4k
.Net MEVD: Port the Pinecone connector to use Pinecone.Client #10788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
.Net MEVD: Port the Pinecone connector to use Pinecone.Client #10788
Conversation
…cal emulator behind the scenes
…nc and DeleteCollectionAsync and Upsert*Async
…o pineconeOfficial # Conflicts: # dotnet/SK-dotnet.sln # dotnet/src/Connectors/Connectors.Memory.Pinecone/PineconeVectorStoreRecordCollection.cs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice to see this @adamsitnik, thanks ❤️
I have various thoughts on how to make the integration tests a bit better, but nothing is really important there. At some point I'll probably just make a pass and reorganize a bit - we shouldn't get delayed by it too much.
this._index ??= await this._pineconeClient.GetIndex(indexName, cancellationToken).ConfigureAwait(false); | ||
if (this._indexClient is null) | ||
{ | ||
#if NON_PUBLISH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a different build, shouldn't we just allow uesrs to set the host on PineconeVectorStoreOptions, and use that here? It'd be null by default (so good for production), but some users may want to actually use the emulator themselves with the connector, and would set PineconeVectorStoreOptions.Host for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the best solution would be to fix it on the Pinecone side. Let me fork their code and get back to you.
{ | ||
await this.CreateCollectionAsync(cancellationToken).ConfigureAwait(false); | ||
} | ||
catch (VectorStoreOperationException ex) when (ex.InnerException is PineconeApiException apiEx && apiEx.InnerException is ConflictError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you're not a fan of pattern matching but:
catch (VectorStoreOperationException ex) when (ex.InnerException is PineconeApiException apiEx && apiEx.InnerException is ConflictError) | |
catch (VectorStoreOperationException ex) when (ex.InnerException is PineconeApiException { InnerException: ConflictError }) |
// Map the results. | ||
Sdk.QueryRequest request = new() | ||
{ | ||
TopK = (uint)(options.Top + options.Skip), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? Should it really be the sum of Top and Skip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API does not provide a possibility to skip, so we ask for Top + Skip
many records, and then manually skip Skip
-many records. I know it's not optimal, but the API does not expose anything for skipping.
public record QueryRequest
{
[JsonPropertyName("namespace")]
public string? Namespace { get; set; }
[JsonPropertyName("topK")]
public required uint TopK { get; set; }
[JsonPropertyName("filter")]
public Metadata? Filter { get; set; }
[JsonPropertyName("includeValues")]
public bool? IncludeValues { get; set; }
[JsonPropertyName("includeMetadata")]
public bool? IncludeMetadata { get; set; }
[JsonPropertyName("queries")]
public IEnumerable<QueryVector>? Queries { get; set; }
[JsonPropertyName("vector")]
public ReadOnlyMemory<float>? Vector { get; set; }
[JsonPropertyName("sparseVector")]
public SparseValues? SparseVector { get; set; }
[JsonPropertyName("id")]
public string? Id { get; set; }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this isn't supported, then it might be better to simply throw if Skip is non-zero... Skip and Top are typically meant for pagination, where the whole point is to not bring back a ton of data from the database, but to read a window of result each time. This sort of client-side implementation means that as the window moves forward, we each time continue to fetch all the records up to the window, which can be a lot of data... So this introduces an undiscoverable performance pit of failure...
Given that users can always just implement this themselves (just do Skip + TopK themselves and pass the result via Top), I'd err on forcing users to do this extra explicitly, so they're 100% aware of the big fetch (and the code expresses it). I get that users would typically rather have everything "just work", but at the end of the day our main job here is to expose database capabilities, rather than attempt to fill in gaps in the database... FWIW I suspect that Pinecone will add a Skip feature sooner or later - I saw many databases go through this phase where they didn't have it and then added it.
{ | ||
await this.RunOperationAsync("FT.DROPINDEX", () => this._database.FT().DropIndexAsync(this._collectionName)).ConfigureAwait(false); | ||
} | ||
catch (VectorStoreOperationException ex) when (ex.InnerException is RedisServerException innerEx && innerEx.Message.Contains("Unknown Index name")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming that there's no better way to identify this than shady string contains?
@@ -66,4 +66,65 @@ await this.ExecuteAsync(async collection => | |||
Assert.Equal("keys", ex.ParamName); | |||
}); | |||
} | |||
|
|||
[ConditionalFact] | |||
public Task CanInsertUpdateAndDelete_WithVectors() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can align naming conventions later, but I've gone with snake_case for the tests (see the filter tests for example); with the long sentences that test methods sometimes get to, that seems good for readability.
Note that this differs from the convention in this repo, but the vector data tests will be moving out to Microsoft.Extensions anyway later.
@@ -14,13 +16,16 @@ public abstract class ConformanceTestsBase<TKey, TRecord>(VectorStoreFixture fix | |||
|
|||
protected virtual VectorStoreRecordDefinition? GetRecordDefinition() => null; | |||
|
|||
protected async Task ExecuteAsync(Func<IVectorStoreRecordCollection<TKey, TRecord>, Task> test) | |||
protected async Task ExecuteAsync(Func<IVectorStoreRecordCollection<TKey, TRecord>, Task> test, bool createCollection = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this creates and drops a collection in many cases where that's not actually required by the test, e.g. everything in BatchConformanceTests. Unless there's a good reason, we should try to only create/drop collections when that's necessary - it's a slow process, and in some data stores there are some asynchronicity issues (where the collection doesn't yet appear although the call returned etc.). I know it might cause a bit more headaches around test isolation (leftover data inserted from one test may be visible by another), but if we're careful that should be doable.
Basically all tests which just work on data (search, CRUD...) should ideally work on one collection that gets created once at the beginning, with a single schema and test data - that makes things very fast and easy (I wrote VectorStoreCollectionFixture to provide common infra to make that simple).
What do you think @adamsitnik?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far I was lucky to work with DBs where creating the collections is very fast, so I've not run into the problem you are trying to solve.
I know it might cause a bit more headaches around test isolation (leftover data inserted from one test may be visible by another)
We could mitigate that by generating more random Keys (as of now all of them start from seed equal 1
and just increment it).
There are few things about VectorStoreCollectionFixture
that I don't like:
- it requires me to create a derived type for every test class (I know that I can reuse the fixture types)
- when looking at test, I don't see what records exactly are being used (I need to go to
VectorStoreCollectionFixture.BuildTestData
to check it) - it pre-inserts some records on my behalf, which is something I would like to avoid when testing CRUD
I'll take a look and see what I can do about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it requires me to create a derived type for every test class (I know that I can reuse the fixture types)
You mean the concrete fixture derived type? Don't we need some derived fixture type in any case?
it pre-inserts some records on my behalf, which is something I would like to avoid when testing CRUD
Oh this should be completely optional, only if you need it (if it's not completely optional, we should change things so that it is).
: ConformanceTestsBase<TKey, SimpleModel<TKey>>(fixture) where TKey : notnull | ||
{ | ||
[ConditionalFact] | ||
public async Task ReadingAndDeletingNonExistingRecordDoesNotThrow() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer a bit more separate where each test exercises one thing, if possible (this one exercises both Get and Delete, the one below even more things).
|
||
namespace VectorDataSpecificationTests.CRUD; | ||
|
||
public class CollectionConformanceTests<TKey>(VectorStoreFixture fixture) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: collection management generally isn't considered CRUD (the latter usually refers to operations on the rows/records inside).
This PR ports the Pinecone connector to use Pinecone.Client, adds a lot of integration tests that use local emulator.
It does not implement the new LINQ filter, so it just contributes to #10415.
It also provides a lot of conformance tests, so it contributes to #10194