Skip to content

.Net MEVD: Port the Pinecone connector to use Pinecone.Client #10788

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Mar 4, 2025

This PR ports the Pinecone connector to use Pinecone.Client, adds a lot of integration tests that use local emulator.

It does not implement the new LINQ filter, so it just contributes to #10415.

It also provides a lot of conformance tests, so it contributes to #10194

@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel memory labels Mar 4, 2025
…o pineconeOfficial

# Conflicts:
#	dotnet/SK-dotnet.sln
#	dotnet/src/Connectors/Connectors.Memory.Pinecone/PineconeVectorStoreRecordCollection.cs
- Collection conformance tests: CRD for the collection itself
- Record conformance tests: CRUD for single record
- implement and fix them for Redis, PostgreSQL and SQL Server as well
- VectorizedSearchAsync conformance tests
@adamsitnik adamsitnik marked this pull request as ready for review March 10, 2025 18:40
@adamsitnik adamsitnik requested a review from a team as a code owner March 10, 2025 18:40
@adamsitnik adamsitnik requested review from roji and westey-m March 11, 2025 12:00
@westey-m westey-m deleted the branch microsoft:feature-vector-data-preb1 March 11, 2025 18:26
@westey-m westey-m closed this Mar 11, 2025
Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice to see this @adamsitnik, thanks ❤️

I have various thoughts on how to make the integration tests a bit better, but nothing is really important there. At some point I'll probably just make a pass and reorganize a bit - we shouldn't get delayed by it too much.

this._index ??= await this._pineconeClient.GetIndex(indexName, cancellationToken).ConfigureAwait(false);
if (this._indexClient is null)
{
#if NON_PUBLISH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a different build, shouldn't we just allow uesrs to set the host on PineconeVectorStoreOptions, and use that here? It'd be null by default (so good for production), but some users may want to actually use the emulator themselves with the connector, and would set PineconeVectorStoreOptions.Host for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the best solution would be to fix it on the Pinecone side. Let me fork their code and get back to you.

{
await this.CreateCollectionAsync(cancellationToken).ConfigureAwait(false);
}
catch (VectorStoreOperationException ex) when (ex.InnerException is PineconeApiException apiEx && apiEx.InnerException is ConflictError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're not a fan of pattern matching but:

Suggested change
catch (VectorStoreOperationException ex) when (ex.InnerException is PineconeApiException apiEx && apiEx.InnerException is ConflictError)
catch (VectorStoreOperationException ex) when (ex.InnerException is PineconeApiException { InnerException: ConflictError })

// Map the results.
Sdk.QueryRequest request = new()
{
TopK = (uint)(options.Top + options.Skip),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? Should it really be the sum of Top and Skip?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API does not provide a possibility to skip, so we ask for Top + Skip many records, and then manually skip Skip-many records. I know it's not optimal, but the API does not expose anything for skipping.

public record QueryRequest
{
    [JsonPropertyName("namespace")]
    public string? Namespace { get; set; }

    [JsonPropertyName("topK")]
    public required uint TopK { get; set; }

    [JsonPropertyName("filter")]
    public Metadata? Filter { get; set; }

    [JsonPropertyName("includeValues")]
    public bool? IncludeValues { get; set; }

    [JsonPropertyName("includeMetadata")]
    public bool? IncludeMetadata { get; set; }

    [JsonPropertyName("queries")]
    public IEnumerable<QueryVector>? Queries { get; set; }

    [JsonPropertyName("vector")]
    public ReadOnlyMemory<float>? Vector { get; set; }

    [JsonPropertyName("sparseVector")]
    public SparseValues? SparseVector { get; set; }

    [JsonPropertyName("id")]
    public string? Id { get; set; }

Source: https://github.com/pinecone-io/pinecone-dotnet-client/blob/f2f39c9b87ac7c1f939a0c03dfa18ba60a66ad0c/src/Pinecone/Index/Requests/QueryRequest.cs#L9

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this isn't supported, then it might be better to simply throw if Skip is non-zero... Skip and Top are typically meant for pagination, where the whole point is to not bring back a ton of data from the database, but to read a window of result each time. This sort of client-side implementation means that as the window moves forward, we each time continue to fetch all the records up to the window, which can be a lot of data... So this introduces an undiscoverable performance pit of failure...

Given that users can always just implement this themselves (just do Skip + TopK themselves and pass the result via Top), I'd err on forcing users to do this extra explicitly, so they're 100% aware of the big fetch (and the code expresses it). I get that users would typically rather have everything "just work", but at the end of the day our main job here is to expose database capabilities, rather than attempt to fill in gaps in the database... FWIW I suspect that Pinecone will add a Skip feature sooner or later - I saw many databases go through this phase where they didn't have it and then added it.

{
await this.RunOperationAsync("FT.DROPINDEX", () => this._database.FT().DropIndexAsync(this._collectionName)).ConfigureAwait(false);
}
catch (VectorStoreOperationException ex) when (ex.InnerException is RedisServerException innerEx && innerEx.Message.Contains("Unknown Index name"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming that there's no better way to identify this than shady string contains?

@@ -66,4 +66,65 @@ await this.ExecuteAsync(async collection =>
Assert.Equal("keys", ex.ParamName);
});
}

[ConditionalFact]
public Task CanInsertUpdateAndDelete_WithVectors()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can align naming conventions later, but I've gone with snake_case for the tests (see the filter tests for example); with the long sentences that test methods sometimes get to, that seems good for readability.

Note that this differs from the convention in this repo, but the vector data tests will be moving out to Microsoft.Extensions anyway later.

@@ -14,13 +16,16 @@ public abstract class ConformanceTestsBase<TKey, TRecord>(VectorStoreFixture fix

protected virtual VectorStoreRecordDefinition? GetRecordDefinition() => null;

protected async Task ExecuteAsync(Func<IVectorStoreRecordCollection<TKey, TRecord>, Task> test)
protected async Task ExecuteAsync(Func<IVectorStoreRecordCollection<TKey, TRecord>, Task> test, bool createCollection = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this creates and drops a collection in many cases where that's not actually required by the test, e.g. everything in BatchConformanceTests. Unless there's a good reason, we should try to only create/drop collections when that's necessary - it's a slow process, and in some data stores there are some asynchronicity issues (where the collection doesn't yet appear although the call returned etc.). I know it might cause a bit more headaches around test isolation (leftover data inserted from one test may be visible by another), but if we're careful that should be doable.

Basically all tests which just work on data (search, CRUD...) should ideally work on one collection that gets created once at the beginning, with a single schema and test data - that makes things very fast and easy (I wrote VectorStoreCollectionFixture to provide common infra to make that simple).

What do you think @adamsitnik?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far I was lucky to work with DBs where creating the collections is very fast, so I've not run into the problem you are trying to solve.

I know it might cause a bit more headaches around test isolation (leftover data inserted from one test may be visible by another)

We could mitigate that by generating more random Keys (as of now all of them start from seed equal 1 and just increment it).

There are few things about VectorStoreCollectionFixture that I don't like:

  • it requires me to create a derived type for every test class (I know that I can reuse the fixture types)
  • when looking at test, I don't see what records exactly are being used (I need to go to VectorStoreCollectionFixture.BuildTestData to check it)
  • it pre-inserts some records on my behalf, which is something I would like to avoid when testing CRUD

I'll take a look and see what I can do about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it requires me to create a derived type for every test class (I know that I can reuse the fixture types)

You mean the concrete fixture derived type? Don't we need some derived fixture type in any case?

it pre-inserts some records on my behalf, which is something I would like to avoid when testing CRUD

Oh this should be completely optional, only if you need it (if it's not completely optional, we should change things so that it is).

: ConformanceTestsBase<TKey, SimpleModel<TKey>>(fixture) where TKey : notnull
{
[ConditionalFact]
public async Task ReadingAndDeletingNonExistingRecordDoesNotThrow()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a bit more separate where each test exercises one thing, if possible (this one exercises both Get and Delete, the one below even more things).


namespace VectorDataSpecificationTests.CRUD;

public class CollectionConformanceTests<TKey>(VectorStoreFixture fixture)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: collection management generally isn't considered CRUD (the latter usually refers to operations on the rows/records inside).

github-merge-queue bot pushed a commit that referenced this pull request Mar 20, 2025
It's a resurrected #10788 built on top of #10944 which I hope will get
merged first.

fixes #10415

---------

Co-authored-by: westey <164392973+westey-m@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernel Issues or pull requests impacting the core kernel memory .NET Issue or Pull requests regarding .NET code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants