Commit 0eaf7d0

michaelsembwever authored and tzolov committed
Cassandra Vector Store initial impl follow up
- add concurrency to store.add(..) (bc embeddingClient is slow)
- CassandraVectorStoreAutoConfiguration uses CassandraAutoConfiguration
- driver profiles for production stability+performance,
- small cleanups and naming fixes,
- main doc tidy-up
- astradb compatibility (protocol V4)
- don't create embeddings again for documents that already have them similar to #413
1 parent f698902 commit 0eaf7d0

13 files changed

+315
-271
lines changed


spring-ai-docs/src/main/antora/modules/ROOT/pages/api/vectordbs/apache-cassandra.adoc

Lines changed: 25 additions & 29 deletions
@@ -4,9 +4,9 @@ This section walks you through setting up `CassandraVectorStore` to store docume
 
 == What is Apache Cassandra ?
 
-link:https://cassandra.apache.org[Apache Cassandra] is a true open source distributed database reknown for scalability and high availability without compromising performance.
+link:https://cassandra.apache.org[Apache Cassandra®] is a true open source distributed database reknown for linear scalability, proven fault-tolerance and low latency, making it the perfect platform for mission-critical transactional data.
 
-Linear scalability, proven fault-tolerance and low latency on commodity hardware makes it the perfect platform for mission-critical data. Its Vector Similarity Search (VSS) is based on the JVector library that ensures best-in-class performance and relevancy.
+Its Vector Similarity Search (VSS) is based on the JVector library that ensures best-in-class performance and relevancy.
 
 A vector search in Apache Cassandra is done as simply as:
 ```
@@ -15,9 +15,13 @@ SELECT content FROM table ORDER BY content_vector ANN OF query_embedding ;
 
 More docs on this can be read https://cassandra.apache.org/doc/latest/cassandra/getting-started/vector-search-quickstart.html[here].
 
-The Spring AI Cassandra Vector Store is designed to work for both brand new RAG applications as well as being able to be retrofitted on top of existing data and tables. This vector store may also equally be used for non-RAG non_AI use-cases, e.g. semantic searcing in an existing database. The Vector Store will automatically create, or enhance, the schema as needed according to its configuration. If you don't want the schema modifications, configure the store with `disallowSchemaChanges`.
+This Spring AI Vector Store is designed to work for both brand new RAG applications as well as being able to be retrofitted on top of existing data and tables.
 
-== What is JVector Vector Search ?
+The store can also be used for non-RAG use-cases in an existing database, e.g. semantic searches, geo-proximity searches, etc.
+
+The store will automatically create, or enhance, the schema as needed according to its configuration. If you don't want the schema modifications, configure the store with `disallowSchemaChanges`.
+
+== What is JVector ?
 
 link:https://github.com/jbellis/jvector[JVector] is a pure Java embedded vector search engine.
 
@@ -70,13 +74,6 @@ Add these dependencies to your project:
 
 TIP: Refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section to add the Spring AI BOM to your build file.
 
-* If for example you want to use the OpenAI modules, remember to provide your OpenAI API Key. Set it as an environment variable like so:
-
-[source,bash]
-----
-export SPRING_AI_OPENAI_API_KEY='Your_OpenAI_API_Key'
-----
-
 
 == Usage
 
@@ -93,21 +90,14 @@ public VectorStore vectorStore(EmbeddingClient embeddingClient) {
 }
 ----
 
-NOTE: It is more convenient and preferred to create the `CassandraVectorStore` as a Bean.
-But if you decide you can create it manually.
-
 [NOTE]
 ====
-The default configuration connects to Cassandra at localhost:9042 and will automatically create the default schema at `springframework_ai_vector.springframework_ai_vector_store`.
-
-Please see `CassandraVectorStoreConfig.Builder` for all the configuration options.
+The default configuration connects to Cassandra at `localhost:9042` and will automatically create a default schema in keyspace `springframework`, table `ai_vector_store`.
 ====
 
 [NOTE]
 ====
-The Cassandra Java Driver is easiest configured via the `application.conf` file on the classpath.
-
-More info can be found link: https://github.com/apache/cassandra-java-driver/tree/4.x/manual/core/configuration[here].
+The Cassandra Java Driver is easiest configured via an `application.conf` file on the classpath. More info https://github.com/apache/cassandra-java-driver/tree/4.x/manual/core/configuration[here].
 ====
 
 Then in your main code, create some documents:
@@ -148,7 +138,7 @@ List<Document> results = vectorStore.similaritySearch(
 
 === Metadata filtering
 
-You can leverage the generic, portable link:https://docs.spring.io/spring-ai/reference/api/vectordbs.html#_metadata_filters[metadata filters] with the CassandraVectorStore as well. Metadata fields must be configured in `CassandraVectorStoreConfig`.
+You can leverage the generic, portable link:https://docs.spring.io/spring-ai/reference/api/vectordbs.html#_metadata_filters[metadata filters] with the CassandraVectorStore as well. Metadata columns must be configured in `CassandraVectorStoreConfig`.
 
 For example, you can use either the text expression language:
 
@@ -173,7 +163,9 @@ vectorStore.similaritySearch(
 
 The portable filter expressions get automatically converted into link:https://cassandra.apache.org/doc/latest/cassandra/developing/cql/index.html[CQL queries].
 
-Metadata fields to be searchable need to be either primary key columns or SAI indexed. To do this configure the metadata field with the `SchemaColumnTags.INDEXED`.
+For metadata columns to be searchable they must be either primary keys or SAI indexed. To make non-primary-key columns indexed configure the metadata column with the `SchemaColumnTags.INDEXED`.
+
+
 
 
 == Advanced Example: Vector Store ontop full Wikipedia dataset
@@ -187,7 +179,8 @@ Create the schema in the Cassandra database first:
 
 [source,bash]
 ----
-wget https://raw.githubusercontent.com/datastax-labs/colbert-wikipedia-data/main/schema.cql -O colbert-wikipedia-schema.cql
+wget https://s.apache.org/colbert-wikipedia-schema-cql -O colbert-wikipedia-schema.cql
+
 cqlsh -f colbert-wikipedia-schema.cql
 ----
 
@@ -212,14 +205,14 @@ public CassandraVectorStore store(EmbeddingClient embeddingClient) {
     .withTableName("articles")
     .withPartitionKeys(partitionColumns)
     .withClusteringKeys(clusteringColumns)
-    .withContentFieldName("body")
-    .withEmbeddingFieldName("all_minilm_l6_v2_embedding")
+    .withContentColumnName("body")
+    .withEmbeddingColumndName("all_minilm_l6_v2_embedding")
     .withIndexName("all_minilm_l6_v2_ann")
     .disallowSchemaChanges()
-    .addMetadataFields(extraColumns)
+    .addMetadataColumns(extraColumns)
     .withPrimaryKeyTranslator((List<Object> primaryKeys) -> {
-        // the deliminator used to join fields together into the document's id
-        // is arbitary, here "§¶" is used
+        // the deliminator used to join fields together into the document's id is arbitary
+        // here "§¶" is used
         if (primaryKeys.isEmpty()) {
             return "test§¶0";
         }
@@ -243,8 +236,11 @@ public EmbeddingClient embeddingClient() {
 }
 ----
 
+
+== Complete wikipedia dataset
+
 And, if you would like to load the full wikipedia dataset.
-First download the `simplewiki-sstable.tar` from this link https://drive.google.com/file/d/1CcMMsj8jTKRVGep4A7hmOSvaPepsaKYP/view?usp=share_link . This will take a while, the file is tens of GBs.
+First download the `simplewiki-sstable.tar` from this link https://s.apache.org/simplewiki-sstable-tar . This will take a while, the file is tens of GBs.
 
 [source,bash]
 ----

spring-ai-spring-boot-autoconfigure/src/main/java/org/springframework/ai/autoconfigure/vectorstore/cassandra/CassandraConnectionDetails.java

Lines changed: 0 additions & 37 deletions
This file was deleted.

spring-ai-spring-boot-autoconfigure/src/main/java/org/springframework/ai/autoconfigure/vectorstore/cassandra/CassandraVectorStoreAutoConfiguration.java

Lines changed: 27 additions & 65 deletions
@@ -15,16 +15,17 @@
  */
 package org.springframework.ai.autoconfigure.vectorstore.cassandra;
 
-import java.net.InetSocketAddress;
-import java.util.Arrays;
-import java.util.List;
+import java.time.Duration;
 
-import com.google.common.base.Preconditions;
+import com.datastax.oss.driver.api.core.CqlSession;
+import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
 
 import org.springframework.ai.embedding.EmbeddingClient;
 import org.springframework.ai.vectorstore.CassandraVectorStore;
 import org.springframework.ai.vectorstore.CassandraVectorStoreConfig;
 import org.springframework.boot.autoconfigure.AutoConfiguration;
+import org.springframework.boot.autoconfigure.cassandra.CassandraAutoConfiguration;
+import org.springframework.boot.autoconfigure.cassandra.DriverConfigLoaderBuilderCustomizer;
 import org.springframework.boot.autoconfigure.condition.ConditionalOnClass;
 import org.springframework.boot.autoconfigure.condition.ConditionalOnMissingBean;
 import org.springframework.boot.context.properties.EnableConfigurationProperties;
@@ -34,37 +35,24 @@
  * @author Mick Semb Wever
  * @since 1.0.0
  */
-@AutoConfiguration
-@ConditionalOnClass({ CassandraVectorStore.class, EmbeddingClient.class })
+@AutoConfiguration(after = CassandraAutoConfiguration.class)
+@ConditionalOnClass({ CassandraVectorStore.class, EmbeddingClient.class, CqlSession.class })
 @EnableConfigurationProperties(CassandraVectorStoreProperties.class)
 public class CassandraVectorStoreAutoConfiguration {
 
-    @Bean
-    @ConditionalOnMissingBean(CassandraConnectionDetails.class)
-    public PropertiesCassandraConnectionDetails cassandraConnectionDetails(CassandraVectorStoreProperties properties) {
-        return new PropertiesCassandraConnectionDetails(properties);
-    }
-
     @Bean
     @ConditionalOnMissingBean
     public CassandraVectorStore vectorStore(EmbeddingClient embeddingClient, CassandraVectorStoreProperties properties,
-            CassandraConnectionDetails cassandraConnectionDetails) {
+            CqlSession cqlSession) {
 
-        var builder = CassandraVectorStoreConfig.builder();
-        if (cassandraConnectionDetails.hasCassandraContactPoints()) {
-            for (InetSocketAddress contactPoint : cassandraConnectionDetails.getCassandraContactPoints()) {
-                builder = builder.addContactPoint(contactPoint);
-            }
-        }
-        if (cassandraConnectionDetails.hasCassandraLocalDatacenter()) {
-            builder = builder.withLocalDatacenter(cassandraConnectionDetails.getCassandraLocalDatacenter());
-        }
+        var builder = CassandraVectorStoreConfig.builder().withCqlSession(cqlSession);
 
         builder = builder.withKeyspaceName(properties.getKeyspace())
             .withTableName(properties.getTable())
-            .withContentColumnName(properties.getContentFieldName())
-            .withEmbeddingColumnName(properties.getEmbeddingFieldName())
-            .withIndexName(properties.getIndexName());
+            .withContentColumnName(properties.getContentColumnName())
+            .withEmbeddingColumnName(properties.getEmbeddingColumnName())
+            .withIndexName(properties.getIndexName())
+            .withFixedThreadPoolExecutorSize(properties.getFixedThreadPoolExecutorSize());
 
         if (properties.getDisallowSchemaCreation()) {
             builder = builder.disallowSchemaChanges();
@@ -73,46 +61,20 @@ public CassandraVectorStore vectorStore(EmbeddingClient embeddingClient, Cassand
         return new CassandraVectorStore(builder.build(), embeddingClient);
     }
 
-    private static class PropertiesCassandraConnectionDetails implements CassandraConnectionDetails {
-
-        private final CassandraVectorStoreProperties properties;
-
-        public PropertiesCassandraConnectionDetails(CassandraVectorStoreProperties properties) {
-            this.properties = properties;
-        }
-
-        private String[] getCassandraContactPointHosts() {
-            return this.properties.getCassandraContactPointHosts().split("(,| )");
-        }
-
-        @Override
-        public List<InetSocketAddress> getCassandraContactPoints() {
-
-            Preconditions.checkState(hasCassandraContactPoints(), "cassandraContactPointHosts has not been set");
-            final int port = this.properties.getCassandraContactPointPort();
-
-            return Arrays.asList(getCassandraContactPointHosts())
-                .stream()
-                .map((host) -> InetSocketAddress.createUnresolved(host, port))
-                .toList();
-        }
-
-        @Override
-        public String getCassandraLocalDatacenter() {
-            Preconditions.checkState(hasCassandraLocalDatacenter(), "cassandraLocalDatacenter has not been set");
-            return this.properties.getCassandraLocalDatacenter();
-        }
-
-        @Override
-        public boolean hasCassandraContactPoints() {
-            return null != this.properties.getCassandraContactPointHosts();
-        }
-
-        @Override
-        public boolean hasCassandraLocalDatacenter() {
-            return null != this.properties.getCassandraLocalDatacenter();
-        }
-
+    @Bean
+    public DriverConfigLoaderBuilderCustomizer driverConfigLoaderBuilderCustomizer() {
+        // this replaces spring-ai-cassandra-*.jar!application.conf
+        // as spring-boot autoconfigure will not resolve the default driver configs
+        return (builder) -> builder.startProfile(CassandraVectorStore.DRIVER_PROFILE_UPDATES)
+            .withString(DefaultDriverOption.REQUEST_CONSISTENCY, "LOCAL_QUORUM")
+            .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(1))
+            .withBoolean(DefaultDriverOption.REQUEST_DEFAULT_IDEMPOTENCE, true)
+            .endProfile()
+            .startProfile(CassandraVectorStore.DRIVER_PROFILE_SEARCH)
+            .withString(DefaultDriverOption.REQUEST_CONSISTENCY, "LOCAL_ONE")
+            .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(10))
+            .withBoolean(DefaultDriverOption.REQUEST_DEFAULT_IDEMPOTENCE, true)
+            .endProfile();
     }
 
 }
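
The commit message says `store.add(..)` now runs concurrently because the embedding client is slow, and the auto-configuration passes a `fixedThreadPoolExecutorSize` to the builder. The changed `CassandraVectorStore` code is not part of this page, so the following is only a minimal sketch of the general technique (fanning slow calls out over a fixed-size pool while keeping results in input order). The class `ConcurrentAddSketch` and the method `mapConcurrently` are hypothetical names, not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only; not the implementation from this commit.
public class ConcurrentAddSketch {

    // Applies slowCall to every item on a fixed-size thread pool and
    // returns the results in the same order as the input list.
    public static <T, R> List<R> mapConcurrently(List<T> items, Function<T, R> slowCall, int poolSize) {
        if (poolSize <= 0) {
            throw new IllegalArgumentException("poolSize must be positive");
        }
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit one task per item; at most poolSize run at the same time.
            List<Future<R>> futures = new ArrayList<>();
            for (T item : items) {
                Callable<R> task = () -> slowCall.apply(item);
                futures.add(executor.submit(task));
            }
            // Collect in submission order, blocking until each task finishes.
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // String::length stands in for a slow embedding call.
        List<Integer> lengths = mapConcurrently(List.of("a", "bb", "ccc"), String::length, 2);
        System.out.println(lengths);
    }
}
```

In this sketch the pool is created per call for simplicity; a store that adds documents repeatedly would more plausibly keep one long-lived executor sized by the configured value.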

spring-ai-spring-boot-autoconfigure/src/main/java/org/springframework/ai/autoconfigure/vectorstore/cassandra/CassandraVectorStoreProperties.java

Lines changed: 18 additions & 36 deletions
@@ -15,6 +15,8 @@
  */
 package org.springframework.ai.autoconfigure.vectorstore.cassandra;
 
+import com.google.api.client.util.Preconditions;
+
 import org.springframework.ai.vectorstore.CassandraVectorStoreConfig;
 import org.springframework.boot.context.properties.ConfigurationProperties;
 
@@ -27,12 +29,6 @@ public class CassandraVectorStoreProperties {
 
     public static final String CONFIG_PREFIX = "spring.ai.vectorstore.cassandra";
 
-    private String cassandraContactPointHosts = null;
-
-    private int cassandraContactPointPort = 9042;
-
-    private String cassandraLocalDatacenter = null;
-
     private String keyspace = CassandraVectorStoreConfig.DEFAULT_KEYSPACE_NAME;
 
     private String table = CassandraVectorStoreConfig.DEFAULT_TABLE_NAME;
@@ -45,30 +41,7 @@ public class CassandraVectorStoreProperties {
 
     private boolean disallowSchemaChanges = false;
 
-    public String getCassandraContactPointHosts() {
-        return this.cassandraContactPointHosts;
-    }
-
-    /** comma or space separated */
-    public void setCassandraContactPointHosts(String cassandraContactPointHosts) {
-        this.cassandraContactPointHosts = cassandraContactPointHosts;
-    }
-
-    public int getCassandraContactPointPort() {
-        return this.cassandraContactPointPort;
-    }
-
-    public void setCassandraContactPointPort(int cassandraContactPointPort) {
-        this.cassandraContactPointPort = cassandraContactPointPort;
-    }
-
-    public String getCassandraLocalDatacenter() {
-        return this.cassandraLocalDatacenter;
-    }
-
-    public void setCassandraLocalDatacenter(String cassandraLocalDatacenter) {
-        this.cassandraLocalDatacenter = cassandraLocalDatacenter;
-    }
+    private int fixedThreadPoolExecutorSize = CassandraVectorStoreConfig.DEFAULT_ADD_CONCURRENCY;
 
     public String getKeyspace() {
         return this.keyspace;
@@ -94,20 +67,20 @@ public void setIndexName(String indexName) {
         this.indexName = indexName;
     }
 
-    public String getContentFieldName() {
+    public String getContentColumnName() {
         return this.contentColumnName;
     }
 
-    public void setContentFieldName(String contentFieldName) {
-        this.contentColumnName = contentFieldName;
+    public void setContentColumnName(String contentColumnName) {
+        this.contentColumnName = contentColumnName;
     }
 
-    public String getEmbeddingFieldName() {
+    public String getEmbeddingColumnName() {
         return this.embeddingColumnName;
     }
 
-    public void setEmbeddingFieldName(String embeddingFieldName) {
-        this.embeddingColumnName = embeddingFieldName;
+    public void setEmbeddingColumnName(String embeddingColumnName) {
+        this.embeddingColumnName = embeddingColumnName;
     }
 
     public Boolean getDisallowSchemaCreation() {
@@ -118,4 +91,13 @@ public void setDisallowSchemaCreation(boolean disallowSchemaCreation) {
         this.disallowSchemaChanges = disallowSchemaCreation;
     }
 
+    public int getFixedThreadPoolExecutorSize() {
+        return this.fixedThreadPoolExecutorSize;
+    }
+
+    public void setFixedThreadPoolExecutorSize(int fixedThreadPoolExecutorSize) {
+        Preconditions.checkArgument(0 < fixedThreadPoolExecutorSize);
+        this.fixedThreadPoolExecutorSize = fixedThreadPoolExecutorSize;
+    }
+
 }
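
With the contact-point and datacenter properties removed from `CassandraVectorStoreProperties`, the connection now comes from the `CqlSession` that Spring Boot's `CassandraAutoConfiguration` provides. A configuration after this commit might look roughly like the fragment below. This is a sketch under assumptions: the `spring.cassandra.*` keys are the standard Spring Boot 3 Cassandra properties, the `spring.ai.vectorstore.cassandra.*` keys are derived from the setters shown in this diff via relaxed binding (assuming the class is bound with `CONFIG_PREFIX`), and every value except the documented default keyspace and table is a placeholder.

```properties
# Connection: handled by Spring Boot's Cassandra auto-configuration (assumed Spring Boot 3 keys)
spring.cassandra.contact-points=localhost:9042
spring.cassandra.local-datacenter=datacenter1

# Vector store: keys inferred from the setters in CassandraVectorStoreProperties
spring.ai.vectorstore.cassandra.keyspace=springframework
spring.ai.vectorstore.cassandra.table=ai_vector_store
# placeholder value; must be greater than 0 per the Preconditions check in the setter
spring.ai.vectorstore.cassandra.fixed-thread-pool-executor-size=16
# bound through setDisallowSchemaCreation(..)
spring.ai.vectorstore.cassandra.disallow-schema-creation=false
```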
