Commit 656d238

michaelsembwever authored and tzolov committed
Implement Apache Cassandra vector store
The CassandraVectorStore manages and queries vector data in an Apache Cassandra database. It supports adding and deleting documents and performing similarity searches on them. The store uses CQL to index and search vector data, and allows custom metadata fields to be stored alongside the vector and content data.

This class requires a CassandraVectorStoreConfig configuration object for initialization, which includes settings like connection details, index name, field names, etc. It also requires an EmbeddingClient to convert documents into embeddings before storing them.

A schema matching the configuration is automatically created if it doesn't exist. Missing columns and indexes in existing tables are also automatically created. Disable this with `disallowSchemaChanges`.

This class is designed to work with brand new tables that it creates for you, or on top of existing Cassandra tables. The latter is appropriate when you want to keep data in place, create embeddings next to it, and perform vector similarity searches in situ.

Instances of this class are not dynamic against server-side schema changes. If you change the schema server-side, you need a new CassandraVectorStore instance.

- Add auto-configuration with tests.
- Reformat code style.
- Change field terminology to column (as appropriate for Cassandra and CQL).
- Add doc page with an advanced example.
- Add the dependencies to the Spring AI BOM; add to `AutoConfiguration.imports`.
- Add @since annotation.
- Fix javadoc issue.
- Streamline the adoc content and layout.
1 parent fef1a42 commit 656d238

File tree

29 files changed

+3225
-12
lines changed

pom.xml

Lines changed: 3 additions & 0 deletions
@@ -34,6 +34,7 @@
3434
<module>spring-ai-spring-boot-starters/spring-ai-starter-azure-openai</module>
3535
<module>spring-ai-spring-boot-starters/spring-ai-starter-ollama</module>
3636
<module>spring-ai-spring-boot-starters/spring-ai-starter-transformers</module>
37+
<module>spring-ai-spring-boot-starters/spring-ai-starter-cassandra</module>
3738
<module>spring-ai-spring-boot-starters/spring-ai-starter-chroma-store</module>
3839
<module>spring-ai-spring-boot-starters/spring-ai-starter-milvus-store</module>
3940
<module>spring-ai-spring-boot-starters/spring-ai-starter-pgvector-store</module>
@@ -47,6 +48,7 @@
4748
<module>spring-ai-spring-boot-starters/spring-ai-starter-qdrant-store</module>
4849
<module>spring-ai-spring-boot-starters/spring-ai-starter-postgresml-embedding</module>
4950
<module>spring-ai-docs</module>
51+
<module>vector-stores/spring-ai-cassandra</module>
5052
<module>vector-stores/spring-ai-pgvector-store</module>
5153
<module>vector-stores/spring-ai-hanadb-store</module>
5254
<module>vector-stores/spring-ai-milvus-store</module>
@@ -138,6 +140,7 @@
138140
<protobuf-java.version>3.25.2</protobuf-java.version>
139141

140142
<!-- readers/writer/stores dependencies-->
143+
<cassandra.java-driver.version>4.18.0</cassandra.java-driver.version>
141144
<pdfbox.version>3.0.1</pdfbox.version>
142145
<pgvector.version>0.1.4</pgvector.version>
143146
<sap.hanadb.version>2.20.11</sap.hanadb.version>

spring-ai-bom/pom.xml

Lines changed: 12 additions & 0 deletions
@@ -132,6 +132,12 @@
132132
<version>${project.version}</version>
133133
</dependency>
134134

135+
<dependency>
136+
<groupId>org.springframework.ai</groupId>
137+
<artifactId>spring-ai-cassandra</artifactId>
138+
<version>${project.version}</version>
139+
</dependency>
140+
135141
<dependency>
136142
<groupId>org.springframework.ai</groupId>
137143
<artifactId>spring-ai-chroma-store</artifactId>
@@ -218,6 +224,12 @@
218224
</dependency>
219225

220226
<!-- Spring Boot Starters -->
227+
<dependency>
228+
<groupId>org.springframework.ai</groupId>
229+
<artifactId>spring-ai-apache-cassandra-store-spring-boot-starter</artifactId>
230+
<version>${project.version}</version>
231+
</dependency>
232+
221233
<dependency>
222234
<groupId>org.springframework.ai</groupId>
223235
<artifactId>spring-ai-azure-openai-spring-boot-starter</artifactId>

spring-ai-docs/src/main/antora/modules/ROOT/nav.adoc

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -46,23 +46,25 @@
4646
**** xref:api/audio/speech/openai-speech.adoc[OpenAI]
4747
** xref:api/vectordbs.adoc[]
4848
*** xref:api/vectordbs/azure.adoc[]
49+
*** xref:api/vectordbs/apache-cassandra.adoc[]
4950
*** xref:api/vectordbs/chroma.adoc[]
51+
*** xref:api/vectordbs/gemfire.adoc[GemFire]
5052
*** xref:api/vectordbs/milvus.adoc[]
5153
*** xref:api/vectordbs/neo4j.adoc[]
5254
*** xref:api/vectordbs/pgvector.adoc[]
53-
*** xref:api/vectordbs/weaviate.adoc[]
54-
*** xref:api/vectordbs/redis.adoc[]
5555
*** xref:api/vectordbs/pinecone.adoc[]
5656
*** xref:api/vectordbs/qdrant.adoc[]
57-
*** xref:api/vectordbs/gemfire.adoc[GemFire]
57+
*** xref:api/vectordbs/redis.adoc[]
5858
*** xref:api/vectordbs/hana.adoc[SAP Hana]
59+
*** xref:api/vectordbs/weaviate.adoc[]
60+
5961
** xref:api/functions.adoc[Function Calling]
6062
** xref:api/prompt.adoc[]
6163
** xref:api/output-parser.adoc[]
6264
** xref:api/etl-pipeline.adoc[]
6365
** xref:api/testing.adoc[]
6466
** xref:api/generic-model.adoc[]
65-
* xref:api/testcontainers.adoc[Testcontainers]
6667
* xref:contribution-guidelines.adoc[Contribution Guidelines]
6768
* Appendices
6869
** xref:upgrade-notes.adoc[]
70+
** xref:api/testcontainers.adoc[Testcontainers]

spring-ai-docs/src/main/antora/modules/ROOT/pages/api/vectordbs.adoc

Lines changed: 11 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -88,14 +88,18 @@ Find more information on the `Filter.Expression` in the <<metadata-filters>> sec
8888
These are the available implementations of the `VectorStore` interface:
8989

9090
* xref:api/vectordbs/azure.adoc[ Azure Vector Search] - The https://learn.microsoft.com/en-us/azure/search/vector-search-overview[Azure] vector store.
91-
* xref:api/vectordbs/chroma.adoc[ChromaVectorStore] - The https://www.trychroma.com/[Chroma] vector store.
92-
* xref:api/vectordbs/milvus.adoc[MilvusVectorStore] - The https://milvus.io/[Milvus] vector store.
93-
* xref:api/vectordbs/neo4j.adoc[Neo4jVectorStore] - The https://neo4j.com/[Neo4j] vector store.
91+
* xref:api/vectordbs/apache-cassandra.adoc[Apache Cassandra] - The https://cassandra.apache.org/doc/latest/cassandra/vector-search/overview.html[Apache Cassandra] vector store.
92+
* xref:api/vectordbs/chroma.adoc[Chroma Vector Store] - The https://www.trychroma.com/[Chroma] vector store.
93+
* xref:api/vectordbs/gemfire.adoc[GemFire Vector Store] - The https://tanzu.vmware.com/content/blog/vmware-gemfire-vector-database-extension[GemFire] vector store.
94+
* xref:api/vectordbs/milvus.adoc[Milvus Vector Store] - The https://milvus.io/[Milvus] vector store.
95+
* xref:api/vectordbs/neo4j.adoc[Neo4j Vector Store] - The https://neo4j.com/[Neo4j] vector store.
9496
* xref:api/vectordbs/pgvector.adoc[PgVectorStore] - The https://github.com/pgvector/pgvector[PostgreSQL/PGVector] vector store.
95-
* xref:api/vectordbs/pinecone.adoc[PineconeVectorStore] - https://www.pinecone.io/[PineCone] vector store.
96-
* xref:api/vectordbs/qdrant.adoc[QdrantVectorStore] - https://www.qdrant.tech/[Qdrant] vector store.
97-
* xref:api/vectordbs/redis.adoc[RedisVectorStore] - The https://redis.io/[Redis] vector store.
98-
* xref:api/vectordbs/weaviate.adoc[WeaviateVectorStore] - The https://weaviate.io/[Weaviate] vector store.
97+
* xref:api/vectordbs/pinecone.adoc[Pinecone Vector Store] - The https://www.pinecone.io/[Pinecone] vector store.
98+
* xref:api/vectordbs/qdrant.adoc[Qdrant Vector Store] - The https://www.qdrant.tech/[Qdrant] vector store.
99+
* xref:api/vectordbs/redis.adoc[Redis Vector Store] - The https://redis.io/[Redis] vector store.
100+
* xref:api/vectordbs/hana.adoc[SAP Hana Vector Store] - The https://news.sap.com/2024/04/sap-hana-cloud-vector-engine-ai-with-business-context/[SAP HANA] vector store.
101+
* xref:api/vectordbs/weaviate.adoc[Weaviate Vector Store] - The https://weaviate.io/[Weaviate] vector store.
99103
* link:https://github.com/spring-projects/spring-ai/blob/main/spring-ai-core/src/main/java/org/springframework/ai/vectorstore/SimpleVectorStore.java[SimpleVectorStore] - A simple implementation of persistent vector storage, good for educational purposes.
100104

101105
More implementations may be supported in future releases.
Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
1+
= Apache Cassandra
2+
3+
This section walks you through setting up `CassandraVectorStore` to store document embeddings and perform similarity searches.
4+
5+
== What is Apache Cassandra?
6+
7+
link:https://cassandra.apache.org[Apache Cassandra] is a true open source distributed database renowned for scalability and high availability without compromising performance.
8+
9+
Linear scalability, proven fault tolerance, and low latency on commodity hardware make it the perfect platform for mission-critical data. Its Vector Similarity Search (VSS) is based on the JVector library, which ensures best-in-class performance and relevancy.
10+
11+
A vector search in Apache Cassandra is as simple as:
12+
```
13+
SELECT content FROM table ORDER BY content_vector ANN OF query_embedding ;
14+
```
15+
16+
See the https://cassandra.apache.org/doc/latest/cassandra/getting-started/vector-search-quickstart.html[vector search quickstart] for more details.
17+
18+
The Spring AI Cassandra Vector Store is designed to work both for brand new RAG applications and for retrofitting on top of existing data and tables. It can equally be used for non-RAG, non-AI use cases, e.g. semantic searching in an existing database. The vector store automatically creates, or enhances, the schema as needed according to its configuration. If you don't want schema modifications, configure the store with `disallowSchemaChanges`.
19+
20+
== What is JVector Vector Search?
21+
22+
link:https://github.com/jbellis/jvector[JVector] is a pure Java embedded vector search engine.
23+
24+
It stands out from other HNSW Vector Similarity Search implementations by being:
25+
26+
* Algorithmic-fast. JVector uses state of the art graph algorithms inspired by DiskANN and related research that offer high recall and low latency.
27+
* Implementation-fast. JVector uses the Panama SIMD API to accelerate index build and queries.
28+
* Memory efficient. JVector compresses vectors using product quantization so they can stay in memory during searches. (As part of our PQ implementation, our SIMD-accelerated kmeans class is 5x faster than the one in Apache Commons Math.)
29+
* Disk-aware. JVector’s disk layout is designed to do the minimum necessary iops at query time.
30+
* Concurrent. Index builds scale linearly to at least 32 threads. Double the threads, halve the build time.
31+
* Incremental. Query your index as you build it. No delay between adding a vector and being able to find it in search results.
32+
* Easy to embed. API designed for easy embedding, by people using it in production.
33+
34+
== Prerequisites
35+
36+
1. An `EmbeddingClient` instance to compute the document embeddings. This is usually configured as a Spring Bean. Several options are available:
37+
38+
- `Transformers Embedding` - computes the embedding in your local environment. The default is via ONNX and the all-MiniLM-L6-v2 Sentence Transformers. This just works.
39+
- `OpenAI Embeddings` - uses the OpenAI embedding endpoint. You need to create an account at link:https://platform.openai.com/signup[OpenAI Signup] and generate the api-key token at link:https://platform.openai.com/account/api-keys[API Keys].
40+
- There are many more choices; see the `Embeddings API` docs.
41+
42+
2. An Apache Cassandra instance, version 5.0-beta1 or later
43+
a. link:https://cassandra.apache.org/_/quickstart.html[DIY Quick Start]
44+
b. For a managed offering, https://astra.datastax.com/[Astra DB] has a healthy free tier.
45+
46+
== Dependencies
47+
48+
Add these dependencies to your project:
49+
50+
* For just the Cassandra Vector Store
51+
52+
[source,xml]
53+
----
54+
<dependency>
55+
<groupId>org.springframework.ai</groupId>
56+
<artifactId>spring-ai-cassandra</artifactId>
57+
</dependency>
58+
----
59+
60+
* Or, for everything you need in a RAG application (using the default ONNX Embedding Client)
61+
62+
[source,xml]
63+
----
64+
<dependency>
65+
<groupId>org.springframework.ai</groupId>
66+
<artifactId>spring-ai-cassandra-spring-boot-starter</artifactId>
67+
</dependency>
68+
----
69+
70+
71+
TIP: Refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section to add the Spring AI BOM to your build file.
72+
73+
* If, for example, you want to use the OpenAI modules, remember to provide your OpenAI API Key. Set it as an environment variable like so:
74+
75+
[source,bash]
76+
----
77+
export SPRING_AI_OPENAI_API_KEY='Your_OpenAI_API_Key'
78+
----
79+
80+
81+
== Usage
82+
83+
Create a CassandraVectorStore instance connected to your Apache Cassandra database:
84+
85+
[source,java]
86+
----
87+
@Bean
88+
public VectorStore vectorStore(EmbeddingClient embeddingClient) {
89+
90+
CassandraVectorStoreConfig config = CassandraVectorStoreConfig.builder().build();
91+
92+
return new CassandraVectorStore(config, embeddingClient);
93+
}
94+
----
95+
96+
NOTE: It is more convenient and preferred to create the `CassandraVectorStore` as a Bean.
97+
You can also create it manually if you prefer.
98+
99+
[NOTE]
100+
====
101+
The default configuration connects to Cassandra at localhost:9042 and will automatically create the default schema at `springframework_ai_vector.springframework_ai_vector_store`.
102+
103+
Please see `CassandraVectorStoreConfig.Builder` for all the configuration options.
104+
====
105+
106+
[NOTE]
107+
====
108+
The Cassandra Java Driver is most easily configured via the `application.conf` file on the classpath.
109+
110+
More info can be found link:https://github.com/apache/cassandra-java-driver/tree/4.x/manual/core/configuration[here].
111+
====
112+
113+
Then in your main code, create some documents:
114+
115+
[source,java]
116+
----
117+
List<Document> documents = List.of(
118+
new Document("Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!", Map.of("country", "UK", "year", 2020)),
119+
new Document("The World is Big and Salvation Lurks Around the Corner", Map.of()),
120+
new Document("You walk forward facing the past and you turn back toward the future.", Map.of("country", "NL", "year", 2023)));
121+
----
122+
123+
Now add the documents to your vector store:
124+
125+
126+
[source,java]
127+
----
128+
vectorStore.add(documents);
129+
----
130+
131+
And finally, retrieve documents similar to a query:
132+
133+
[source,java]
134+
----
135+
List<Document> results = vectorStore.similaritySearch(
136+
SearchRequest.query("Spring").withTopK(5));
137+
----
138+
139+
If all goes well, you should retrieve the document containing the text "Spring AI rocks!!".
140+
141+
You can also limit results based on a similarity threshold:
142+
[source,java]
143+
----
144+
List<Document> results = vectorStore.similaritySearch(
145+
SearchRequest.query("Spring").withTopK(5)
146+
.withSimilarityThreshold(0.5d));
147+
----
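To make the threshold concrete, here is a small standalone sketch of the idea. It assumes the score is plain cosine similarity, which is an assumption for illustration only; the store's actual scoring function and scale may differ. `ThresholdSketch` and its methods are hypothetical names, not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Spring AI.
final class ThresholdSketch {

    // Cosine similarity between two vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the indexes of the candidates whose similarity to the query
    // is at least the given threshold.
    static List<Integer> keepAbove(float[] query, List<float[]> candidates, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            if (cosine(query, candidates.get(i)) >= threshold) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        List<float[]> candidates = List.of(new float[] {1f, 0f}, new float[] {0f, 1f});
        // Only the first candidate points the same way as the query.
        System.out.println(keepAbove(query, candidates, 0.5d)); // prints [0]
    }
}
```

Under this assumption, a higher threshold trades recall for precision: fewer documents are returned, but each one is closer to the query.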
148+
149+
=== Metadata filtering
150+
151+
You can leverage the generic, portable link:https://docs.spring.io/spring-ai/reference/api/vectordbs.html#_metadata_filters[metadata filters] with the CassandraVectorStore as well. Metadata fields must be configured in `CassandraVectorStoreConfig`.
152+
153+
For example, you can use either the text expression language:
154+
155+
[source,java]
156+
----
157+
vectorStore.similaritySearch(
158+
SearchRequest.query("The World").withTopK(TOP_K)
159+
.withFilterExpression("country in ['UK', 'NL'] && year >= 2020"));
160+
----
161+
162+
or programmatically using the expression DSL:
163+
164+
[source,java]
165+
----
166+
FilterExpressionBuilder b = new FilterExpressionBuilder();
167+
Filter.Expression f = b.and(b.in("country", "UK", "NL"), b.gte("year", 2020)).build();
168+
169+
vectorStore.similaritySearch(
170+
SearchRequest.query("The World").withTopK(TOP_K)
171+
.withFilterExpression(f));
172+
----
173+
174+
The portable filter expressions get automatically converted into link:https://cassandra.apache.org/doc/latest/cassandra/developing/cql/index.html[CQL queries].
175+
176+
To be searchable, metadata fields must be either primary key columns or SAI indexed. To index a field, configure the metadata field with `SchemaColumnTags.INDEXED`.
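To give a feel for what converting a portable filter into CQL involves, the following toy sketch renders the example filter as a CQL `WHERE` fragment. It is not the store's actual converter: `CqlWhereSketch` and its methods are hypothetical, column names are assumed to be trusted identifiers, and values are inlined as literals only to keep the example short (real code would normally use bind parameters).

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical illustration, not the converter used by CassandraVectorStore.
final class CqlWhereSketch {

    // Renders a value as a CQL literal: numbers as-is, everything else
    // as a single-quoted string with embedded quotes doubled.
    static String literal(Object value) {
        if (value instanceof Number) {
            return value.toString();
        }
        return "'" + value.toString().replace("'", "''") + "'";
    }

    static String in(String column, Object... values) {
        String list = Arrays.stream(values)
            .map(CqlWhereSketch::literal)
            .collect(Collectors.joining(", "));
        return column + " IN (" + list + ")";
    }

    static String gte(String column, Object value) {
        return column + " >= " + literal(value);
    }

    static String and(String left, String right) {
        return left + " AND " + right;
    }

    public static void main(String[] args) {
        // Mirrors: country in ['UK', 'NL'] && year >= 2020
        System.out.println(and(in("country", "UK", "NL"), gte("year", 2020)));
        // prints: country IN ('UK', 'NL') AND year >= 2020
    }
}
```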
177+
178+
179+
== Advanced Example: Vector Store on top of the full Wikipedia dataset
180+
181+
The following example demonstrates how to use the store on an existing schema. Here we use the schema from the https://github.com/datastax-labs/colbert-wikipedia-data project, which comes with the full Wikipedia dataset already vectorised for you.
182+
183+
184+
=== Usage
185+
186+
Create the schema in the Cassandra database first:
187+
188+
[source,bash]
189+
----
190+
wget https://raw.githubusercontent.com/datastax-labs/colbert-wikipedia-data/main/schema.cql -O colbert-wikipedia-schema.cql
191+
cqlsh -f colbert-wikipedia-schema.cql
192+
----
193+
194+
Then configure the store like this:
195+
196+
[source,java]
197+
----
198+
@Bean
199+
public CassandraVectorStore store(EmbeddingClient embeddingClient) {
200+
201+
List<SchemaColumn> partitionColumns = List.of(new SchemaColumn("wiki", DataTypes.TEXT),
202+
new SchemaColumn("language", DataTypes.TEXT), new SchemaColumn("title", DataTypes.TEXT));
203+
204+
List<SchemaColumn> clusteringColumns = List.of(new SchemaColumn("chunk_no", DataTypes.INT),
205+
new SchemaColumn("bert_embedding_no", DataTypes.INT));
206+
207+
List<SchemaColumn> extraColumns = List.of(new SchemaColumn("revision", DataTypes.INT),
208+
new SchemaColumn("id", DataTypes.INT));
209+
210+
CassandraVectorStoreConfig conf = CassandraVectorStoreConfig.builder()
211+
.withKeyspaceName("wikidata")
212+
.withTableName("articles")
213+
.withPartitionKeys(partitionColumns)
214+
.withClusteringKeys(clusteringColumns)
215+
.withContentFieldName("body")
216+
.withEmbeddingFieldName("all_minilm_l6_v2_embedding")
217+
.withIndexName("all_minilm_l6_v2_ann")
218+
.disallowSchemaChanges()
219+
.addMetadataFields(extraColumns)
220+
.withPrimaryKeyTranslator((List<Object> primaryKeys) -> {
221+
// the delimiter used to join fields together into the document's id
222+
// is arbitrary, here "§¶" is used
223+
if (primaryKeys.isEmpty()) {
224+
return "test§¶0";
225+
}
226+
return format("%s§¶%s", primaryKeys.get(2), primaryKeys.get(3));
227+
})
228+
.withDocumentIdTranslator((id) -> {
229+
String[] parts = id.split("§¶");
230+
String title = parts[0];
231+
int chunk_no = 1 < parts.length ? Integer.parseInt(parts[1]) : 0;
232+
return List.of("simplewiki", "en", title, chunk_no, 0);
233+
})
234+
.build();
235+
236+
return new CassandraVectorStore(conf, embeddingClient);
237+
}
238+
239+
@Bean
240+
public EmbeddingClient embeddingClient() {
241+
// default is ONNX all-MiniLM-L6-v2 which is what we want
242+
return new TransformersEmbeddingClient();
243+
}
244+
----
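The two translators in the configuration above need to be inverses of each other: one joins primary key values into a document id, and the other splits an id back into the full primary key. The following dependency-free sketch shows that round trip, assuming the same `§¶` delimiter and the `(wiki, language, title, chunk_no, bert_embedding_no)` key layout used in the example. The class and method names (`WikiIdCodec`, `toDocumentId`, `toPrimaryKey`) are hypothetical and not part of Spring AI.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spring AI: mirrors the two translator lambdas above.
final class WikiIdCodec {

    // The arbitrary "§¶" delimiter, written with unicode escapes so the source
    // compiles regardless of file encoding. Assumed never to occur in a title.
    static final String DELIMITER = "\u00A7\u00B6";

    // Primary key values (wiki, language, title, chunk_no, ...) -> document id.
    static String toDocumentId(List<Object> primaryKeys) {
        if (primaryKeys.isEmpty()) {
            return "test" + DELIMITER + "0";
        }
        return primaryKeys.get(2) + DELIMITER + primaryKeys.get(3);
    }

    // Document id -> full primary key, filling in the fixed partition values.
    static List<Object> toPrimaryKey(String id) {
        String[] parts = id.split(Pattern.quote(DELIMITER));
        String title = parts[0];
        // Only parse a chunk number when the id actually carries one.
        int chunkNo = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return List.of("simplewiki", "en", title, chunkNo, 0);
    }

    public static void main(String[] args) {
        List<Object> key = List.of("simplewiki", "en", "Cassandra", 3, 0);
        String id = toDocumentId(key);
        // The id decodes back to the key it was built from.
        System.out.println(toPrimaryKey(id).equals(key)); // prints true
    }
}
```

Note that the id only carries the title and chunk number, so the round trip is lossless only for rows whose remaining key columns match the fixed values (`simplewiki`, `en`, and embedding number `0`).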
245+
246+
If you would like to load the full Wikipedia dataset, follow these steps.
247+
First download `simplewiki-sstable.tar` from https://drive.google.com/file/d/1CcMMsj8jTKRVGep4A7hmOSvaPepsaKYP/view?usp=share_link[this link]. This will take a while; the file is tens of GBs.
248+
249+
[source,bash]
250+
----
251+
tar -xf simplewiki-sstable.tar -C ${CASSANDRA_DATA}/data/wikidata/articles-*/
252+
253+
nodetool import wikidata articles ${CASSANDRA_DATA}/data/wikidata/articles-*/
254+
----
255+
256+
NOTE: If you have existing data in this table, check that the tarball's files don't clobber existing sstables when running `tar`.
257+
258+
NOTE: An alternative to the `nodetool import` is to just restart Cassandra.
259+
260+
NOTE: If there are any failures in the indexes they will be rebuilt automatically.

spring-ai-docs/src/main/antora/modules/ROOT/pages/index.adoc

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Spring AI provides the following features:
1515
* Supported Model types are Chat and Text to Image with more on the way.
1616
* Portable API across AI providers for Chat and for Embedding models. Both synchronous and stream API options are supported. Dropping down to access model specific features is also supported.
1717
* Mapping of AI Model output to POJOs.
18-
* Support for all major Vector Database providers such as Azure Vector Search, Chroma, Milvus, Neo4j, PostgreSQL/PGVector, PineCone, Qdrant, Redis, and Weaviate
18+
* Support for all major Vector Database providers such as Apache Cassandra, Azure Vector Search, Chroma, Milvus, Neo4j, PostgreSQL/PGVector, PineCone, Qdrant, Redis, and Weaviate
1919
* Portable API across Vector Store providers, including a novel SQL-like metadata filter API that is also portable.
2020
* Function calling
2121
* Spring Boot Auto Configuration and Starters for AI Models and Vector Stores.
