= Apache Cassandra

This section walks you through setting up `CassandraVectorStore` to store document embeddings and perform similarity searches.

== What is Apache Cassandra?

link:https://cassandra.apache.org[Apache Cassandra] is a true open source distributed database renowned for scalability and high availability without compromising performance.

Linear scalability, proven fault-tolerance, and low latency on commodity hardware make it the perfect platform for mission-critical data. Its Vector Similarity Search (VSS) is based on the JVector library, which ensures best-in-class performance and relevancy.

A vector search in Apache Cassandra is as simple as:

[source,sql]
----
SELECT content FROM table ORDER BY content_vector ANN OF query_embedding LIMIT 1;
----

More documentation can be found https://cassandra.apache.org/doc/latest/cassandra/getting-started/vector-search-quickstart.html[here].

The Spring AI Cassandra Vector Store is designed to work both for brand-new RAG applications and to be retrofitted on top of existing data and tables. The vector store may equally be used for non-RAG, non-AI use cases, e.g. semantic searching in an existing database. The Vector Store will automatically create, or enhance, the schema as needed according to its configuration. If you don't want schema modifications, configure the store with `disallowSchemaChanges`.

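For example, a minimal sketch of opting out of schema management (using the `CassandraVectorStoreConfig` builder shown later in this section):

[source,java]
----
// Sketch: prevent the store from creating or altering any schema.
// The keyspace, table, and indexes must then already exist.
CassandraVectorStoreConfig config = CassandraVectorStoreConfig.builder()
    .disallowSchemaChanges()
    .build();
----
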
== What is JVector Vector Search?

link:https://github.com/jbellis/jvector[JVector] is a pure Java embedded vector search engine.

It stands out from other HNSW Vector Similarity Search implementations by being:

* Algorithmically fast. JVector uses state-of-the-art graph algorithms inspired by DiskANN and related research that offer high recall and low latency.
* Implementation fast. JVector uses the Panama SIMD API to accelerate index builds and queries.
* Memory efficient. JVector compresses vectors using product quantization so they can stay in memory during searches. (As part of its PQ implementation, its SIMD-accelerated k-means class is 5x faster than the one in Apache Commons Math.)
* Disk-aware. JVector's disk layout is designed to do the minimum necessary IOPS at query time.
* Concurrent. Index builds scale linearly to at least 32 threads. Double the threads, halve the build time.
* Incremental. Query your index as you build it. There is no delay between adding a vector and being able to find it in search results.
* Easy to embed. The API is designed for easy embedding, by people using it in production.

== Prerequisites

1. An `EmbeddingClient` instance to compute the document embeddings. This is usually configured as a Spring Bean. Several options are available:

- `Transformers Embedding` - computes the embedding in your local environment. The default is ONNX with the all-MiniLM-L6-v2 Sentence Transformers model. This just works.
- `OpenAI Embedding` - uses the OpenAI embedding endpoint. You need to create an account at link:https://platform.openai.com/signup[OpenAI Signup] and generate an api-key token at link:https://platform.openai.com/account/api-keys[API Keys].
- There are many more choices; see the `Embeddings API` docs.

2. An Apache Cassandra instance, from version 5.0-beta1 onwards:
a. link:https://cassandra.apache.org/_/quickstart.html[DIY Quick Start]
b. For a managed offering, https://astra.datastax.com/[Astra DB] provides a healthy free tier.

== Dependencies

Add these dependencies to your project:

* For just the Cassandra Vector Store:

[source,xml]
----
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-cassandra</artifactId>
</dependency>
----

* Or, for everything you need in a RAG application (using the default ONNX Embedding Client):

[source,xml]
----
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-cassandra-spring-boot-starter</artifactId>
</dependency>
----

TIP: Refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section to add the Spring AI BOM to your build file.

* If, for example, you want to use the OpenAI modules, remember to provide your OpenAI API key. Set it as an environment variable like so:

[source,bash]
----
export SPRING_AI_OPENAI_API_KEY='Your_OpenAI_API_Key'
----

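Alternatively, the key can go in `application.properties`; the property name below is an assumption based on Spring AI's standard configuration conventions, so check the OpenAI chapter of the reference docs for your version:

[source,properties]
----
spring.ai.openai.api-key=Your_OpenAI_API_Key
----
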
== Usage

Create a `CassandraVectorStore` instance connected to your Apache Cassandra database:

[source,java]
----
@Bean
public VectorStore vectorStore(EmbeddingClient embeddingClient) {

    CassandraVectorStoreConfig config = CassandraVectorStoreConfig.builder().build();

    return new CassandraVectorStore(config, embeddingClient);
}
----

NOTE: It is more convenient, and preferred, to create the `CassandraVectorStore` as a Spring Bean, but you can also create it manually.

[NOTE]
====
The default configuration connects to Cassandra at `localhost:9042` and will automatically create the default schema at `springframework_ai_vector.springframework_ai_vector_store`.

See `CassandraVectorStoreConfig.Builder` for all the configuration options.
====

[NOTE]
====
The Cassandra Java Driver is most easily configured via an `application.conf` file on the classpath.

More info can be found link:https://github.com/apache/cassandra-java-driver/tree/4.x/manual/core/configuration[here].
====

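For instance, a minimal `application.conf` might look like the following; the contact point and datacenter name are placeholder assumptions for a default local single-node setup:

[source,hocon]
----
datastax-java-driver {
  # Node(s) the driver connects to on startup.
  basic.contact-points = [ "127.0.0.1:9042" ]
  # Must match the datacenter name of your cluster (default for a fresh install).
  basic.load-balancing-policy.local-datacenter = datacenter1
}
----
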
Then in your main code, create some documents:

[source,java]
----
List<Document> documents = List.of(
    new Document("Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!", Map.of("country", "UK", "year", 2020)),
    new Document("The World is Big and Salvation Lurks Around the Corner", Map.of()),
    new Document("You walk forward facing the past and you turn back toward the future.", Map.of("country", "NL", "year", 2023)));
----

Now add the documents to your vector store:

[source,java]
----
vectorStore.add(documents);
----

And finally, retrieve documents similar to a query:

[source,java]
----
List<Document> results = vectorStore.similaritySearch(
    SearchRequest.query("Spring").withTopK(5));
----

If all goes well, you should retrieve the document containing the text "Spring AI rocks!!".

You can also limit results based on a similarity threshold:

[source,java]
----
List<Document> results = vectorStore.similaritySearch(
    SearchRequest.query("Spring").withTopK(5)
        .withSimilarityThreshold(0.5d));
----

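To inspect what came back, you can iterate over the results; this is a small sketch assuming the `Document` accessors `getContent()` and `getMetadata()` from the Spring AI `Document` class:

[source,java]
----
// Print each matched document's text and its metadata map.
for (Document doc : results) {
    System.out.println(doc.getContent());
    System.out.println(doc.getMetadata());
}
----
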
=== Metadata filtering

You can leverage the generic, portable link:https://docs.spring.io/spring-ai/reference/api/vectordbs.html#_metadata_filters[metadata filters] with the CassandraVectorStore as well. Metadata fields must be configured in `CassandraVectorStoreConfig`.

For example, you can use either the text expression language:

[source,java]
----
vectorStore.similaritySearch(
    SearchRequest.query("The World").withTopK(TOP_K)
        .withFilterExpression("country in ['UK', 'NL'] && year >= 2020"));
----

or programmatically using the expression DSL:

[source,java]
----
FilterExpressionBuilder b = new FilterExpressionBuilder();
Filter.Expression f = b.and(b.in("country", "UK", "NL"), b.gte("year", 2020)).build();

vectorStore.similaritySearch(
    SearchRequest.query("The World").withTopK(TOP_K)
        .withFilterExpression(f));
----

The portable filter expressions get automatically converted into link:https://cassandra.apache.org/doc/latest/cassandra/developing/cql/index.html[CQL queries].

For metadata fields to be searchable, they need to be either primary key columns or SAI indexed. To make a non-primary-key column searchable, configure the metadata field with `SchemaColumnTags.INDEXED`.

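For example, a sketch of declaring the `country` and `year` metadata fields as indexed, assuming a `SchemaColumn` constructor that accepts `SchemaColumnTags` (consistent with the builder API used elsewhere in this section):

[source,java]
----
// Declare metadata columns as SAI-indexed so filter expressions can use them.
CassandraVectorStoreConfig config = CassandraVectorStoreConfig.builder()
    .addMetadataFields(
        new SchemaColumn("country", DataTypes.TEXT, SchemaColumnTags.INDEXED),
        new SchemaColumn("year", DataTypes.INT, SchemaColumnTags.INDEXED))
    .build();
----
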
== Advanced Example: Vector Store on top of the full Wikipedia dataset

The following example demonstrates how to use the store on an existing schema. Here we use the schema from the https://github.com/datastax-labs/colbert-wikipedia-data[colbert-wikipedia-data] project, which comes with the full Wikipedia dataset already vectorized for you.

=== Usage

Create the schema in the Cassandra database first:

[source,bash]
----
wget https://raw.githubusercontent.com/datastax-labs/colbert-wikipedia-data/main/schema.cql -O colbert-wikipedia-schema.cql
cqlsh -f colbert-wikipedia-schema.cql
----

Then configure the store like:

[source,java]
----
@Bean
public CassandraVectorStore store(EmbeddingClient embeddingClient) {

    List<SchemaColumn> partitionColumns = List.of(new SchemaColumn("wiki", DataTypes.TEXT),
            new SchemaColumn("language", DataTypes.TEXT), new SchemaColumn("title", DataTypes.TEXT));

    List<SchemaColumn> clusteringColumns = List.of(new SchemaColumn("chunk_no", DataTypes.INT),
            new SchemaColumn("bert_embedding_no", DataTypes.INT));

    List<SchemaColumn> extraColumns = List.of(new SchemaColumn("revision", DataTypes.INT),
            new SchemaColumn("id", DataTypes.INT));

    CassandraVectorStoreConfig conf = CassandraVectorStoreConfig.builder()
        .withKeyspaceName("wikidata")
        .withTableName("articles")
        .withPartitionKeys(partitionColumns)
        .withClusteringKeys(clusteringColumns)
        .withContentFieldName("body")
        .withEmbeddingFieldName("all_minilm_l6_v2_embedding")
        .withIndexName("all_minilm_l6_v2_ann")
        .disallowSchemaChanges()
        .addMetadataFields(extraColumns)
        .withPrimaryKeyTranslator((List<Object> primaryKeys) -> {
            // the delimiter used to join fields together into the document's id
            // is arbitrary; here "§¶" is used
            if (primaryKeys.isEmpty()) {
                return "test§¶0";
            }
            return format("%s§¶%s", primaryKeys.get(2), primaryKeys.get(3));
        })
        .withDocumentIdTranslator((id) -> {
            String[] parts = id.split("§¶");
            String title = parts[0];
            // fall back to chunk 0 when the id contains no chunk number
            int chunk_no = 1 < parts.length ? Integer.parseInt(parts[1]) : 0;
            return List.of("simplewiki", "en", title, chunk_no, 0);
        })
        .build();

    return new CassandraVectorStore(conf, embeddingClient());
}

@Bean
public EmbeddingClient embeddingClient() {
    // default is ONNX all-MiniLM-L6-v2, which is what we want
    return new TransformersEmbeddingClient();
}
----

If you would like to load the full Wikipedia dataset, first download `simplewiki-sstable.tar` from this link: https://drive.google.com/file/d/1CcMMsj8jTKRVGep4A7hmOSvaPepsaKYP/view?usp=share_link . This will take a while; the file is tens of GBs.

[source,bash]
----
tar -xf simplewiki-sstable.tar -C ${CASSANDRA_DATA}/data/wikidata/articles-*/

nodetool import wikidata articles ${CASSANDRA_DATA}/data/wikidata/articles-*/
----

NOTE: If you have existing data in this table, check that the tarball's files don't clobber your existing sstables when extracting.

NOTE: An alternative to `nodetool import` is to just restart Cassandra.

NOTE: If there are any failures in the indexes, they will be rebuilt automatically.