|
1 |
| -# terminusdb-semantic-indexer |
2 |
| -Semantic Indexer |
| 1 | +# Vemdex: TerminusDB Semantic Indexer |
| 2 | + |
| 3 | +The TerminusDB Semantic Indexer is a vector database with an index |
| 4 | +based on Hierarchical Navigable Small World graphs, written in Rust. It |
| 5 | +is designed to work closely with TerminusDB but can be used with any |
| 6 | +project via a simple HTTP API. In order to work well with TerminusDB |
| 7 | +it is designed with the following features: |
| 8 | + |
| 9 | +* Domains: The database can manage several domains. In a domain you |
| 10 | + have a vector store which is append only. This allows you to share |
| 11 | + vectors across indexes. |
| 12 | +* Commits: Each index exists at a commit. The index can point to any |
| 13 | + vector in a domain. This allows us to add and remove vectors by |
| 14 | + changing only the index. |
| 15 | +* Incremental Indexing: The indexer can take a previous commit, and |
| 16 | + then perform the operations specified to obtain a new commit. |
| 17 | +* Embeddings: The indexer connects with a text-to-vector embedding |
| 18 | + API in order to convert content into vectors. |
| 19 | + |
| 20 | +To run the server, first compile it and then invoke it as described below. |
| 21 | + |
| 22 | +## Compiling |
| 23 | + |
| 24 | +You can compile the system with cargo: |
| 25 | + |
| 26 | +```shell |
| 27 | +cargo build --release |
| 28 | +``` |
| 29 | + |
| 30 | +## Invoking |
| 31 | + |
| 32 | +In order to invoke the server, you need to supply an OpenAI key. This |
| 33 | +will provide you with embeddings for your text. |
| 34 | + |
| 35 | +You can do this by either setting the env variable `OPENAI_KEY` or by |
| 36 | +using the `--key` command line option. |
| 37 | + |
| 38 | +```shell |
| 39 | +terminusdb-semantic-indexer serve --directory /path/to/storage/dir |
| 40 | +``` |
| 41 | + |
| 42 | +## Indexing |
| 43 | + |
| 44 | +If you want to index documents, you can use any of these methods: |
| 45 | + |
| 46 | +* Run a TerminusDB installation and refer to real commits and databases |
| 47 | +* Put up an endpoint that will issue the appropriate operations for a |
| 48 | + commit id and a domain |
| 49 | +* Use the `load` command |
| 50 | + |
| 51 | +In any case, the database expects content in JSON Lines format, with |
| 52 | +one operation per line, of the following form: |
| 53 | + |
| 54 | +```json |
| 55 | +{"id":"terminusdb:///star-wars/People/20", "op":"Inserted", "string":"The person's name is Yoda. They are described with the following synopsis: Yoda is a fictional character in the Star Wars franchise created by George Lucas, first appearing in the 1980 film The Empire Strikes Back. In the original films, he trains Luke Skywalker to fight against the Galactic Empire. In the prequel films, he serves as the Grand Master of the Jedi Order and as a high-ranking general of Clone Troopers in the Clone Wars. Following his death in Return of the Jedi at the age of 900, Yoda was the oldest living character in the Star Wars franchise in canon, until the introduction of Maz Kanata in Star Wars: The Force Awakens. Their gender is male. They have the following hair colours: white. They have a mass of 17. Their skin colours are green."} |
| 56 | +{"id":"terminusdb:///star-wars/People/21", "op":"Deleted"} |
| 57 | +{"id":"terminusdb:///star-wars/People/22", "op":"Replaced", "string":"The person's name is Boba Fett. They are described with the following synopsis: Boba Fett is a fictional character in the Star Wars franchise. In The Empire Strikes Back and Return of the Jedi, he is a bounty hunter hired by Darth Vader and also employed by Jabba the Hutt. He was also added briefly to the original film Star Wars when the film was digitally remastered. Star Wars: Episode II – Attack of the Clones establishes his origin as an unaltered clone of the bounty hunter Jango Fett raised as his son. He also appears in several episodes of Star Wars: The Clone Wars cartoon series which further describes his growth as a villain in the Star Wars universe. His aura of danger and mystery has created a cult following for the character. Their gender is male. They have the following hair colours: black. They have a mass of 78.2. Their skin colours are fair."} |
| 58 | +``` |
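
An operations file like the one above can be produced by any tool that writes one JSON object per line. The following is a minimal Python sketch, assuming only the three `op` values and the `id` and `string` fields shown in the example. The `operation` and `to_jsonlines` helpers and the shortened strings are hypothetical and are not part of the project:

```python
import json

def operation(doc_id, op, text=None):
    """Build one indexing operation as a dict (hypothetical helper)."""
    if op not in ("Inserted", "Replaced", "Deleted"):
        raise ValueError(f"unknown op: {op}")
    record = {"id": doc_id, "op": op}
    # In the example above, only Inserted and Replaced carry a "string".
    if op != "Deleted":
        if text is None:
            raise ValueError(f"{op} requires a string")
        record["string"] = text
    return record

def to_jsonlines(operations):
    """Serialize operations as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(o, ensure_ascii=False) for o in operations) + "\n"

ops = [
    operation("terminusdb:///star-wars/People/20", "Inserted", "The person's name is Yoda."),
    operation("terminusdb:///star-wars/People/21", "Deleted"),
    operation("terminusdb:///star-wars/People/22", "Replaced", "The person's name is Boba Fett."),
]
payload = to_jsonlines(ops)
print(payload, end="")
```

Validating the `op` value before serializing keeps malformed lines out of the file, which is easier than tracking down a rejected line afterwards.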
| 59 | + |
| 60 | +To kick off indexing, you can submit the following request to the Vemdex server: |
| 61 | + |
| 62 | +```shell |
| 63 | +curl 'localhost:8080/index?commit=0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn&domain=admin/star_wars' |
| 64 | +``` |
| 65 | + |
| 66 | +This invokes the indexer for commit `0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn` |
| 67 | +and domain `admin/star_wars`. |
| 68 | + |
| 69 | +## Searching |
| 70 | + |
| 71 | +Searching is easy: you can send a natural language query to the server as follows: |
| 72 | + |
| 73 | +```shell |
| 74 | +curl 'localhost:8080/search?commit=0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn&domain=admin/star_wars' -d "Wise old man" |
| 75 | +``` |
| 76 | + |
| 77 | +You can also find nearby documents with: |
| 78 | + |
| 79 | +```shell |
| 80 | +curl 'localhost:8080/similar?commit=0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn&domain=admin/star_wars&id=MyExternalID' |
| 81 | +``` |
| 82 | + |
| 83 | +Here `MyExternalID` is the identifier you gave the record during |
| 84 | +indexing (specified by the `id` field). |
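
When building these URLs from a program, it is easy to join the query parameters incorrectly. The following is a minimal Python sketch that assembles the `index`, `search` and `similar` URLs from the examples above using only the standard library. The `endpoint_url` helper and the `BASE` constant are hypothetical, and the host and port are simply the ones used in the curl commands:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080"  # host and port taken from the curl examples

def endpoint_url(endpoint, commit, domain, **extra):
    """Build a request URL with `commit`, `domain` and any extra parameters (hypothetical helper)."""
    params = {"commit": commit, "domain": domain, **extra}
    # safe="/" keeps the slash in the domain readable, as in the examples.
    return f"{BASE}/{endpoint}?{urlencode(params, safe='/')}"

commit = "0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn"
domain = "admin/star_wars"

print(endpoint_url("index", commit, domain))
print(endpoint_url("search", commit, domain))
print(endpoint_url("similar", commit, domain, id="MyExternalID"))
```

For `search`, the query text itself is still sent as the request body, as in the `-d` example above.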
| 85 | + |
| 86 | +## Todo |
| 87 | + |
| 88 | +There is lots of work left to make this the open source versioned |
| 89 | +vector database that the world deserves. Anyone who wants to work on |
| 90 | +the project to advance these aims is welcome: |
| 91 | + |
| 92 | +* Add other AI configurations for obtaining the embeddings - we'd like |
| 93 | + to be very complete and have ways of configuring other vendors and |
| 94 | + open source text-to-embedding systems. |
| 95 | +* Greater scope of metric support |
| 96 | +* Improve compression: We'd like to have a system of vector compression |
| 97 | + such as PQ for dealing with very large datasets. |
| 98 | +* Better treatment of deletion and replace |
| 99 | +* Better incrementality of the index structure |
| 100 | +* Smaller graph representations of the indices - using succinct data |
| 101 | + structures to reduce memory overhead. |
| 102 | + |
| 103 | +And if you have new ideas we'd love to hear them! |
| 104 | + |
| 105 | + |