Commit af8f6c2

Adding readme
1 parent c163ee5 commit af8f6c2

1 file changed (+105, -2 lines changed)

README.md

Lines changed: 105 additions & 2 deletions
@@ -1,2 +1,105 @@
-# terminusdb-semantic-indexer
-Semantic Indexer

# Vemdex: TerminusDB Semantic Indexer

The TerminusDB Semantic Indexer is a vector database with an index
based on Hierarchical Navigable Small World (HNSW) graphs, written in Rust. It
is designed to work closely with TerminusDB but can be used with any
project via a simple HTTP API. To work well with TerminusDB,
it is designed with the following features:

* Domains: The database can manage several domains. In a domain you
  have a vector store which is append only. This allows you to share
  vectors across indexes.
* Commits: Each index exists at a commit. The index can point to any
  vector in a domain. This allows us to add and remove vectors by
  changing only the index.
* Incremental Indexing: The indexer can take a previous commit, and
  then perform the operations specified to obtain a new commit.
* Embeddings: The indexer connects with a text-to-vector embedding API
  in order to convert content into vectors.

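The relationship between domains, commits and incremental indexing can be pictured with a toy model. The sketch below is purely conceptual: the names `Domain` and `apply_ops` and the `vector` field are hypothetical, the index is a plain dictionary rather than an HNSW graph, and none of it reflects the actual Rust implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Toy append-only vector store (hypothetical, not the real storage format)."""
    vectors: list = field(default_factory=list)

    def append(self, vector):
        # Vectors are only ever appended, so earlier positions stay valid.
        self.vectors.append(vector)
        return len(self.vectors) - 1


def apply_ops(domain, previous_index, ops):
    """Derive a new commit's index from a previous one without mutating it.

    An index here is just {external id: position in the domain}.
    """
    index = dict(previous_index)
    for op in ops:
        if op["op"] == "Deleted":
            index.pop(op["id"], None)
        else:  # "Inserted" or "Replaced"
            index[op["id"]] = domain.append(op["vector"])
    return index


domain = Domain()
commit_a = apply_ops(domain, {}, [
    {"id": "People/20", "op": "Inserted", "vector": [0.1, 0.9]},
    {"id": "People/21", "op": "Inserted", "vector": [0.8, 0.2]},
])
commit_b = apply_ops(domain, commit_a, [
    {"id": "People/21", "op": "Deleted"},
    {"id": "People/20", "op": "Replaced", "vector": [0.2, 0.8]},
])
print(commit_a)             # index at the first commit
print(commit_b)             # index at the second commit
print(len(domain.vectors))  # the store only grows
```

In this model both commits remain usable after the second one is created, because the second index only points at different positions in the shared store.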
To use the server, first compile it and then invoke it as described below.

## Compiling

You can compile the system with cargo:

```shell
cargo build --release
```

## Invoking

To invoke the server, you need to supply an OpenAI key. The key is
used to obtain embeddings for your text.

You can do this either by setting the environment variable `OPENAI_KEY` or by
using the `--key` command line option.

```shell
terminusdb-semantic-indexer serve --directory /path/to/storage/dir
```

## Indexing

If you want to index documents, you can use any of these methods:

* Run a TerminusDB installation and refer to real commits and databases
* Put up an endpoint that will issue the appropriate operations for a
  commit id and a domain
* Use the `load` command

In any case, the database expects content in the following form
(in JSON Lines format):

```json
{"id":"terminusdb:///star-wars/People/20", "op":"Inserted", "string":"The person's name is Yoda. They are described with the following synopsis: Yoda is a fictional character in the Star Wars franchise created by George Lucas, first appearing in the 1980 film The Empire Strikes Back. In the original films, he trains Luke Skywalker to fight against the Galactic Empire. In the prequel films, he serves as the Grand Master of the Jedi Order and as a high-ranking general of Clone Troopers in the Clone Wars. Following his death in Return of the Jedi at the age of 900, Yoda was the oldest living character in the Star Wars franchise in canon, until the introduction of Maz Kanata in Star Wars: The Force Awakens. Their gender is male. They have the following hair colours: white. They have a mass of 17. Their skin colours are green."}
{"id":"terminusdb:///star-wars/People/21", "op":"Deleted"}
{"id":"terminusdb:///star-wars/People/22", "op":"Replaced", "string":"The person's name is Boba Fett. They are described with the following synopsis: Boba Fett is a fictional character in the Star Wars franchise. In The Empire Strikes Back and Return of the Jedi, he is a bounty hunter hired by Darth Vader and also employed by Jabba the Hutt. He was also added briefly to the original film Star Wars when the film was digitally remastered. Star Wars: Episode II – Attack of the Clones establishes his origin as an unaltered clone of the bounty hunter Jango Fett raised as his son. He also appears in several episodes of Star Wars: The Clone Wars cartoon series which further describes his growth as a villain in the Star Wars universe. His aura of danger and mystery has created a cult following for the character. Their gender is male. They have the following hair colours: black. They have a mass of 78.2. Their skin colours are fair."}
```

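If you produce these operation records yourself, a small script can emit them. The following is a minimal sketch using only the Python standard library; the helper names `operation` and `to_json_lines` are hypothetical, and the rule that `Inserted` and `Replaced` carry a `string` while `Deleted` does not is an assumption read off the sample records.

```python
import json


def operation(record_id, op, text=None):
    """Build one operation record in the shape of the sample records."""
    record = {"id": record_id, "op": op}
    if op in ("Inserted", "Replaced"):
        # Assumption: these operations need text to embed.
        if text is None:
            raise ValueError(f"{op} needs a string to embed")
        record["string"] = text
    return record


def to_json_lines(records):
    # One JSON object per line; json.dumps escapes any embedded newlines.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


records = [
    operation("terminusdb:///star-wars/People/20", "Inserted", "The person's name is Yoda."),
    operation("terminusdb:///star-wars/People/21", "Deleted"),
    operation("terminusdb:///star-wars/People/22", "Replaced", "The person's name is Boba Fett."),
]
payload = to_json_lines(records)
print(payload, end="")
```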
To kick off indexing, you can submit the following request to the Vemdex server:

```shell
curl 'localhost:8080/index?commit=0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn&domain=admin/star_wars'
```

This invokes the indexer for commit `0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn`
and domain `admin/star_wars`.

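When the commit and domain come from variables rather than a hand-written URL, the query string can be assembled programmatically. This is a sketch using the Python standard library; `endpoint_url` is a hypothetical helper, the `http://` scheme is assumed from curl's default, and keeping the `/` in the domain unescaped is a choice made to match the curl example rather than documented server behaviour.

```python
from urllib.parse import urlencode


def endpoint_url(base, endpoint, **params):
    """Join a base URL, an endpoint name and query parameters."""
    # safe="/" leaves the slash in a domain such as admin/star_wars as-is.
    return f"{base}/{endpoint}?{urlencode(params, safe='/')}"


url = endpoint_url(
    "http://localhost:8080",
    "index",
    commit="0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn",
    domain="admin/star_wars",
)
print(url)
```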
## Searching

To search, send a natural language query to the server as follows:

```shell
curl 'localhost:8080/search?commit=0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn&domain=admin/star_wars' -d "Wise old man"
```

You can also find documents near an already indexed record with:

```shell
curl 'localhost:8080/similar?commit=0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn&domain=admin/star_wars&id=MyExternalID'
```

Here `MyExternalID` refers to the name you gave the record during
indexing (specified by the `id` field).

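The same two requests can be prepared from a script. The sketch below uses only the Python standard library and does not contact a server; `search_request` and `similar_url` are hypothetical helpers, and the base URL is assumed from the curl examples. The response format is not described here, so the sketch stops at building the requests.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

BASE = "http://localhost:8080"  # assumed from the curl examples
COMMON = {"commit": "0vj85ifuvfcn4vwqf7w4mo2kfa3ekkn", "domain": "admin/star_wars"}


def search_request(query_text):
    """Prepare (but do not send) the equivalent of `curl ... -d "<query>"`."""
    url = f"{BASE}/search?{urlencode(COMMON, safe='/')}"
    # The query text travels in the request body, as with curl's -d.
    return Request(url, data=query_text.encode("utf-8"), method="POST")


def similar_url(external_id):
    """Build the /similar URL; parameters after the first are joined with &."""
    params = {**COMMON, "id": external_id}
    return f"{BASE}/similar?{urlencode(params, safe='/')}"


req = search_request("Wise old man")
url = similar_url("MyExternalID")
print(req.get_method(), req.full_url)
print(url)
# Parsing the URL back shows three separate parameters.
print(parse_qs(urlsplit(url).query))
```

With a server running, `urllib.request.urlopen(req)` would send the prepared search request.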
## Todo

There is a lot of work left to make this the open source versioned vector database
that the world deserves. Anyone who wants to work on the project to
advance these aims is welcome:

* Add other AI configurations for obtaining the embeddings: we'd like
  to be very complete and have ways of configuring other vendors and
  open source text-to-embedding systems.
* Greater scope of metric support
* Improve compression: We'd like to have a system of vector compression
  such as PQ (product quantization) for dealing with very large datasets.
* Better treatment of deletion and replacement
* Better incrementality of the index structure
* Smaller graph representations of the indices, using succinct data
  structures to reduce memory overhead.

And if you have new ideas, we'd love to hear them!
