Disorder


Russian version

Description

This project is a gRPC server written in Rust.
It stores multiple text fragments inside audio files and provides lightning-fast semantic search over them.
Among other things, it can be used as the retriever component in a RAG system.

UPD: The project is not finished yet; improvements will be added as they are ready.

Functionality

  • Index building: creating an HNSW search index from vector representations.
  • Semantic search: performing fast vector similarity search on stored text fragments.
  • Parallel processing: search runs in parallel for faster results.
  • No database required: all data is stored locally in WAV audio files and JSON metadata.

Embeddings (vector representations of text) are created with a preloaded local model, without calling external AI APIs. For example, you can use the following models:

To use a model, download the following files and place them in the project's ./model directory:

  • model.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json

Configuration

The following fields are set in config.yaml (a sample file is shown after the list):

  • Auth
    • username - username (for basic auth).
    • password - password (for basic auth).
  • Server
    • host - host to run the gRPC server.
    • port - port to run the gRPC server.
  • Logging
    • log_level - log/trace level.
  • RateLimit
    • capacity - maximum number of tokens (bucket capacity).
    • refill_rate - number of tokens added per time interval (refill_interval_ms).
    • refill_interval_ms - duration of refill interval (in milliseconds).
  • App
    • model_dir - directory to store model files (example ./model).
    • audio_dir - directory to store audio files (example ./output/audio).
    • output_dir - directory (parent) for indexing results (example ./output).
    • index_path - path to the search index file (example ./output/hnsw.idx).
    • storage_path - path to the file with metadata (example ./output/storage.json).
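
A minimal config.yaml might look like this. The section and key names follow the list above, but the exact nesting, casing, and values are assumptions for illustration; check the config.yaml shipped with the repository for the authoritative layout.

auth:
  username: admin
  password: admin

server:
  host: 0.0.0.0
  port: 9090

logging:
  log_level: info

rate_limit:
  capacity: 100
  refill_rate: 50
  refill_interval_ms: 100

app:
  model_dir: ./model
  audio_dir: ./output/audio
  output_dir: ./output
  index_path: ./output/hnsw.idx
  storage_path: ./output/storage.json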

The structure of the /output directory after building the index:

output/
├── audio/
│   ├── batch_0.wav     # First 500 text chunks
│   ├── batch_1.wav     # Next 500 text chunks
│   └── ...
├── hnsw.idx            # Search index with embeddings
└── storage.json        # Metadata and batch information

Technical details

  • Audio encoding: 16-bit WAV files (mono), 48 kHz sampling rate.
  • Batch size: 500 text fragments per audio file (configurable in audio.rs).
  • Embedding model: any embedding model can be used (examples above).
  • Search algorithm: HNSW with cosine similarity, plus a fallback parallel linear search (sketched below).
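
For illustration, the fallback linear search can be thought of as a parallel cosine-similarity scan over all stored embeddings. The sketch below is an assumption-based illustration, not the project's actual code: it assumes embeddings are held in memory as Vec<Vec<f32>> and uses the rayon crate for parallelism.

use rayon::prelude::*;

/// Cosine similarity between two equally sized vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

/// Parallel linear scan: score every stored embedding against the query,
/// keep those above min_similarity, and return the top_k best matches.
fn linear_search(
    query: &[f32],
    embeddings: &[Vec<f32>],
    top_k: usize,
    min_similarity: f32,
) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .par_iter()
        .enumerate()
        .map(|(i, e)| (i, cosine_similarity(query, e)))
        .filter(|(_, score)| *score >= min_similarity)
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(top_k);
    scored
}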

Audio file batch_n structure (after encoding)

  • First, the number of chunks (4 bytes)
  • Then for each chunk:
    • Length (4 bytes)
    • Data (N bytes)
  • Trailing zeros (500) as padding (an encoding sketch follows this list)
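
Below is a hedged sketch of how such a batch payload could be assembled and written as a 16-bit mono WAV file at 48 kHz, using the hound crate. The real encoder lives in audio.rs and may differ in details such as byte order and how payload bytes are mapped onto samples; those choices here are assumptions.

use hound::{SampleFormat, WavSpec, WavWriter};

/// Serialize chunks as: chunk count (4 bytes), then length (4 bytes) + data for each chunk,
/// followed by 500 trailing zeros as padding. Little-endian lengths are an assumption.
fn encode_batch(chunks: &[&str]) -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&(chunks.len() as u32).to_le_bytes());
    for chunk in chunks {
        let data = chunk.as_bytes();
        bytes.extend_from_slice(&(data.len() as u32).to_le_bytes());
        bytes.extend_from_slice(data);
    }
    bytes.extend(std::iter::repeat(0u8).take(500));
    bytes
}

/// Write the payload as 16-bit mono samples at 48 kHz.
/// Assumption: one payload byte per sample; the real mapping may pack two bytes per sample.
fn write_batch_wav(path: &str, payload: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    let spec = WavSpec {
        channels: 1,
        sample_rate: 48_000,
        bits_per_sample: 16,
        sample_format: SampleFormat::Int,
    };
    let mut writer = WavWriter::create(path, spec)?;
    for &b in payload {
        writer.write_sample(b as i16)?;
    }
    writer.finalize()?;
    Ok(())
}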

Rate limiting uses the Token Bucket algorithm.
Keep in mind that this algorithm allows bursts when tokens have accumulated (i.e. the bucket is full).
It is currently implemented via the third-party rater crate.
The rate limit applies to all routes combined.
To calculate the resulting RPS, use the formula refill_rate * 1000 / refill_interval_ms.
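
As a concrete illustration of that formula (the limiting itself is done by the rater crate; this is just the arithmetic):

/// Sustained requests-per-second allowed by the bucket configuration.
fn max_rps(refill_rate: u64, refill_interval_ms: u64) -> f64 {
    refill_rate as f64 * 1000.0 / refill_interval_ms as f64
}

fn main() {
    // Example: 50 tokens added every 100 ms gives 500 RPS sustained,
    // with bursts of up to `capacity` requests when the bucket is full.
    println!("{} RPS", max_rps(50, 100)); // prints "500 RPS"
}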

Usage

To send requests to the server, take text_indexer.proto from the ./proto directory and use it in your client. You can also test the endpoints with, for example, Postman.

Request structure for rpc BuildIndex:

Metadata:
  • authorization: Basic <base64_token> - username:password authentication token in base64 format.
  • correlation-id: <id> - identifier for request tracing (if not specified, the server will generate its own).
Message:

If the text file is located on the server, you can specify the path to it:

{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "file_path": "./articles_on_various_topics.txt",
  "chunk_size": 150
}

Otherwise, you can pass the text directly in the request (base64-encoded):

{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "content": "text in base64 format",
  "chunk_size": 150
}

As a result, the server will return JSON of the following form:

{
  "id": "123e4567-e89b-12d3-a456-426614174000"
}
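
For example, a BuildIndex call might look like this with grpcurl. The fully qualified method name text_indexer.TextIndexer/BuildIndex is an assumption for illustration; take the actual package and service names from text_indexer.proto.

# Encode the basic-auth credentials configured in config.yaml
TOKEN=$(echo -n 'username:password' | base64)

grpcurl -plaintext \
  -proto ./proto/text_indexer.proto \
  -H "authorization: Basic $TOKEN" \
  -H "correlation-id: 42" \
  -d '{"id": "123e4567-e89b-12d3-a456-426614174000", "file_path": "./articles_on_various_topics.txt", "chunk_size": 150}' \
  localhost:9090 text_indexer.TextIndexer/BuildIndex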

Request structure for rpc SearchIndex:

Metadata:
  • authorization: Basic <base64_token> - username:password authentication token in base64 format.
  • correlation-id: <id> - identifier for request tracing (if not specified, the server will generate its own).
Message:
{
  "id": "123e4567-e89b-12d3-a456-426614174001",
  "query": "Scientific discoveries of the Hubble Space Telescope",
  "top_k": 5,
  "min_similarity": 0.3
}

As a result, the server will return JSON of the following form:

{
   "id": "123e4567-e89b-12d3-a456-426614174001",
   "results": [
      {
         "text": "One of the Hubble Space Telescope's major discoveries is evidence that the expansion of the Universe is accelerating, driven by dark energy.",
         "score": 0.9029404520988464
      },
      {
         "text": "The Hubble Space Telescope has captured amazing star formation in nebulae such as Orion and Aquila.",
         "score": 0.8357565402984619
      },
      {
         "text": "Launched in 1990, the Hubble Space Telescope has become one of the most important instruments in the history of astronomy.",
         "score": 0.7911539673805237
      },
      {
         "text": "In three decades of operation, Hubble has helped to clarify the age of the Universe and prove the existence of dark matter.",
         "score": 0.7869136929512024
      },
      {
         "text": "In some cultures, crows are considered a symbol of wisdom, and science backs up that reputation.",
         "score": 0.3231840753555298
      }
   ]
}

Local startup

  1. To install Rust on Unix-like systems (macOS, Linux, ...), run the command below in a terminal. Once the download completes, you will have the latest stable version of Rust for your platform, as well as the latest version of Cargo.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. Run the following command in the terminal to verify the installation.
    If step 1 was successful, you will see something like cargo 1.88.0 ....
cargo --version
  3. Clone the project from GitHub, open it, and run the following commands.

Check that the code compiles (without running it).

cargo check

Build and run the project (in release mode, with optimizations).

cargo run --release

UPD: If you are on Windows, see the instructions here.

Local deployment

To deploy a project locally in Docker, you need to:

  1. Make sure Docker daemon is running.
  2. Make sure the embedding model files are present in the project's ./model directory (downloaded and added as described above).
  3. Open a terminal in the project root and run the build command (for example docker build -t disorder-server .).
  4. Once the image is built, run the container (for example docker run --rm -p 9090:9090 disorder-server).
  5. Enjoy using the service.

Acknowledgments

This project was inspired by memau, a project that stores data in audio files.

License

This project is licensed under either the MIT License or the Apache License 2.0, at your option.
