Disorder


Russian version

Description

This project is a gRPC server written in Rust.
It stores multiple text fragments inside audio files and provides lightning-fast semantic search over them.
Among other things, it can be used as the retriever component in a RAG system.

UPD: The project is not finished yet; improvements will be added as they are ready.

Functionality

  • Index building: creating an HNSW search index from vector representations.
  • Semantic search: performing fast vector similarity search on stored text fragments.
  • Parallel processing: search runs in parallel for faster results.
  • No database required: all data is stored locally in WAV audio files and JSON metadata.

Embeddings (vector representations of text) are created with a preloaded local model, without calling external AI APIs. For example, you can use the following models:

To use a model, download the following files and place them in the project's ./model directory:

  • model.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json

Configuration

The following fields are set in config.yaml (a sample file is shown after the list):

  • Auth
    • username - username (for basic auth).
    • password - password (for basic auth).
  • Server
    • host - host to run the gRPC server.
    • port - port to run the gRPC server.
  • Logging
    • log_level - log/trace level.
  • RateLimit
    • capacity - maximum number of tokens (bucket capacity).
    • refill_rate - number of tokens added per time interval (refill_interval_ms).
    • refill_interval_ms - duration of refill interval (in milliseconds).
  • App
    • model_dir - directory to store model files (example ./model).
    • audio_dir - directory to store audio files (example ./output/audio).
    • output_dir - directory (parent) for indexing results (example ./output).
    • index_path - path to the search index file (example ./output/hnsw.idx).
    • storage_path - path to the file with metadata (example ./output/storage.json).
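
A minimal config.yaml might look like this. The section and key names follow the list above, but the exact nesting, casing, and values are assumptions for illustration; check the config.yaml shipped with the repository for the authoritative layout.

auth:
  username: admin
  password: admin

server:
  host: 0.0.0.0
  port: 9090

logging:
  log_level: info

rate_limit:
  capacity: 100
  refill_rate: 50
  refill_interval_ms: 100

app:
  model_dir: ./model
  audio_dir: ./output/audio
  output_dir: ./output
  index_path: ./output/hnsw.idx
  storage_path: ./output/storage.json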

The structure of the /output directory after building the index:

output/
├── audio/
│   ├── batch_0.wav     # First 500 text chunks
│   ├── batch_1.wav     # Next 500 text chunks
│   └── ...
├── hnsw.idx            # Search index with embeddings
└── storage.json        # Metadata and batch information

Technical details

  • Audio encoding: 16-bit WAV files (mono), 48 kHz sampling rate.
  • Batch size: 500 text fragments per audio file (configurable in audio.rs).
  • Embedding model: any embedding model can be used (examples above).
  • Search algorithm: HNSW with cosine similarity, plus a fallback parallel linear search (sketched below).
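
For illustration, the fallback linear search can be thought of as a parallel cosine-similarity scan over all stored embeddings. The sketch below is an assumption-based illustration, not the project's actual code: it assumes embeddings are held in memory as Vec<Vec<f32>> and uses the rayon crate for parallelism.

use rayon::prelude::*;

/// Cosine similarity between two equally sized vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

/// Parallel linear scan: score every stored embedding against the query,
/// keep those above min_similarity, and return the top_k best matches.
fn linear_search(
    query: &[f32],
    embeddings: &[Vec<f32>],
    top_k: usize,
    min_similarity: f32,
) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .par_iter()
        .enumerate()
        .map(|(i, e)| (i, cosine_similarity(query, e)))
        .filter(|(_, score)| *score >= min_similarity)
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(top_k);
    scored
}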

Audio file batch_n structure (after encoding)

  • First, the number of chunks (4 bytes)
  • Then for each chunk:
    • Length (4 bytes)
    • Data (N bytes)
  • Trailing zeros (500) as padding (an encoding sketch follows this list)
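
Below is a hedged sketch of how such a batch payload could be assembled and written as a 16-bit mono WAV file at 48 kHz, using the hound crate. The real encoder lives in audio.rs and may differ in details such as byte order and how payload bytes are mapped onto samples; those choices here are assumptions.

use hound::{SampleFormat, WavSpec, WavWriter};

/// Serialize chunks as: chunk count (4 bytes), then length (4 bytes) + data for each chunk,
/// followed by 500 trailing zeros as padding. Little-endian lengths are an assumption.
fn encode_batch(chunks: &[&str]) -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&(chunks.len() as u32).to_le_bytes());
    for chunk in chunks {
        let data = chunk.as_bytes();
        bytes.extend_from_slice(&(data.len() as u32).to_le_bytes());
        bytes.extend_from_slice(data);
    }
    bytes.extend(std::iter::repeat(0u8).take(500));
    bytes
}

/// Write the payload as 16-bit mono samples at 48 kHz.
/// Assumption: one payload byte per sample; the real mapping may pack two bytes per sample.
fn write_batch_wav(path: &str, payload: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    let spec = WavSpec {
        channels: 1,
        sample_rate: 48_000,
        bits_per_sample: 16,
        sample_format: SampleFormat::Int,
    };
    let mut writer = WavWriter::create(path, spec)?;
    for &b in payload {
        writer.write_sample(b as i16)?;
    }
    writer.finalize()?;
    Ok(())
}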

Rate limiting uses the Token Bucket algorithm.
Keep in mind that this algorithm allows bursts when tokens have accumulated (i.e. the bucket is full).
It is currently implemented via the third-party rater crate.
The rate limit applies to all routes combined.
To calculate the resulting RPS, use the formula refill_rate * 1000 / refill_interval_ms.
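
As a concrete illustration of that formula (the limiting itself is done by the rater crate; this is just the arithmetic):

/// Sustained requests-per-second allowed by the bucket configuration.
fn max_rps(refill_rate: u64, refill_interval_ms: u64) -> f64 {
    refill_rate as f64 * 1000.0 / refill_interval_ms as f64
}

fn main() {
    // Example: 50 tokens added every 100 ms gives 500 RPS sustained,
    // with bursts of up to `capacity` requests when the bucket is full.
    println!("{} RPS", max_rps(50, 100)); // prints "500 RPS"
}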

Usage

To send requests to the server, take text_indexer.proto from the ./proto directory and use it in your client. You can also test the endpoints with, for example, Postman.

Request structure for rpc BuildIndex:

Metadata:
  • authorization: Basic <base64_token> - username:password authentication token in base64 format.
  • correlation-id: <id> - identifier for request tracing (if not specified, the server will generate its own).
Message:

If the text file is located on the server, you can specify the path to it:

{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "file_path": "./articles_on_various_topics.txt",
  "chunk_size": 150
}

Otherwise, you can pass the text directly in the request (base64-encoded):

{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "content": "text in base64 format",
  "chunk_size": 150
}

As a result, the server will return JSON of the following form:

{
  "id": "123e4567-e89b-12d3-a456-426614174000"
}
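
For example, a BuildIndex call might look like this with grpcurl. The fully qualified method name text_indexer.TextIndexer/BuildIndex is an assumption for illustration; take the actual package and service names from text_indexer.proto.

# Encode the basic-auth credentials configured in config.yaml
TOKEN=$(echo -n 'username:password' | base64)

grpcurl -plaintext \
  -proto ./proto/text_indexer.proto \
  -H "authorization: Basic $TOKEN" \
  -H "correlation-id: 42" \
  -d '{"id": "123e4567-e89b-12d3-a456-426614174000", "file_path": "./articles_on_various_topics.txt", "chunk_size": 150}' \
  localhost:9090 text_indexer.TextIndexer/BuildIndex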

Request structure for rpc SearchIndex:

Metadata:
  • authorization: Basic <base64_token> - username:password authentication token in base64 format.
  • correlation-id: <id> - identifier for request tracing (if not specified, the server will generate its own).
Message:
{
  "id": "123e4567-e89b-12d3-a456-426614174001",
  "query": "Scientific discoveries of the Hubble Space Telescope",
  "top_k": 5,
  "min_similarity": 0.3
}

As a result, the server will return JSON of the following form:

{
   "id": "123e4567-e89b-12d3-a456-426614174001",
   "results": [
      {
         "text": "One of the Hubble Space Telescope's major discoveries is evidence that the expansion of the Universe is accelerating, driven by dark energy.",
         "score": 0.9029404520988464
      },
      {
         "text": "The Hubble Space Telescope has captured amazing star formation in nebulae such as Orion and Aquila.",
         "score": 0.8357565402984619
      },
      {
         "text": "Launched in 1990, the Hubble Space Telescope has become one of the most important instruments in the history of astronomy.",
         "score": 0.7911539673805237
      },
      {
         "text": "In three decades of operation, Hubble has helped to clarify the age of the Universe and prove the existence of dark matter.",
         "score": 0.7869136929512024
      },
      {
         "text": "In some cultures, crows are considered a symbol of wisdom, and science backs up that reputation.",
         "score": 0.3231840753555298
      }
   ]
}

Local startup

  1. To install Rust on Unix-like systems (macOS, Linux, ...), run the command below in a terminal. Once the download completes, you will have the latest stable version of Rust for your platform, as well as the latest version of Cargo.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. Run the following command in the terminal to verify the installation.
    If step 1 was successful, you will see something like cargo 1.88.0 ....
cargo --version
  3. Clone the project from GitHub, open it, and run the following commands.

Check that the code compiles (without running it).

cargo check

Build and run the project (in release mode, with optimizations).

cargo run --release

UPD: If you are on Windows, see the instructions here.

Local deployment

To deploy a project locally in Docker, you need to:

  1. Make sure Docker daemon is running.
  2. Make sure the embedding model files are present in the project's ./model directory (downloaded and added as described above).
  3. Open a terminal in the project root and run the build command (for example docker build -t disorder-server .).
  4. Once the image is built, run the container (for example docker run --rm -p 9090:9090 disorder-server).
  5. Enjoy using the service.

Acknowledgments

This project was inspired by memau, a project that stores data in audio files.

License

This project is licensed under either the MIT License or the Apache License 2.0, at your option.
