A FastAPI application that demonstrates Retrieval-Augmented Generation (RAG) concepts, combining:
- Audio/video transcription via AssemblyAI
- Semantic search over text sections with FAISS
- Optional LLaMA-based embeddings, or a fake embeddings class for testing
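
To make the search component concrete, here is a minimal, self-contained sketch of semantic search with FAISS. The vectors and section names are made up for illustration; in the application itself they come from the configured embedding provider.

```python
import faiss
import numpy as np

# Toy corpus: three text sections with hand-written 4-d "embeddings".
sections = ["intro to RAG", "FAISS indexing", "AssemblyAI transcription"]
vectors = np.array([
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.0],
    [0.0, 0.2, 0.9, 0.4],
], dtype="float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2-distance index
index.add(vectors)

# Embed the query the same way, then retrieve the two nearest sections.
query = np.array([[0.7, 0.2, 0.2, 0.1]], dtype="float32")
distances, ids = index.search(query, 2)
print([sections[i] for i in ids[0]])  # -> ['FAISS indexing', 'intro to RAG']
```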

- Upload or record audio/video, transcribe with AssemblyAI, and get SRT/VTT subtitles.
- Semantic Search over the transcribed text or your own documents (markdown-based).
- Modular design following clean code principles, with separate classes for embeddings, search, and server.
- Flexible Embeddings with support for llama.cpp or custom embedding providers.
- Clone this repository:
  ```bash
  git clone https://github.com/boringresearch/rag-demo.git
  cd rag-demo
  ```
- Create and activate a virtual environment (recommended):
  ```bash
  python -m venv venv
  source venv/bin/activate    # on Linux/Mac
  .\venv\Scripts\activate     # on Windows
  ```
- Install the requirements (Python <= 3.11):
  ```bash
  pip install -r requirements.txt
  ```
- Configure the environment (a sample `.env` is sketched after these setup steps):
  - Copy `.env.example` to `.env`
  - Insert your `ASSEMBLYAI_API_KEY` inside `.env`
  - Configure your embedding API URL in `.env` if using a custom embedding service
- Run the application:
  ```bash
  python src/main.py
  ```
  The server listens on http://localhost:8002.
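
For reference, a filled-in `.env` might look like the sketch below. Only `ASSEMBLYAI_API_KEY` and `EMBEDDING_PROVIDER` are named in this README, so the other key name here is an assumption; treat `.env.example` as authoritative.

```
# Hypothetical .env -- confirm key names and values against .env.example
ASSEMBLYAI_API_KEY=your-assemblyai-key

# Assumed key name for the embedding service URL (llama.cpp default address)
EMBEDDING_API_URL=http://localhost:8080

# "fake" selects the FakeEmbeddings provider for offline testing;
# see .env.example for the default provider name
EMBEDDING_PROVIDER=fake
```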
- Open your browser at http://localhost:8002.
- Upload an audio/video file or choose an example to see transcription.
- Type a query in the search box to perform semantic search over the transcribed content or custom text.
This project uses llama.cpp for generating embeddings by default. To set up the embedding server:
- Install llama.cpp following the instructions in their repository
- Download a compatible model (e.g., a GGUF-format model)
- Run llama-server with embeddings enabled (`-c` sets the context size and `-ngl` the number of model layers to offload to the GPU):
  ```bash
  ./llama-server -m model-f16.gguf --embeddings -c 512 -ngl 99 --host 0.0.0.0
  ```
- Update your `.env` file with the correct embedding API URL (default: `http://localhost:8080`)
The llama.cpp server expects embedding requests in the following format:
```
POST /embedding
{
  "content": "text to embed"
}
```
The response will contain the embedding vector:
```
{
  "embedding": [0.123, 0.456, ...]
}
```
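
As a quick sanity check, you can exercise this endpoint from Python. A minimal sketch using the `requests` library, assuming the default server address above and the request/response format just described:

```python
import numpy as np
import requests

EMBEDDING_API_URL = "http://localhost:8080"  # default from this README

def embed(text: str) -> np.ndarray:
    """Fetch an embedding vector from the llama.cpp server."""
    resp = requests.post(f"{EMBEDDING_API_URL}/embedding",
                         json={"content": text}, timeout=30)
    resp.raise_for_status()
    return np.array(resp.json()["embedding"], dtype="float32")

# Similar sentences should land closer together in embedding space.
a, b = embed("a cat sat on the mat"), embed("a feline rested on the rug")
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")
```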
If you encounter issues with the embedding API:
- Check that the llama-server is running with the `--embeddings` flag
- Verify that the API URL in your `.env` file matches the server address
- Test the API directly using curl:
  ```bash
  curl -X POST http://localhost:8080/embedding \
    -H "Content-Type: application/json" \
    -d '{"content":"test text"}'
  ```
- Check the server logs for any error messages
- Try the FakeEmbeddings provider for testing by setting `EMBEDDING_PROVIDER=fake` in your `.env` file
The project is designed to make it easy to switch between different embedding providers:
- Create a new class that implements the `EmbeddingsBase` interface in `src/embeddings/` (see the sketch after this list)
- Update the `TermsSearchEngine` initialization in `src/server/app.py` to use your custom embeddings class
- Alternatively, set the `EMBEDDING_PROVIDER` environment variable to switch between implemented providers
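
The exact methods required by `EmbeddingsBase` are defined in `src/embeddings/base.py`; the sketch below assumes a single `embed(text)` method and uses a hypothetical class name, so adapt both to the real interface. It implements a deterministic pseudo-embedding, similar in spirit to the bundled FakeEmbeddings provider:

```python
import hashlib
import numpy as np

class HashEmbeddings:  # hypothetical; subclass EmbeddingsBase in the real project
    """Deterministic pseudo-embeddings: the same text always yields the same vector."""

    def __init__(self, dim: int = 384):
        self.dim = dim

    def embed(self, text: str) -> np.ndarray:  # assumed interface method
        # Seed a PRNG from the text's hash so results are reproducible.
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
        rng = np.random.default_rng(seed)
        vec = rng.standard_normal(self.dim).astype("float32")
        return vec / np.linalg.norm(vec)  # unit norm, so dot product = cosine
```

Deterministic vectors keep test runs reproducible without a model server running; wire the class in through the `TermsSearchEngine` initialization noted above.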
```
rag-demo/
├── LICENSE
├── README.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── requirements.txt
├── .env.example
├── src/
│   ├── main.py
│   ├── server/
│   │   ├── __init__.py
│   │   └── app.py
│   ├── embeddings/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── fake.py
│   │   └── llama.py
│   ├── search/
│   │   ├── __init__.py
│   │   └── terms_search_engine.py
│   └── templates/
│       └── index.html
├── static/
├── cache/
├── examples/
└── uploads/
```
This project is licensed under the MIT License.