Characters that remember and share experiences with you in games.
🎥 Watch the 1-minute Demo Video
Vision: This project is the proof-of-concept for a future startup aimed at revolutionizing interactive characters in gaming.
- 🧠 Brain: GPT OSS 120B — conscious reasoning and conversation
- 💾 Subconscious: Qdrant — stores (see the memory-record sketch after this list):
  - image-memories: CLIP embeddings + scene description + location
  - event-memories: event description embeddings + text + location
- 👁️ Vision: Liquid AI's LFM-2-VL-450M generates image descriptions via llama-cpp
- 🗣️ Voice:
  - Transcription: whisper-cpp
  - Speech synthesis: PiperTTS via libpiper (using onnx-runtime and espeak-ng)
- 🧩 Embeddings:
  - Image embeddings: clip-cpp
  - Text embeddings: all-MiniLM-L6-v2 (via onnx-runtime, HuggingFace tokenizers through the Rust C ABI)
 - 🎮 Engine: Unreal Engine 5.6 — fully integrated via custom C++ modules
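For illustration, the two memory shapes stored in Qdrant might look roughly like this on the Unreal side. This is a sketch only: the field names and vector sizes are assumptions, not the project's actual types.

```cpp
// Illustrative only: hypothetical structs mirroring the two memory kinds above.
#include "CoreMinimal.h"

struct FImageMemory
{
    TArray<float> ClipEmbedding;   // CLIP-ViT-B-32 image vector (512 floats)
    FString SceneDescription;      // caption produced locally by LFM2-VL
    FVector Location;              // world position where the memory was formed
};

struct FEventMemory
{
    TArray<float> TextEmbedding;   // all-MiniLM-L6-v2 vector (384 floats)
    FString EventText;             // natural-language description of the event
    FVector Location;              // where the event happened
};
```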
 
Memories are retrieved via semantic similarity: the NPC recalls moments, not strings.
For example, if the NPC passes through a dangerous area, then the next time the player starts a session the NPC already knows where that area is and why it is dangerous.
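A minimal sketch of such a lookup, assuming Qdrant's REST search endpoint on `localhost:6333` and a collection named `event-memories` (both assumptions; the project's actual QdrantSubsystem may differ):

```cpp
// Minimal sketch, not the project's QdrantSubsystem: embed the query text first
// (all-MiniLM-L6-v2), then ask Qdrant for the nearest stored memories.
// Requires the "HTTP" module in the game's Build.cs.
#include "CoreMinimal.h"
#include "HttpModule.h"
#include "Interfaces/IHttpRequest.h"
#include "Interfaces/IHttpResponse.h"

void RecallMemories(const TArray<float>& QueryEmbedding)
{
    // Serialize the query vector into Qdrant's search request body.
    FString Vector;
    for (int32 i = 0; i < QueryEmbedding.Num(); ++i)
    {
        Vector += FString::SanitizeFloat(QueryEmbedding[i]);
        if (i + 1 < QueryEmbedding.Num())
        {
            Vector += TEXT(",");
        }
    }
    const FString Body = FString::Printf(
        TEXT("{\"vector\":[%s],\"limit\":5,\"with_payload\":true}"), *Vector);

    TSharedRef<IHttpRequest, ESPMode::ThreadSafe> Request = FHttpModule::Get().CreateRequest();
    Request->SetURL(TEXT("http://localhost:6333/collections/event-memories/points/search"));
    Request->SetVerb(TEXT("POST"));
    Request->SetHeader(TEXT("Content-Type"), TEXT("application/json"));
    Request->SetContentAsString(Body);
    Request->OnProcessRequestComplete().BindLambda(
        [](FHttpRequestPtr, FHttpResponsePtr Response, bool bSucceeded)
        {
            if (bSucceeded && Response.IsValid())
            {
                // Each hit's payload carries the memory text + location,
                // which can then be injected into the LLM prompt.
                UE_LOG(LogTemp, Log, TEXT("Recalled memories: %s"), *Response->GetContentAsString());
            }
        });
    Request->ProcessRequest();
}
```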
Heavy parallelism and optimization let all of this run fast on low- to mid-tier hardware, opening up many possibilities for novel, personalized in-game experiences.
Everything except the conscious orchestrator (gpt-oss-120b) runs locally, resulting in a cost of less than $1 per player per day, assuming 8-10 hours of gameplay.
- VRAM consumption is around 3 GB, enabling gameplay on a mid-tier PC (e.g., i5-12400F + RTX 3050 4 GB)
- Latency is 1-2 seconds per response, resulting in life-like interactions
| Model | Param count | Quantization | Device | Offline (Runs Locally) |
|---|---|---|---|---|
| Vision (LFM2-VL) | 450M | 4-bit | GPU | ✅ |
| Visual Embedding (CLIP-ViT-B-32) | 151M | 4-bit | CPU | ✅ |
| Text Embedding (all-MiniLM-L6-v2) | 22M | None | CPU | ✅ |
| Text to Speech (piper-voices) | 15M | None | CPU | ✅ |
| Speech to Text (whisper-base) | 72M | 5-bit | CPU | ✅ |
| LLM (gpt-oss-120b) | 120B (MoE, ~5B active) | 4-bit | Groq LPU | ❌ |
Local captioning (Liquid AI's LFM2-VL) costs ~100 tokens per image, versus 500+ tokens for uploading even a 512x512 px image directly to the LLM (a sketch of this caption-then-prompt flow follows the list below):
- 5x cheaper LLM usage
- Heavily reduced latency because of the novel parallelism we use
- No images leave the device → privacy preserved (for future use cases)
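A minimal sketch of that flow, where `CaptionImageLocally` is a hypothetical stand-in for the project's VisionSubsystem call; only the short caption text ever reaches the remote LLM:

```cpp
// Sketch only: CaptionImageLocally is a hypothetical wrapper around the local
// LFM2-VL captioner (llama-cpp). The raw frame never leaves the device.
#include "CoreMinimal.h"

FString CaptionImageLocally(const TArray<uint8>& RgbPixels, int32 Width, int32 Height); // assumed helper

FString BuildFramePrompt(const TArray<uint8>& FramePixels, int32 Width, int32 Height)
{
    // ~100 tokens of caption text instead of a 500+ token raw image upload.
    const FString Caption = CaptionImageLocally(FramePixels, Width, Height);
    return FString::Printf(TEXT("The NPC currently sees: %s. Respond in character."), *Caption);
}
```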
- `Source/`
  - `clip/`: clip-cpp source
  - `libpiper/`: libpiper source
  - ClipSubsystem
  - WhisperSubsystem
  - QdrantSubsystem
  - TextEmbeddingSubsystem
  - VisionSubsystem
  - PiperTTSSubsystem
  - LLMSubsystem
- `ThirdParty/`: DLLs of external libraries with headers
  - espeak
  - onnxruntime
  - tokenizers
  - whisper
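For reference, a minimal sketch of how one of these Unreal subsystems might be declared. The class name comes from the list above, but the method and its signature are assumptions for illustration, not the project's actual header:

```cpp
// Hypothetical sketch of one subsystem; the real ClipSubsystem may look different.
#pragma once

#include "CoreMinimal.h"
#include "Subsystems/GameInstanceSubsystem.h"
#include "ClipSubsystem.generated.h"

UCLASS()
class UClipSubsystem : public UGameInstanceSubsystem
{
    GENERATED_BODY()

public:
    virtual void Initialize(FSubsystemCollectionBase& Collection) override;
    virtual void Deinitialize() override;

    // Assumed API: embed a raw RGB image into a CLIP vector via clip-cpp.
    TArray<float> EmbedImage(const TArray<uint8>& RgbPixels, int32 Width, int32 Height);
};
```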
 
 
| Component | License |
|---|---|
| qdrant | Apache 2.0 (GitHub) |
| onnx-runtime | MIT (GitHub) |
| llama.cpp | MIT (GitHub) |
| whisper.cpp | MIT (GitHub) |
| clip.cpp | MIT (GitHub) |
| pipertts | GPL-3.0 (GitHub) |
| espeak-ng | GPL-3.0 (GitHub) |
| Model | License |
|---|---|
| gpt-oss-120b | Apache 2.0 (Hugging Face) |
| all-MiniLM-L6-v2 | Apache 2.0 (Hugging Face) |
| whisper-base-en | Apache 2.0 (Hugging Face) |
| CLIP-ViT-B-32 | MIT (Hugging Face) |
| piper-voices | MIT (Hugging Face) |
| LFM2-VL | LFM Open License v1.0 (Hugging Face) |
© 2025 — Stealth Startup. All Rights Reserved.
