NPCs with Spatio-Temporal Awareness

Characters that remember and share experiences with you in games.

No keywords. No scripts. Just spatio-temporal memory via Qdrant + multimodal embeddings.

🎥 Watch the 1-minute Demo Video

🎥 Extended Demo

Vision: This project is a proof-of-concept for a future startup aimed at revolutionizing interactive characters in gaming.


🌙 How does it work?

  • 🧠 Brain: GPT OSS 120B — conscious reasoning and conversation
  • 💾 Subconscious: Qdrant — stores two kinds of memories (sketched after this list):
    • image-memories: CLIP embeddings + scene description + location
    • event-memories: event description embeddings + text + location
  • 👁️ Vision: Liquid AI's LFM2-VL-450M generates image descriptions via llama.cpp
  • 🗣️ Voice:
    • Transcription: whisper.cpp
    • Speech Synthesis: PiperTTS via libpiper (using onnx-runtime and espeak-ng)
  • 🧩 Embeddings:
    • Image Embeddings: clip.cpp
    • Text Embeddings: all-MiniLM-L6-v2 (via onnx-runtime and Hugging Face tokenizers through its Rust C ABI)
  • 🎮 Engine: Unreal Engine 5.6 — fully integrated via custom C++ modules
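To make the two memory types concrete, here is a minimal sketch of the records one might store in Qdrant. The field names are assumptions based on the ingredients listed above (embedding + description + location), not the repo's actual schema.

```cpp
// Hypothetical shapes of the two memory records described above.
// Field names are illustrative; only the ingredients come from this README.
#include <string>
#include <vector>

struct WorldLocation
{
    float X = 0.f, Y = 0.f, Z = 0.f;
};

// "image-memory": a moment the NPC saw.
struct ImageMemory
{
    std::vector<float> ClipEmbedding;    // CLIP-ViT-B-32 vector (512 dims)
    std::string        SceneDescription; // caption produced by LFM2-VL
    WorldLocation      Location;         // where the snapshot was taken
};

// "event-memory": something that happened around the NPC.
struct EventMemory
{
    std::vector<float> TextEmbedding; // all-MiniLM-L6-v2 vector (384 dims)
    std::string        EventText;     // the event description itself
    WorldLocation      Location;      // where the event occurred
};
```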

Memories are retrieved via semantic similarity. The NPC recalls moments, not strings.

For example, when the NPC passes through a dangerous area, then the next time the player loads the game, the NPC already knows where that area is and why it is dangerous.
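The retrieval path can be pictured as a nearest-neighbour search in Qdrant. The sketch below is not taken from this repo: it calls Qdrant's REST search endpoint with plain libcurl, and the collection name, vector size, and payload fields are assumptions based on the memory layout described above.

```cpp
// Minimal sketch: fetch the nearest stored memories from a local Qdrant
// instance over its REST API. Collection name and payload fields are
// assumptions, not this project's actual configuration.
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append the HTTP response body to a std::string.
static size_t WriteToString(char* Data, size_t Size, size_t Nmemb, void* UserP)
{
    static_cast<std::string*>(UserP)->append(Data, Size * Nmemb);
    return Size * Nmemb;
}

int main()
{
    // The query vector would normally come from all-MiniLM-L6-v2 (384 dims);
    // a tiny placeholder vector keeps the sketch short.
    const std::string Body = R"({
        "vector": [0.12, -0.08, 0.33],
        "limit": 3,
        "with_payload": true
    })";

    CURL* Curl = curl_easy_init();
    if (!Curl) return 1;

    std::string Response;
    curl_slist* Headers = curl_slist_append(nullptr, "Content-Type: application/json");

    // Qdrant search endpoint: POST /collections/<name>/points/search
    curl_easy_setopt(Curl, CURLOPT_URL,
                     "http://localhost:6333/collections/event-memories/points/search");
    curl_easy_setopt(Curl, CURLOPT_HTTPHEADER, Headers);
    curl_easy_setopt(Curl, CURLOPT_POSTFIELDS, Body.c_str());
    curl_easy_setopt(Curl, CURLOPT_WRITEFUNCTION, WriteToString);
    curl_easy_setopt(Curl, CURLOPT_WRITEDATA, &Response);

    const CURLcode Rc = curl_easy_perform(Curl);
    if (Rc == CURLE_OK)
    {
        // Each hit carries a similarity score plus the stored payload,
        // e.g. {"description": "...", "location": {"x":..,"y":..,"z":..}}.
        std::cout << Response << std::endl;
    }

    curl_slist_free_all(Headers);
    curl_easy_cleanup(Curl);
    return Rc == CURLE_OK ? 0 : 1;
}
```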

The parallelism and optimizations behind this let the whole pipeline run fast on low-to-mid-tier devices, opening up a lot of possibilities for novel, personalized in-game experiences.


📊 Performance and Costs to run

Everything except the conscious orchestrator (gpt-oss-120b) runs locally, resulting in minimal costs: less than $1 per player for a whole day, assuming 8-10 hours of gameplay (a back-of-envelope estimate follows the list below).

  • VRAM consumption is around 3 GB, enabling gameplay on a mid-tier machine (e.g., i5-12400F + RTX 3050 4 GB)
  • Latency is 1-2 seconds per response, resulting in life-like interactions
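As a rough illustration of the cost claim, the calculation below assumes about one LLM exchange per minute over a 9-hour session, ~2,000 prompt tokens and ~200 completion tokens per exchange, and placeholder per-token prices. All of these numbers are assumptions for the sketch, not measurements from this project.

```cpp
// Back-of-envelope daily LLM cost per player. Every constant here is an
// illustrative assumption (session length, chat rate, token counts, prices).
#include <cstdio>

int main()
{
    constexpr double Hours            = 9.0;    // midpoint of an 8-10 hour session
    constexpr double ExchangesPerHour = 60.0;   // ~one NPC exchange per minute
    constexpr double PromptTokens     = 2000.0; // context + retrieved memories
    constexpr double OutputTokens     = 200.0;  // NPC reply

    // Placeholder hosted-inference prices (USD per million tokens).
    constexpr double PricePromptPerM = 0.15;
    constexpr double PriceOutputPerM = 0.75;

    const double Exchanges = Hours * ExchangesPerHour; // ~540 exchanges/day
    const double CostUsd =
        Exchanges * (PromptTokens * PricePromptPerM +
                     OutputTokens * PriceOutputPerM) / 1e6;

    std::printf("Estimated LLM cost per player per day: $%.2f\n", CostUsd);
    // ~$0.24 with these assumptions, comfortably under the $1/day figure.
    return 0;
}
```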

Quantizations

| Model | Param count | Quantization | Device | Offline (Runs Locally) |
|---|---|---|---|---|
| Vision (LFM2-VL) | 450M | 4-bit | GPU | Yes |
| Visual Embedding (CLIP-ViT-B-32) | 151M | 4-bit | CPU | Yes |
| Text Embedding (all-MiniLM-L6-v2) | 22M | None | CPU | Yes |
| Text to Speech (piper-voices) | 15M | None | CPU | Yes |
| Speech to Text (whisper-base) | 72M | 5-bit | CPU | Yes |
| LLM (gpt-oss-120b) | 120B (MoE, ~5B active) | 4-bit | Groq LPU | No |

💡 Why not a Vision Language Model in the cloud?

Local captioning (Liquid AI's LFM2-VL) costs ~100 tokens per image, versus 500+ tokens to upload a raw image (even for just a 512x512 px image).

  • 5x cheaper LLM usage
  • Heavily reduced latency because of the parallelism we use
  • No images leave the device → privacy preserved (for future use cases)

🧭 Architecture

Architecture Diagram


📁 What’s in This Repo

  • Source/

    • clip/: clip-cpp source
    • libpiper/: libpiper source
    • ClipSubsystem
    • WhisperSubsystem
    • QdrantSubsystem
    • TextEmbeddingSubsystem
    • VisionSubsystem
    • PiperTTSSubsystem
    • LLMSubsystem
  • ThirdParty/ - DLLs of external libraries, with headers

    • espeak
    • onnxruntime
    • tokenizers
    • whisper
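For orientation, here is a minimal sketch of how one of the subsystems listed above could be exposed inside Unreal Engine, assuming the common UGameInstanceSubsystem pattern. The class name, file name, and method signature are hypothetical, not the repo's actual interface.

```cpp
// Hypothetical sketch of an embedding subsystem following Unreal Engine's
// UGameInstanceSubsystem pattern. Names and signatures are illustrative only.
#pragma once

#include "CoreMinimal.h"
#include "Subsystems/GameInstanceSubsystem.h"
#include "MyTextEmbeddingSubsystem.generated.h"

UCLASS()
class UMyTextEmbeddingSubsystem : public UGameInstanceSubsystem
{
    GENERATED_BODY()

public:
    // Load the ONNX model and tokenizer when the game instance starts.
    virtual void Initialize(FSubsystemCollectionBase& Collection) override;
    virtual void Deinitialize() override;

    // Embed a piece of text (e.g. an event description) into a float vector
    // that can be upserted into or searched against Qdrant.
    UFUNCTION(BlueprintCallable, Category = "Memory")
    bool EmbedText(const FString& Text, TArray<float>& OutEmbedding);
};
```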

Open-Source Licenses

| Component | License |
|---|---|
| qdrant | Apache 2.0 (GitHub) |
| onnx-runtime | MIT (GitHub) |
| llama.cpp | MIT (GitHub) |
| whisper.cpp | MIT (GitHub) |
| clip.cpp | MIT (GitHub) |
| pipertts | GPL-3.0 (GitHub) |
| espeak-ng | GPL-3.0 (GitHub) |

Licenses of Models

| Model | License |
|---|---|
| gpt-oss-120b | Apache 2.0 (Hugging Face) |
| all-MiniLM-L6-v2 | Apache 2.0 (Hugging Face) |
| whisper-base-en | Apache 2.0 (Hugging Face) |
| CLIP-ViT-B-32 | MIT (Hugging Face) |
| piper-voices | MIT (Hugging Face) |
| LFM2-VL | LFM Open License v1.0 (Hugging Face) |

© 2025 — Stealth Startup. All Rights Reserved.
