NPCs with Spatio-Temporal Awareness

Characters that remember and share experiences with you in games.

No keywords. No scripts. Just spatio-temporal memory via Qdrant + multimodal embeddings.

🎥 Watch the 1-minute Demo Video

🎥 Extended Demo

Vision: This project is a proof-of-concept for a future startup aimed at revolutionizing interactive characters in gaming.


🌙 How does it work?

  • 🧠 Brain: GPT OSS 120B — conscious reasoning and conversation
  • 💾 Subconscious: Qdrant — stores two kinds of memories (sketched after this list):
    • image-memories: CLIP embeddings + scene description + location
    • event-memories: event description embeddings + text + location
  • 👁️ Vision: Liquid AI's LFM2-VL-450M generates image descriptions via llama.cpp
  • 🗣️ Voice:
    • Transcription: whisper.cpp
    • Speech Synthesis: PiperTTS via libpiper (using onnx-runtime and espeak-ng)
  • 🧩 Embeddings:
    • Image Embeddings: clip.cpp
    • Text Embeddings: all-MiniLM-L6-v2 (via onnx-runtime and Hugging Face tokenizers through its Rust C ABI)
  • 🎮 Engine: Unreal Engine 5.6 — fully integrated via custom C++ modules
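To make the two memory types concrete, here is a minimal sketch of the records one might store in Qdrant. The field names are assumptions based on the ingredients listed above (embedding + description + location), not the repo's actual schema.

```cpp
// Hypothetical shapes of the two memory records described above.
// Field names are illustrative; only the ingredients come from this README.
#include <string>
#include <vector>

struct WorldLocation
{
    float X = 0.f, Y = 0.f, Z = 0.f;
};

// "image-memory": a moment the NPC saw.
struct ImageMemory
{
    std::vector<float> ClipEmbedding;    // CLIP-ViT-B-32 vector (512 dims)
    std::string        SceneDescription; // caption produced by LFM2-VL
    WorldLocation      Location;         // where the snapshot was taken
};

// "event-memory": something that happened around the NPC.
struct EventMemory
{
    std::vector<float> TextEmbedding; // all-MiniLM-L6-v2 vector (384 dims)
    std::string        EventText;     // the event description itself
    WorldLocation      Location;      // where the event occurred
};
```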

Memories are retrieved via semantic similarity. The NPC recalls moments, not strings.

For example, when the NPC passes through a dangerous area, then the next time the player loads the game, the NPC already knows where that area is and why it is dangerous.
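The retrieval path can be pictured as a nearest-neighbour search in Qdrant. The sketch below is not taken from this repo: it calls Qdrant's REST search endpoint with plain libcurl, and the collection name, vector size, and payload fields are assumptions based on the memory layout described above.

```cpp
// Minimal sketch: fetch the nearest stored memories from a local Qdrant
// instance over its REST API. Collection name and payload fields are
// assumptions, not this project's actual configuration.
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append the HTTP response body to a std::string.
static size_t WriteToString(char* Data, size_t Size, size_t Nmemb, void* UserP)
{
    static_cast<std::string*>(UserP)->append(Data, Size * Nmemb);
    return Size * Nmemb;
}

int main()
{
    // The query vector would normally come from all-MiniLM-L6-v2 (384 dims);
    // a tiny placeholder vector keeps the sketch short.
    const std::string Body = R"({
        "vector": [0.12, -0.08, 0.33],
        "limit": 3,
        "with_payload": true
    })";

    CURL* Curl = curl_easy_init();
    if (!Curl) return 1;

    std::string Response;
    curl_slist* Headers = curl_slist_append(nullptr, "Content-Type: application/json");

    // Qdrant search endpoint: POST /collections/<name>/points/search
    curl_easy_setopt(Curl, CURLOPT_URL,
                     "http://localhost:6333/collections/event-memories/points/search");
    curl_easy_setopt(Curl, CURLOPT_HTTPHEADER, Headers);
    curl_easy_setopt(Curl, CURLOPT_POSTFIELDS, Body.c_str());
    curl_easy_setopt(Curl, CURLOPT_WRITEFUNCTION, WriteToString);
    curl_easy_setopt(Curl, CURLOPT_WRITEDATA, &Response);

    const CURLcode Rc = curl_easy_perform(Curl);
    if (Rc == CURLE_OK)
    {
        // Each hit carries a similarity score plus the stored payload,
        // e.g. {"description": "...", "location": {"x":..,"y":..,"z":..}}.
        std::cout << Response << std::endl;
    }

    curl_slist_free_all(Headers);
    curl_easy_cleanup(Curl);
    return Rc == CURLE_OK ? 0 : 1;
}
```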

The parallelism and optimizations behind this let the whole pipeline run fast on low-to-mid-tier devices, opening up a lot of possibilities for novel, personalized in-game experiences.


📊 Performance and Costs to run

Everything except the conscious orchestrator (gpt-oss-120b) runs locally, resulting in minimal costs: less than $1 per player for a whole day, assuming 8-10 hours of gameplay (a back-of-envelope estimate follows the list below).

  • VRAM consumption is around 3 GB, enabling gameplay on a mid-tier machine (e.g., i5-12400F + RTX 3050 4 GB)
  • Latency is 1-2 seconds per response, resulting in life-like interactions
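As a rough illustration of the cost claim, the calculation below assumes about one LLM exchange per minute over a 9-hour session, ~2,000 prompt tokens and ~200 completion tokens per exchange, and placeholder per-token prices. All of these numbers are assumptions for the sketch, not measurements from this project.

```cpp
// Back-of-envelope daily LLM cost per player. Every constant here is an
// illustrative assumption (session length, chat rate, token counts, prices).
#include <cstdio>

int main()
{
    constexpr double Hours            = 9.0;    // midpoint of an 8-10 hour session
    constexpr double ExchangesPerHour = 60.0;   // ~one NPC exchange per minute
    constexpr double PromptTokens     = 2000.0; // context + retrieved memories
    constexpr double OutputTokens     = 200.0;  // NPC reply

    // Placeholder hosted-inference prices (USD per million tokens).
    constexpr double PricePromptPerM = 0.15;
    constexpr double PriceOutputPerM = 0.75;

    const double Exchanges = Hours * ExchangesPerHour; // ~540 exchanges/day
    const double CostUsd =
        Exchanges * (PromptTokens * PricePromptPerM +
                     OutputTokens * PriceOutputPerM) / 1e6;

    std::printf("Estimated LLM cost per player per day: $%.2f\n", CostUsd);
    // ~$0.24 with these assumptions, comfortably under the $1/day figure.
    return 0;
}
```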

Quantizations

| Model | Param count | Quantization | Device | Offline (Runs Locally) |
|---|---|---|---|---|
| Vision (LFM2-VL) | 450M | 4-bit | GPU | Yes |
| Visual Embedding (CLIP-ViT-B-32) | 151M | 4-bit | CPU | Yes |
| Text Embedding (all-MiniLM-L6-v2) | 22M | None | CPU | Yes |
| Text to Speech (piper-voices) | 15M | None | CPU | Yes |
| Speech to Text (whisper-base) | 72M | 5-bit | CPU | Yes |
| LLM (gpt-oss-120b) | 120B (MoE, ~5B active) | 4-bit | Groq LPU | No |

💡 Why not a Vision Language Model in the cloud?

Local captioning (Liquid AI's LFM2-VL) costs ~100 tokens per image, versus 500+ tokens to upload a raw image (even for just a 512x512 px image).

  • 5x cheaper LLM usage
  • Heavily reduced latency because of the parallelism we use
  • No images leave the device → privacy preserved (for future use cases)

🧭 Architecture

Architecture Diagram


📁 What’s in This Repo

  • Source/

    • clip/: clip-cpp source
    • libpiper/: libpiper source
    • ClipSubsystem
    • WhisperSubsystem
    • QdrantSubsystem
    • TextEmbeddingSubsystem
    • VisionSubsystem
    • PiperTTSSubsystem
    • LLMSubsystem
  • ThirdParty/ - DLLs of external libraries, with headers

    • espeak
    • onnxruntime
    • tokenizers
    • whisper
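For orientation, here is a minimal sketch of how one of the subsystems listed above could be exposed inside Unreal Engine, assuming the common UGameInstanceSubsystem pattern. The class name, file name, and method signature are hypothetical, not the repo's actual interface.

```cpp
// Hypothetical sketch of an embedding subsystem following Unreal Engine's
// UGameInstanceSubsystem pattern. Names and signatures are illustrative only.
#pragma once

#include "CoreMinimal.h"
#include "Subsystems/GameInstanceSubsystem.h"
#include "MyTextEmbeddingSubsystem.generated.h"

UCLASS()
class UMyTextEmbeddingSubsystem : public UGameInstanceSubsystem
{
    GENERATED_BODY()

public:
    // Load the ONNX model and tokenizer when the game instance starts.
    virtual void Initialize(FSubsystemCollectionBase& Collection) override;
    virtual void Deinitialize() override;

    // Embed a piece of text (e.g. an event description) into a float vector
    // that can be upserted into or searched against Qdrant.
    UFUNCTION(BlueprintCallable, Category = "Memory")
    bool EmbedText(const FString& Text, TArray<float>& OutEmbedding);
};
```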

Open-Source Licenses

| Component | License |
|---|---|
| qdrant | Apache 2.0 (GitHub) |
| onnx-runtime | MIT (GitHub) |
| llama.cpp | MIT (GitHub) |
| whisper.cpp | MIT (GitHub) |
| clip.cpp | MIT (GitHub) |
| pipertts | GPL-3.0 (GitHub) |
| espeak-ng | GPL-3.0 (GitHub) |

Licenses of Models

| Model | License |
|---|---|
| gpt-oss-120b | Apache 2.0 (Hugging Face) |
| all-MiniLM-L6-v2 | Apache 2.0 (Hugging Face) |
| whisper-base-en | Apache 2.0 (Hugging Face) |
| CLIP-ViT-B-32 | MIT (Hugging Face) |
| piper-voices | MIT (Hugging Face) |
| LFM2-VL | LFM Open License v1.0 (Hugging Face) |

© 2025 — Stealth Startup. All Rights Reserved.
