Deploy this blueprint to create a production-grade, autonomous Data Flywheel service that uses the NeMo Microservices platform to continuously discover and promote more efficient models.
Data Flywheels are an emerging concept in generative AI, but real-world tests within NVIDIA have already identified instances where using a flywheel reduced inference costs by up to 98.6%. There are caveats, which we discuss below, but we believe these early data points warrant attention.
The purpose of this Blueprint is two-fold:
- To provide a production-grade reference implementation of a Data Flywheel on top of the NeMo Microservices platform.
- To educate the community on what Data Flywheels are: what they can do, what they can't do, and what to expect when building them.
You can get started quickly and achieve similar results using your own infrastructure by following the Quickstart guide.
A data flywheel is a process that uses data exhaust from production applications (for example, LLM prompt/response logs, end-user feedback, and expert labeling) to increase the overall accuracy and reduce the latency and cost of generative AI systems. A high-level flow looks like this:
```mermaid
flowchart TD
    app[Your App] --prompts/responses/feedback--> logs[Log service]
    logs --Create Datasets--> orch["Orchestrator"]
    orch --> exp1["Exp #1"]
    orch --> exp2["Exp #2"]
    orch --> expN["Exp #N"]
    exp1 --> results
    exp2 --> results
    expN --> results
```
Production traffic from your application is routed to a centralized logging service. From there, evaluation and fine-tuning datasets are created and used in a series of offline experiments. Anyone who has done this manually knows there are a lot of options to consider when designing the experiments, and many of them depend on the kinds of data you have:
- Which models do you pick?
- In what ways do you curate your datasets?
- Which fine-tuning techniques do you try?
- Which metrics do you use in evals?
It's a lot to decide on. Enter the NeMo Microservices platform.
The NeMo Microservices platform allows for programmatic control of datasets, fine-tuning, evaluation, and inference. This means that rather than having ML engineers manage each experiment, you can automate the exploration of various configurations using sensible defaults, and then present the most promising candidates to a research engineer or machine learning engineer for further analysis.
```mermaid
flowchart TD
    app["Your application"] --Prompt/completion logs--> log_store["Log Store"]
    log_store --Datasets--> datasets["NeMo Datastore"]
    datasets --"Fine-tuning datasets"--> customizer["NeMo Customizer"]
    datasets --"Eval datasets"--> evaluator["NeMo Evaluator"]
    subgraph NIMs["Loop across ALL NIMs"]
        customizer --"Customized model"--> NIM
        evaluator --> NIM
        NIM --> evaluator
    end
    evaluator --> results["Flywheel Results"]
```
In just a few hours, this automated process built on top of NeMo microservices can:
- Pull data from your log store.
- Group it by task (for example, if you have an agent doing multiple things, each node is a different task).
- De-duplicate it (see the sketch below).
- Create eval and fine-tuning datasets from your production traffic and store them in NeMo Datastore.
- Kick off fine-tuning jobs with NeMo Customizer.
- Run evaluations with LLM-as-judge comparisons on NeMo Evaluator.
With reasonable defaults, the system automatically narrows a vast number of possible options down to a manageable set of promising candidates for further analysis, with no manual experiment design required.
👆 This is the key insight of the Data Flywheel Blueprint built on top of NeMo microservices.
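For illustration, the deduplication step above can be as simple as keying each record on a hash of its request payload. A minimal sketch, not the Blueprint's exact code (the helper name `dedup_records` is hypothetical):

```python
# Deduplicate logged records by hashing the request payload (illustrative only;
# the Blueprint's create_datasets task implements its own deduplication logic).
import hashlib
import json

def dedup_records(records: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for rec in records:
        # Canonical JSON so identical prompts hash identically
        key = hashlib.sha256(
            json.dumps(rec["request"], sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```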
You can scale this process across any number of NIMs by using NeMo Deployment Manager to dynamically start and stop NIMs as needed, so you don't have to keep every NIM running at all times. This cycle can be repeated as frequently as desired: daily, weekly, or on your own schedule.
The strategies in this reference implementation have proven effective in some use cases, but our work is ongoing. Some aspects of this implementation may go against common wisdom:
- Routing production traffic directly into fine-tuning without PII removal
- Building evaluation datasets with no ground truth other than what the production model is responding with
- Not doing any hand-labeling of data
Nonetheless, we've shown it can work. More importantly, we believe this idea of collecting production prompt/response logs, running automated experiments to explore a huge solution space, and then flagging interesting candidates for further analysis will become an indispensable part of the future generative AI stack.
Therefore, to effectively use this blueprint:
1. Learn from the reference implementation
   - Play with the Launchable: Walk through setting up NeMo microservices, deploying the reference services, and exercising the flywheel with the provided sample dataset.
   - Read the code & docs: Finish this README and skim the source to understand how the API layer, background tasks, and NeMo microservices integrations fit together.
2. Prepare your traffic
   - Instrument production applications: Every distinct LLM call (agent node, route, tool, etc.) must emit a stable `workload_id`. Optionally include a free-form description string (ignored by inference but useful for future workload classification).
   - Export or connect your logs: If you're already logging prompt/response pairs, write a small connector (or use an existing one) to push them into Elasticsearch or directly into the Flywheel API.
3. Choose how to run the Flywheel
   - Always-on service: Keep the stack running in a shared k8s/VM environment so new traffic is evaluated continuously.
   - Ad-hoc run: Spin it up locally on a workstation with a few GPUs, load a slice of traffic, kick off a job, and shut everything down when the results are in.
4. Kick off a run
   - Load or stream the tagged traffic into Elasticsearch to launch a job (see the sketch after this list). The system will spin up the necessary NIMs, schedule evaluations and fine-tunes, and track everything automatically.
5. Interpret the results
   - The response is grouped by NIM. For each NIM the Flywheel currently runs three experiment types:
     - `base` – raw production prompts replayed.
     - `icl` – few-shot prompts built from random traffic examples.
     - `customized` – a LoRA fine-tune evaluated with the base prompts.
   - Scores are an LLM-as-judge similarity metric in the range `[0, 1]`. Look for high-scoring small models, then download the datasets, LoRA adapters, or model artifacts for manual inspection and further analysis.
6. Keep the human in the loop
   - Think of the Flywheel as a flashlight, not an autopilot. Promotion to production, as well as any deeper evaluation or dataset curation, remains a human decision.
7. Stay up to date
   - Follow the repo for UI improvements, new experiment strategies, and future work on accuracy-oriented flywheels.
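Once traffic is loaded, launching a run is a single API call. A minimal sketch of kicking off a job and polling it; the endpoint path and payload fields here are assumptions, so consult the API Specification for the authoritative contract:

```python
# Hypothetical client for the Flywheel API; route and field names are
# illustrative, not guaranteed to match the deployed API.
import time

import requests

API_URL = "http://localhost:8000"  # assumed default API port (see requirements table)

resp = requests.post(
    f"{API_URL}/api/jobs",
    json={"workload_id": "agent.tool_router", "client_id": "my_demo_app"},
)
resp.raise_for_status()
job_id = resp.json()["id"]  # POST /jobs returns immediately with a job ID

# Poll until the run finishes; results are grouped by NIM.
while True:
    status = requests.get(f"{API_URL}/api/jobs/{job_id}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(30)
print(status)
```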
The Flywheel treats your production prompt / completion logs as the single source of truth. At run-time it only needs to know where to find the logs (e.g. an Elasticsearch index) and how the individual documents are shaped. Since this is a reference implementation, you can modify the code to suit your needs, but the current schemas are defined below should you decide to conform to them.
Each Elasticsearch document must contain the following top-level keys:
| Field | Type | Description |
|---|---|---|
| `timestamp` | `int` (epoch secs) | Time the request was issued |
| `workload_id` | `str` | Stable identifier for the logical task / route / agent node |
| `client_id` | `str` | Identifier of the application or deployment that generated the traffic |
| `request` | `dict` | Exact `openai.ChatCompletion.create` payload received by the model |
| `response` | `dict` | Exact `ChatCompletion` response returned by the model |
A minimal example document therefore looks like:
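The example below is an illustrative reconstruction that follows the schema above; all values are placeholders.

```json
{
  "timestamp": 1715640000,
  "workload_id": "agent.tool_router",
  "client_id": "my_demo_app",
  "request": {
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Who won the 2024 Super Bowl?"}
    ],
    "temperature": 0.3
  },
  "response": {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1715640001,
    "model": "gpt-3.5-turbo",
    "choices": [
      {
        "index": 0,
        "message": {"role": "assistant", "content": "The Kansas City Chiefs won Super Bowl LVIII."},
        "finish_reason": "stop"
      }
    ]
  }
}
```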
Why this shape? Keeping the full request/response allows the Flywheel to replay prompts, build few-shot demonstrations, and fine-tune without lossy conversions.
Tagging matters:

- `client_id` identifies who produced the traffic (for example, a specific microservice, environment, or customer). Multiple workloads can share the same `client_id`.
- `workload_id` is much stricter: it represents a single type of request. If your application is an agent with several nodes, you must assign a different `workload_id` to every node so the Flywheel can evaluate them independently. Treat it as the primary key for slicing, deduplication, and dataset creation.
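Because `workload_id` acts as the primary key, slicing traffic for one task is a single filtered query. A minimal sketch, assuming the Elasticsearch 8 Python client and the index name used in the snippet further below:

```python
# Fetch all logged documents for one client/workload pair (illustrative;
# assumes keyword mappings for client_id and workload_id).
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["http://localhost:9200"])

hits = es.search(
    index="flywheel",
    query={"bool": {"filter": [
        {"term": {"client_id": "my_demo_app"}},
        {"term": {"workload_id": "agent.tool_router"}},
    ]}},
    size=1000,
)["hits"]["hits"]
```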
If you already write request/response logs, you can either route that traffic to a production Elasticsearch instance that you manage or bulk-import the logs into the Elasticsearch instance started by `docker-compose`. For new projects, the snippet below shows how a synchronous OpenAI call can be wrapped so every interaction is written in the expected format.
💡 For a more comprehensive example of instrumenting an application using LangChain with an AI Virtual Assistant (AIVA), see our AIVA Data Logging Example.
```python
# examples/log_to_es.py
import os, time, uuid

from elasticsearch import Elasticsearch
from openai import OpenAI

ES_URL = os.getenv("ELASTICSEARCH_URL", "http://localhost:9200")
ES_INDEX = os.getenv("ES_COLLECTION_NAME", "flywheel")

es = Elasticsearch(hosts=[ES_URL])
openai_client = OpenAI()

CLIENT_ID = "my_demo_app"

# Example agent nodes (each with its own workload_id)
WORKLOADS = {
    "simple_chat": "agent.chat",
    "tool_router": "agent.tool_router",
}

def log_chat(workload_id: str, messages: list[dict]):
    # 1) call the LLM
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.3,
    )
    # 2) build the document
    doc = {
        "timestamp": int(time.time()),
        "workload_id": workload_id,
        "client_id": CLIENT_ID,
        "request": {
            "model": response.model,
            "messages": messages,
            "temperature": 0.3,
            "max_tokens": 1024,
        },
        "response": response.model_dump(),  # OpenAI python-sdk v1 returns a pydantic model
    }
    # 3) write to Elasticsearch
    es.index(index=ES_INDEX, document=doc, id=str(uuid.uuid4()))

# --- Example usage -----------------------------------------------------------
messages_chat = [{"role": "user", "content": "Hello!"}]
log_chat(WORKLOADS["simple_chat"], messages_chat)

messages_tool = [
    {"role": "user", "content": "Who won the 2024 Super Bowl?"},
    {
        "role": "system",
        "content": "You are a router that decides whether to call the Wikipedia tool or answer directly.",
    },
]
log_chat(WORKLOADS["tool_router"], messages_tool)
```
💡 Streaming responses: the OpenAI SDK delivers tokens incrementally in streaming mode. If your clients stream, you will need to account for that: either buffer the stream and reconstruct a full `response` object before indexing, or modify the Flywheel importer to reconstruct the full response.
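The buffering approach might look like the following. This is a minimal sketch assuming the OpenAI Python SDK v1 streaming interface; production code would also forward tokens to the user as they arrive:

```python
# Buffer a streamed chat completion into a response-shaped dict matching the
# document schema above (illustrative; reuses openai_client from the snippet
# above).
def stream_and_buffer(messages: list[dict]) -> dict:
    stream = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        stream=True,
    )
    chunks = list(stream)  # exhaust the stream; yield tokens to the caller here
    content = "".join(
        chunk.choices[0].delta.content or ""
        for chunk in chunks
        if chunk.choices
    )
    last = chunks[-1]
    # Rebuild a ChatCompletion-shaped payload from the buffered chunks.
    return {
        "id": last.id,
        "object": "chat.completion",
        "created": last.created,
        "model": last.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": last.choices[0].finish_reason if last.choices else "stop",
        }],
    }
```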
The reference implementation already bundles helpers you can reuse:

- `src/scripts/load_test_data.py` – CLI to bulk-load a JSONL file into the Flywheel index.
- `src/tasks/tasks.py::create_datasets` – Celery task that reads logs, deduplicates them, and turns them into evaluation and training datasets.
Feel free to swap Elasticsearch for another store or adjust the mapping – just make sure `create_datasets` can still retrieve documents in the schema above.
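If you prefer not to use the bundled CLI, bulk-loading a JSONL file takes a few lines with the Elasticsearch Python client. A minimal sketch; the file and index names are illustrative:

```python
# Bulk-index a JSONL file (one document per line, in the schema above).
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts=["http://localhost:9200"])

def iter_docs(path: str, index: str):
    with open(path) as f:
        for line in f:
            yield {"_index": index, "_source": json.loads(line)}

helpers.bulk(es, iter_docs("traffic.jsonl", "flywheel"))
```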
The Data Flywheel Foundational Blueprint is a reference implementation of some of the techniques we have explored within NVIDIA. While far from a silver bullet, real-world tests within NVIDIA have found instances where using a Data Flywheel can reduce inference costs by up to 98.6% while maintaining comparable accuracy. These wins tend to center on simpler tool-calling scenarios, where an agent uses a tool call to route between a small set of tools. In the 98.6% example, we took traffic from an internal HR chatbot. This chatbot used a single LLM, `llama-3.1-70b-instruct`, for several purposes:
- Chat
- Query rewriting for RAG
- Summarization
- Tool calling
The Flywheel identified that the tool-calling use case was simple enough that a fine-tuned `llama-3.2-1b-instruct` achieved ~98% accuracy relative to the 70B model used in production.
We have also found instances where `Qwen-2.5-32b-coder` performed as well as `Llama-3.1-70b-instruct` without any fine-tuning, letting us reduce inference costs and time to first token by more than 50% simply by swapping out a NIM.
You can also learn more about Flywheels here:
- Enhance Your AI Agent with Data Flywheels Using NVIDIA NeMo Microservices
- Nvidia Releases NeMo Microservices To Streamline AI Agent Development
- Overview of NeMo Microservices
- Enterprises Onboard AI Teammates Faster With NVIDIA NeMo Tools to Scale Employee Productivity
The blueprint provides the following capabilities out of the box:
- Data Collection and Storage:
- Elasticsearch for logging prompt/completion data
- MongoDB for API and metadata storage
- Redis for task queue management
- Model Integration:
- Support for Meta Llama 3.2 1B Instruct model
- Configurable context length up to 32768 tokens
- Training and Evaluation:
- In-context learning (ICL) with configurable parameters
- LoRA-based fine-tuning support
- Automated data splitting for evaluation
- Deployment Infrastructure:
- Docker Compose setup for development
- Celery workers for background processing
- Health monitoring for core services
- Resource Management:
- Automatic cleanup of running resources during system shutdown
- Manual cleanup scripts for maintenance operations
- Comprehensive error handling and logging
The Data Flywheel Foundational Blueprint empowers organizations to accelerate the optimization of AI models for cost and performance. Rather than offering a one-size-fits-all solution, this blueprint provides a practical framework and proven tools to guide your journey toward more efficient, production-ready models.
- Reference Implementation for Real-World Impact: This blueprint demonstrates the capabilities of NeMo microservices, providing a foundation you can adapt and extend to meet your specific production requirements.
- Streamlined Human Oversight: The Flywheel process is designed to minimize manual intervention. Human review is reserved for evaluating candidate models after automated assessments, eliminating the need for ongoing user feedback or manual labeling.
- Cost and Latency Optimization: The primary focus is on reducing inference costs and latency by distilling larger models into smaller, high-quality alternatives. Future updates will further enhance accuracy and introduce advanced prompt and agent optimization features.
- Iterative, Data-Driven Improvement: Each iteration provides valuable insights, even when a smaller model is not immediately identified. This iterative approach ensures continuous learning and improvement.
- Seamless Integration with Existing Workflows:
  - Designed for teams with existing generative AI applications in production.
  - Easily integrates with your current logging and workload tagging practices.
  - Supports enhanced workload descriptions for improved future classification.
  - Leverages robust infrastructure (including Elasticsearch, MongoDB, Redis, and NeMo microservices) to store data, build datasets, run evaluations, fine-tune models, and re-evaluate results.
To get the most value from the Data Flywheel Foundational Blueprint, ensure you have:
- An existing generative AI application in production.
- Logging of prompt/completion traffic, with workload tagging (such as routes, nodes, or agent steps).
- (Optional, but recommended) Descriptive metadata for each workload to support future classification.
- The ability to deploy and operate supporting infrastructure (Elasticsearch, MongoDB, Redis, and NeMo microservices) for data storage, dataset creation, evaluation, and fine-tuning.
By following this blueprint, you can confidently advance your AI model optimization initiatives, leveraging a process that is transparent, adaptable, and focused on measurable outcomes.
The blueprint purposely keeps the first release simple. Areas we are actively exploring for future versions include:
| Theme | Example Ideas |
|---|---|
| Automated Data Collection | Integrated collection of model inputs/outputs, latency, and metadata |
| Visualization Dashboards | Pre-built Grafana/Kibana dashboards for cost, latency, drift, and accuracy trends |
| Agentic Observability & Prompt Insights | Detect regression, drift, or improvement trends based on performance telemetry |
| Dynamic Configuration Overrides | Runtime overrides for `config.yaml` settings via API or environment variables |
| Data Governance & Privacy | PII redaction pipeline support for logs and datasets; fine-grained RBAC on dataset access and usage |
| Experiment Tracking | Enable experiment tracking tooling for granular tracking of fine-tune runs, metrics, artifacts, and config diffs |
| Hyper-parameter Sweeps | Support launching NeMo microservices hyper-parameter sweeps from external tools (e.g., Flywheel) and pulling results back for analysis and visualization |
| Smarter Dataset Construction | Heuristics or LLM-based parsing to derive eval vs. fine-tune splits from logs; support for DPO/KTO pipelines or filtering by thumbs-up/down signal |
| Model & Backend Extensibility | Add support for additional NIMs such as Qwen, LLaMA-Nemotron, and Mistral; include testing and evaluation support for quantized models |
The blueprint consists of the following implemented components:

- API Layer:
  - FastAPI-based REST endpoints (`src/api/endpoints.py`)
  - Data models and schemas (`src/api/models.py`, `src/api/schemas.py`)
  - Job service for task management (`src/api/job_service.py`)
- Data Storage:
  - Elasticsearch for log storage
  - MongoDB for API data persistence (`src/api/db.py`)
  - Redis for task queue
- Task Processing:
  - Celery workers for background jobs (`src/tasks/tasks.py`)
  - Configurable concurrency and monitoring
- NeMo Microservices Integration:
  - Datastore client for dataset management
  - Model evaluation and customization interfaces
  - Configurable NeMo microservices endpoints
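To make the flow concrete, here is a minimal sketch of how the API layer hands a job to the background worker. The names and route are illustrative, not the Blueprint's exact code (see `src/api/endpoints.py` for the real implementation):

```python
# Illustrative API layer: accept a job request, enqueue it, return immediately.
from uuid import uuid4

from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI()

class JobRequest(BaseModel):
    workload_id: str
    client_id: str

@api.post("/jobs")
def create_job(req: JobRequest) -> dict:
    job_id = str(uuid4())
    # In the real service, a Celery task is enqueued here, e.g.:
    # run_nim_workflow_dag.delay(job_id, req.workload_id, req.client_id)
    return {"id": job_id, "status": "queued"}
```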
For details on the architecture of a Flywheel and the components of this Blueprint, view the Architecture Overview.
| Requirement Type | Details |
|---|---|
| Minimum GPU | Self-hosted LLM judge: 6× NVIDIA H100 or A100 GPUs; remote LLM judge: 2× NVIDIA H100 or A100 GPUs |
| Cluster | Single-node NVIDIA GPU cluster on Linux with cluster-admin permissions |
| Disk Space | At least 200 GB free |
| Software | Python 3.11, Docker Engine, Docker Compose v2 |
| Services | Elasticsearch 8.12.2, MongoDB 7.0, Redis 7.2, FastAPI (API server), Celery (task processing) |
| Resources | Memory: minimum 1 GB (512 MB reserved for Elasticsearch); storage: varies by log volume and model size; network: ports 8000 (API), 9200 (Elasticsearch), 27017 (MongoDB), 6379 (Redis) |
| Development | Docker Compose for local dev with hot reloading; supports macOS (Darwin) and Linux; optional GPU support for model inference |
| Production | Kubernetes cluster (recommended); resources scale with workload; persistent volume support for data storage |
Why only one Flywheel run at a time? When the Flywheel kicks off a run it may need to spin up multiple NIMs and customization jobs, each of which can claim one or more GPUs. The reference implementation does not yet discover the number of free GPUs in the cluster, so it uses a simple guardrail: all invocations of `run_nim_workflow_dag` are serialized.

- The task is bound to a dedicated Celery queue (`parent_queue`). In the `docker-compose.yaml` there is a worker dedicated to this queue whose concurrency is set to `1`. A second worker bound to the default `celery` queue handles other tasks (e.g., evals) in parallel.
- Inside the task we wait for the full DAG to complete via `async_result.get(...)` before returning.
- The call to create a job (i.e., `POST /jobs`) does not block, however; it returns immediately with a job ID.
This ensures that only one Flywheel experiment can allocate GPUs at any given time, preventing accidental overallocation that would lead to failed NIM deployments or customizations.
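In Celery terms, the guardrail boils down to routing the parent task to its own queue and running that queue's worker with a concurrency of one. A minimal sketch under those assumptions; the module and broker names are illustrative:

```python
# tasks.py -- serialize all Flywheel runs through a single-worker queue.
from celery import Celery

app = Celery("flywheel", broker="redis://localhost:6379/0")

# Route only the parent DAG task to the dedicated queue.
app.conf.task_routes = {"tasks.run_nim_workflow_dag": {"queue": "parent_queue"}}

@app.task(name="tasks.run_nim_workflow_dag")
def run_nim_workflow_dag(job_id: str):
    # Spin up NIMs, schedule evals/fine-tunes, and block on the full DAG here.
    ...

# Start the workers (shell):
#   celery -A tasks worker -Q parent_queue --concurrency=1   # one run at a time
#   celery -A tasks worker -Q celery                         # parallel sub-tasks
```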
Roadmap – Automatic GPU introspection and smarter scheduling are planned for a future version of the Blueprint so multiple Flywheel runs can execute in parallel when resources permit.
- Review the Architecture Overview
- Follow the Quickstart Guide to deploy this blueprint
- Explore the full API Specification to understand all available endpoints
- Read the Audience Guide to understand stakeholder responsibilities
- Review Limitations & Best Practices before promoting any model
- Review the Evaluation Types and Metrics to understand available evaluation types.
The following are some of the customizations that you can make after you complete the steps in the Quickstart Guide.
| Category | Description | Available Options |
|---|---|---|
| Environment Variables | Configure the system using environment variables | • Required Variables: `NGC_API_KEY`, `HF_TOKEN` • Optional Variables: `ES_COLLECTION_NAME`, `ELASTICSEARCH_URL`, `MONGODB_URL`, `REDIS_URL` • Configuration: via `.env` file or system environment |
| Model Integration | Configure and deploy LLM models | • Currently Supported: Meta Llama 3.2 1B Instruct • Context Length: up to 32768 tokens • Hardware Config: GPU support (configurable), PVC size (configurable) • Version Control: model tags supported |
| Evaluation Settings | Configure data splitting and evaluation parameters | • Data Split: eval size (default: 20), validation ratio (0.1) • Minimum Records: 50 records required • Reproducibility: optional random seed • ICL Settings: context length (max 32768), reserved tokens (4096), examples (min 1, max 3) |
| Fine-tuning Options | Customize model training | • Training Type: SFT (supervised fine-tuning) • Method: LoRA with configurable parameters • Parameters: epochs (2), batch size (16), learning rate (0.0001) • LoRA Config: adapter dimension (32), dropout (0.1) |
| Data Infrastructure | Configure data storage and processing | • Storage: Elasticsearch for logs • Queue: Redis for task processing • Database: MongoDB for API data • Processing: Celery workers with configurable concurrency |
| Deployment Options | Infrastructure configuration | • Development: Docker Compose with hot reloading • Services: API, Celery worker, Redis, MongoDB, Elasticsearch • Resource Config: network mode, volume mounts, health checks • Environment: configurable URLs and API keys |
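The evaluation settings above imply a simple split procedure. A minimal sketch of that logic under the documented defaults (eval size 20, validation ratio 0.1, minimum 50 records); illustrative, not the Blueprint's exact code:

```python
# Split logged records into eval/train/validation sets (illustrative).
import random

def split_records(
    records: list[dict],
    eval_size: int = 20,
    val_ratio: float = 0.1,
    seed: int | None = None,
):
    if len(records) < 50:
        raise ValueError("At least 50 records are required")
    rng = random.Random(seed)  # optional seed for reproducibility
    shuffled = list(records)
    rng.shuffle(shuffled)
    eval_set = shuffled[:eval_size]
    remainder = shuffled[eval_size:]
    n_val = int(len(remainder) * val_ratio)
    return eval_set, remainder[n_val:], remainder[:n_val]  # eval, train, val
```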
Refer to the Configuration Guide for more information.
1. Install development dependencies:

   `uv sync --dev`

   This command installs all dependencies needed to build the container and run the tests.

2. Start the required services:

   `./scripts/run.sh`

   This starts the services required for testing via Docker Compose.

3. Run the tests:

   - For unit tests (requires MongoDB from Docker Compose): `uv run pytest`
   - For integration tests (with mocked NeMo microservices components): `uv run pytest -m integration`

4. Clean up after development:

   - Stop all services: `./scripts/stop.sh`
   - (Optional) Clear all database volumes: `./scripts/clear_all_volumes.sh`

5. If you modify the API, regenerate `openapi.json`:

   `uv run python scripts/generate_openapi.py`
This NVIDIA AI BLUEPRINT is licensed under the Apache License, Version 2.0. This project will download and install additional third-party open source software projects and containers. Review the license terms of these open source projects before use.
The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/), except that models are governed by the NVIDIA AI Foundation Models Community License Agreement and the NVIDIA dataset is governed by the NVIDIA Asset License Agreement.
Additionally, the Meta/llama-3.1-70b-instruct model is governed by the Llama 3.1 Community License Agreement, and the nvidia/llama-3.2-nv-embedqa-1b-v2 and nvidia/llama-3.2-nv-rerankqa-1b-v2 models are governed by the Llama 3.2 Community License Agreement. Built with Llama.
The Data Flywheel Blueprint is shared as a reference and is provided "as is". Security in the production environment is the responsibility of the end users deploying it. When deploying in a production environment, have security experts review any potential risks and threats, define trust boundaries, implement logging and monitoring capabilities, secure communication channels, and integrate AuthN & AuthZ with appropriate controls.