Deploy this blueprint to create a production-grade autonomous Data Flywheel service that uses the NeMo microservices to continuously discover and promote more efficient models.
Weights & Biases has augmented the Data Flywheel Blueprint to give you instant tracing and visibility into your production AI systems, along with a platform to iterate on them on the path to production. The following services have been enabled with W&B Weave:
- Production Traffic Monitoring
- Model Evaluation Tracking
- Fine-tuning Experiment Management
- Model Artifact Versioning
- Dataset Serialization and Versioning
Together, these give you the data you need to take your implementation of this Flywheel blueprint all the way to production. The integration is activated by setting `WANDB_API_KEY` when deploying the blueprint.
NOTE:
- By default, tracing is logged to your default team in W&B, under the project `data-flywheel`. You can override this by setting `WANDB_PROJECT` as well.
- Ensure that `WANDB_API_KEY` and `WANDB_PROJECT` are set in your environment configuration.
- Model artifacts (including fine-tuned models, LoRA adapters, and evaluation results) are automatically logged to W&B for versioning and tracking.
- Datasets created from Weave traces are serialized and versioned in W&B, enabling reproducible experiments and easy dataset sharing across teams.
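As a minimal sketch of how a client might initialize Weave against the configured project (assuming `WANDB_API_KEY` is already exported and that the default project name mirrors the note above):

```python
# Minimal sketch: initialize a Weave client against the configured project.
# Assumes WANDB_API_KEY is already exported; falls back to the blueprint's
# default project name if WANDB_PROJECT is unset.
import os

import weave

weave.init(project_name=os.environ.get("WANDB_PROJECT", "data-flywheel"))
```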
Data Flywheels are a fledgling concept in Generative AI, but real-world tests within NVIDIA have already identified instances where using a flywheel can reduce inference costs by up to 98.6%. There are caveats to this, which we discuss below, but we believe these early data points warrant attention.
The purpose of this Blueprint is two-fold:
- To provide a production-grade reference implementation of a Data Flywheel on top of the NeMo microservices.
- To educate the community on what Data Flywheels are: what they can do, what they can't do, and what to expect when building them.
You can get started quickly and achieve similar results using your own infrastructure by following the Quickstart guide.
- Data Flywheel Foundational Blueprint with Weights & Biases Weave
A data flywheel is a process that uses data exhaust from production applications (for example, LLM prompt/response logs, end user feedback, and expert labeling) to increase the overall accuracy and reduce the latency/cost of Generative AI systems. A very high level flow looks like this:
```mermaid
flowchart TD
    app[Your App] --prompts/responses/feedback--> logs[Weights & Biases Weave]
    logs --Create Datasets--> orch["Orchestrator"]
    orch --> exp1["Exp #1"]
    orch --> exp2["Exp #2"]
    orch --> expN["Exp #N"]
    exp1 --> results[Weights & Biases Models]
    exp2 --> results
    expN --> results
```
Production traffic from your application is routed to Weights & Biases Weave for comprehensive logging and monitoring. From there, evaluation and fine-tuning datasets are created and used in a series of offline experiments, with results tracked in Weights & Biases Models. The system automatically:
- Logs model artifacts (including fine-tuned models and LoRA adapters) to W&B
- Serializes and versions datasets created from Weave traces
- Tracks experiment metadata and evaluation results
- Links evaluation runs to model artifacts for complete lineage tracking
Anyone who has done this manually knows there are a lot of options to consider when designing the experiments, and many of them depend on the kinds of data you have:
- Which models do you pick?
- In what ways do you curate your datasets?
- Which fine-tuning techniques do you try?
- Which metrics do you use in evals?
It's a lot to decide on. Enter: the NeMo microservices.
The NeMo microservices allow for programmatic control of datasets, fine-tuning, evaluation, and inference. This means that rather than having ML engineers manage each experiment, you can automate the exploration of various configurations using sensible defaults, and then present the most promising candidates to a research engineer or machine learning engineer for further analysis.
```mermaid
flowchart TD
    app["Your application"] --Prompt/completion logs--> log_store["Weights & Biases Weave"]
    log_store --Datasets--> datasets["NeMo Datastore"]
    datasets --"Fine-tuning datasets"--> customizer["NeMo Customizer"]
    datasets --"Eval datasets"--> evaluator["NeMo Evaluator"]
    subgraph NIMs["Loop across ALL NIMs"]
        customizer --"Customized model"--> NIM
        evaluator --> NIM
        NIM --> evaluator
    end
    evaluator --> results["Weights & Biases Models"]
```
In just a few hours, this automated process built on top of NeMo microservices can:
- Pull data from your Weights & Biases Weave logs.
- Group it by task (for example, if you have an agent doing multiple things, each node is a different task).
- De-dup it.
- Create eval and fine-tuning datasets from your production traffic and store them in NeMo Datastore.
- Kick off fine-tuning jobs with NeMo Customizer.
- Run evaluations with LLM-as-judge comparisons on NeMo Evaluator and save the results to Weights & Biases for visualization and analysis.
With reasonable defaults, the system automatically narrows a vast number of possible options down to a manageable set of promising candidates for further analysis, with no manual experiment design required.
👆 This is the key insight of the Data Flywheel Blueprint built on top of the NeMo microservices.
You can scale this process across any number of NIMs by using NeMo Deployment Manager to dynamically start and stop NIMs as needed, so you don't have to keep every NIM running at all times. This cycle can be repeated as frequently as desired: daily, weekly, or on your own schedule.
The strategies in this reference implementation have proven effective in some use cases, but our work is ongoing. Some aspects of this implementation may go against common wisdom:
- Routing production traffic directly into fine-tuning without PII removal
- Building evaluation datasets with no ground truth other than what the production model is responding with
- Not doing any hand-labeling of data
Nonetheless, we've shown it can work. And more importantly, we believe this idea of collecting production traffic prompt/response logs, running automated experiments to explore a huge solution space, and then flagging interesting candidates for further analysis will become an indispensable part of the future Generative AI stack.
Therefore, to effectively use this blueprint:
- **Learn from the reference implementation**
  - Play with the Launchable: Walk through setting up NeMo microservices, deploying the reference services, and exercising the flywheel with the provided sample dataset.
  - Read the code & docs: Finish this README and skim the source to understand how the API layer, background tasks, and NeMo microservices integrations fit together.
- **Prepare your traffic**
  - Instrument production applications: Every distinct LLM call (agent node, route, tool, etc.) must emit a stable `workload_id`. Optionally include a free-form description string (ignored by inference but useful for future workload classification).
  - Export or connect your logs: If you're already logging prompt/response pairs, write a small connector (or use an existing one) to push them into Weights & Biases Weave or directly into the Flywheel API.
- **Choose how to run the Flywheel**
  - Always-on service: Keep the stack running in a shared k8s/VM environment so new traffic is evaluated continuously.
  - Ad-hoc run: Spin it up locally on a workstation with a few GPUs, load a slice of traffic, kick off a job, and shut everything down when the results are in.
- **Kick off a run**
  - Load or stream the tagged traffic into Weights & Biases Weave to launch a job (see the API sketch after this list). The system will spin up the necessary NIMs, schedule evaluations and fine-tunes, and track everything automatically.
- **Interpret the results**
  - The response is grouped by NIM. For each NIM the Flywheel currently runs three experiment types:
    - base – raw production prompts replayed.
    - icl – few-shot prompts built from random traffic examples.
    - customized – a LoRA fine-tune evaluated with the base prompts.
  - Scores are an LLM-as-judge similarity metric in the range `[0, 1]`. Look for high-scoring small models, then:
    - Review the linked evaluation runs in W&B to understand model performance.
    - Use W&B's visualization tools to compare different model versions and their metrics.
- **Keep the human in the loop**
  - Think of the Flywheel as a flashlight, not an autopilot. Promotion to production, as well as any deeper evaluation or dataset curation, remains a human decision.
- **Stay up to date**
  - Follow the repo for UI improvements, new experiment strategies, and future work on accuracy-oriented flywheels.
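As referenced above, the sketch below illustrates one way to kick off and poll a run over the Flywheel REST API. It is illustrative only: the `POST /jobs` endpoint and its non-blocking Job ID response are described later in this README, but the payload fields, the status-polling path, and the response field names shown here are assumptions, so consult the API Specification for the authoritative contract.

```python
# Illustrative sketch only: field names and the polling path are assumptions,
# not the authoritative API contract (see the API Specification).
import time

import requests

FLYWHEEL_API = "http://localhost:8000"  # default API port from the requirements table

# Kick off a run for one tagged workload
resp = requests.post(
    f"{FLYWHEEL_API}/jobs",
    json={"workload_id": "agent.tool_router", "client_id": "my_demo_app"},
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json()["id"]  # the create call returns immediately with a Job ID

# Poll until the run finishes, then inspect the per-NIM results
while True:
    job = requests.get(f"{FLYWHEEL_API}/jobs/{job_id}", timeout=30).json()
    if job.get("status") in {"completed", "failed"}:
        break
    time.sleep(60)

print(job)
```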
The Flywheel treats your production prompt / completion logs as the single source of truth. At run-time it only needs to know where to find the logs (e.g. Weights & Biases Weave) and how the individual documents are shaped. Since this is a reference implementation, you can modify the code to suit your needs, but the current schemas are defined below should you decide to conform to them.
Each log entry document must contain the following top-level keys:
| Field | Type | Description |
|---|---|---|
| `timestamp` | `int` (epoch secs) | Time the request was issued |
| `workload_id` | `str` | Stable identifier for the logical task / route / agent node |
| `client_id` | `str` | Identifier of the application or deployment that generated traffic |
| `request` | `dict` | Exact `openai.ChatCompletion.create` payload received by the model |
| `response` | `dict` | Exact `ChatCompletion` response returned by the model |
A minimal example document therefore looks like:
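(The values below are illustrative placeholders assembled from the schema above; the exact `request`/`response` contents mirror whatever your OpenAI-compatible client sends and receives.)

```json
{
  "timestamp": 1715024400,
  "workload_id": "agent.tool_router",
  "client_id": "my_demo_app",
  "request": {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Who won the 2024 Super Bowl?"}],
    "temperature": 0.3
  },
  "response": {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "model": "gpt-3.5-turbo",
    "choices": [
      {
        "index": 0,
        "message": {"role": "assistant", "content": "The Kansas City Chiefs."},
        "finish_reason": "stop"
      }
    ]
  }
}
```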
Why this shape? Keeping the full request/response allows the Flywheel to replay prompts, build few-shot demonstrations, and fine-tune without lossy conversions.
**Tagging matters**
- `client_id` is meant to identify who produced the traffic (for example a specific micro-service, environment, or customer). Multiple workloads can share the same `client_id`.
- `workload_id` is much stricter: it represents a single type of request. If your application is an agent with several nodes you must assign a different `workload_id` to every node so the Flywheel can evaluate them independently. Treat it as the primary key for slicing, deduplication, and dataset creation.
If you already write request/response logs, you can either route that traffic to Weights & Biases Weave using the Weave client or bulk import it using the provided helper scripts. For new projects, the snippet below shows how a synchronous OpenAI call can be wrapped so every interaction is logged in the expected format:
```python
# examples/log_to_weave.py
import time

import weave
from openai import OpenAI

# Initialize the Weights & Biases Weave client
weave.init(project_name="your-project-name")

openai_client = OpenAI()

CLIENT_ID = "my_demo_app"

# Example agent nodes (each with its own workload_id)
WORKLOADS = {
    "simple_chat": "agent.chat",
    "tool_router": "agent.tool_router",
}


# This op traces every chat interaction in Weave
@weave.op()
def log_chat(
    workload_id: str,
    messages: list[dict],
    model: str = "gpt-3.5-turbo",
    temperature: float = 0.3,
    max_tokens: int = 1024,
):
    # 1) call the LLM
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )

    # 2) build the document in the schema the Flywheel expects
    doc = {
        "timestamp": int(time.time()),
        "workload_id": workload_id,
        "client_id": CLIENT_ID,
        "request": {
            "model": response.model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        "response": response.model_dump(),  # OpenAI Python SDK v1 returns a pydantic model
    }

    # 3) the @weave.op decorator automatically logs the inputs and returned document to Weave
    return doc


# --- Example usage -----------------------------------------------------------
messages_chat = [{"role": "user", "content": "Hello!"}]
log_chat(WORKLOADS["simple_chat"], messages_chat)

messages_tool = [
    {"role": "user", "content": "Who won the 2024 Super Bowl?"},
    {
        "role": "system",
        "content": "You are a router that decides whether to call the Wikipedia tool or answer directly.",
    },
]
log_chat(WORKLOADS["tool_router"], messages_tool)
```
💡 **Streaming responses**: the OpenAI SDK delivers tokens incrementally in streaming mode. If your clients use streaming, you will need to take that into account and either buffer the stream and reconstruct a full `response` object before logging, or modify the Flywheel importer to reconstruct the full response.
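One possible approach is sketched below. It builds on the `openai_client`, `CLIENT_ID`, and imports from the snippet above; the reconstructed `response` payload is deliberately minimal and an assumption, since the exact fields your importer needs may differ:

```python
# Sketch only: buffer a streamed completion and rebuild a minimal
# ChatCompletion-style payload before logging it to Weave.
@weave.op()
def log_streamed_chat(workload_id: str, messages: list[dict], model: str = "gpt-3.5-turbo"):
    stream = openai_client.chat.completions.create(model=model, messages=messages, stream=True)

    # Drain the stream and concatenate the delta tokens into the full answer
    text = "".join(chunk.choices[0].delta.content or "" for chunk in stream if chunk.choices)

    return {
        "timestamp": int(time.time()),
        "workload_id": workload_id,
        "client_id": CLIENT_ID,
        "request": {"model": model, "messages": messages},
        # Minimal reconstructed response; a production importer may also expect
        # finish_reason, usage, and id fields carried over from the chunks.
        "response": {
            "model": model,
            "choices": [{"index": 0, "message": {"role": "assistant", "content": text}}],
        },
    }
```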
The reference implementation already bundles helpers you can reuse:

- `src/scripts/load_test_data_weave.py` – CLI to bulk-load a JSONL file into Weights & Biases Weave.
- `src/tasks/tasks.py::create_datasets` – Celery task that reads logs, deduplicates them, and turns them into evaluation & training datasets.

Feel free to adjust the logging integration to suit your needs – just make sure `create_datasets` can still retrieve documents in the schema above.
The Data Flywheel Foundational Blueprint is a reference implementation of some of the techniques we have explored within NVIDIA. While far from a silver bullet, real-world tests within NVIDIA have found instances where using a Data Flywheel can reduce inference costs by up to 98.6% while maintaining comparable accuracy. These use cases tend to center around simpler tool-calling scenarios where an agent uses a tool call to route between a small set of tools. In the 98.6% example, we took traffic from an internal HR chatbot. This chatbot used an LLM, `llama-3.1-70b-instruct`, for several purposes:
- Chat
- Query rewriting for RAG
- Summarization
- Tool calling
The Flywheel identified that the tool-calling use case was simple enough that a fine-tuned `llama-3.2-1b-instruct` was able to achieve ~98% accuracy relative to the 70B model being used in production.
We have also found instances where `Qwen-2.5-32b-coder` did as well as `Llama-3.1-70b-instruct` without any fine-tuning, and so we were able to quickly reduce inference costs and time to first token by >50% by swapping out a NIM.
You can also learn more about Flywheels here:
- Enhance Your AI Agent with Data Flywheels Using NVIDIA NeMo Microservices
- Nvidia Releases NeMo Microservices To Streamline AI Agent Development
- Overview of NeMo Microservices
- Enterprises Onboard AI Teammates Faster With NVIDIA NeMo Tools to Scale Employee Productivity
- Data Collection and Storage:
  - Weights & Biases Weave for logging prompt/completion data and production monitoring
  - MongoDB for API and metadata storage
  - Redis for task queue management
- Model Integration:
  - Support for Meta Llama 3.2 1B Instruct model
  - Configurable context length up to 32768 tokens
- Training and Evaluation:
  - In-context learning (ICL) with configurable parameters
  - LoRA-based fine-tuning support
  - Automated data splitting for evaluation
- Deployment Infrastructure:
  - Docker Compose setup for development
  - Celery workers for background processing
  - Health monitoring for core services
- Experiment Tracking:
  - Weights & Biases integration for experiment tracking and evaluation results visualization
  - Model registry and dataset versioning in Weights & Biases Models
  - Rich visualization of results and metrics from evaluations
The Data Flywheel Foundational Blueprint empowers organizations to accelerate the optimization of AI models for cost and performance. Rather than offering a one-size-fits-all solution, this blueprint provides a practical framework and proven tools to guide your journey toward more efficient, production-ready models.
- **Reference Implementation for Real-World Impact**: This blueprint demonstrates the capabilities of NeMo microservices, providing a foundation you can adapt and extend to meet your specific production requirements.
- **Streamlined Human Oversight**: The Flywheel process is designed to minimize manual intervention. Human review is reserved for evaluating candidate models after automated assessments, eliminating the need for ongoing user feedback or manual labeling.
- **Cost and Latency Optimization**: The primary focus is on reducing inference costs and latency by distilling larger models into smaller, high-quality alternatives. Future updates will further enhance accuracy and introduce advanced prompt and agent optimization features.
- **Iterative, Data-Driven Improvement**: Each iteration provides valuable insights, even when a smaller model is not immediately identified. This iterative approach ensures continuous learning and improvement.
- **Seamless Integration with Existing Workflows**:
  - Designed for teams with existing generative AI applications in production.
  - Easily integrates with your current logging and workload tagging practices.
  - Supports enhanced workload descriptions for improved future classification.
  - Leverages robust infrastructure (including Weights & Biases Weave for production monitoring, Weights & Biases Models for experiment tracking, MongoDB, Redis, and NeMo microservices) to store data, build datasets, run evaluations, fine-tune models, and re-evaluate results.
To get the most value from the Data Flywheel Foundational Blueprint, ensure you have:
- An existing generative AI application in production.
- Logging of prompt/completion traffic, with workload tagging (such as routes, nodes, or agent steps).
- (Optional, but recommended) Descriptive metadata for each workload to support future classification.
- The ability to deploy and operate supporting infrastructure (Weights & Biases Weave, MongoDB, Redis, and NeMo microservices) for data storage, dataset creation, evaluation, and fine-tuning.
By following this blueprint, you can confidently advance your AI model optimization initiatives, leveraging a process that is transparent, adaptable, and focused on measurable outcomes.
The blueprint purposely keeps the first release simple. Areas we are actively exploring for future versions include:
| Theme | Example Ideas |
|---|---|
| Automated Data Collection | Integrated collection of model inputs/outputs, latency, and metadata |
| Visualization Dashboards | Enhanced W&B dashboards for cost, latency, drift, and accuracy trends |
| Agentic Observability & Prompt Insights | Detect regression, drift, or improvement trends based on performance telemetry |
| Dynamic Configuration Overrides | Runtime overrides for `config.yaml` settings via API or environment variables |
| Data Governance & Privacy | PII redaction pipeline support for logs and datasets; fine-grained RBAC on dataset access and usage |
| Experiment Tracking | Enable experiment tracking tooling for granular tracking of fine-tune runs, metrics, artifacts, and config diffs |
| Hyper-parameter Sweeps | Support launching NeMo microservices hyper-parameter sweeps from external tools (e.g. Flywheel) and pulling results back for analysis and visualization |
| Smarter Dataset Construction | Heuristics or LLM-based parsing to derive eval vs. fine-tune splits from logs; support for DPO/KTO pipelines or filtering by thumbs-up/down signal |
| Model & Backend Extensibility | Add support for additional NIMs such as Qwen, LLaMA-Nemotron, and Mistral; include testing and evaluation support for quantized models |
The blueprint consists of the following implemented components:
- API Layer:
  - FastAPI-based REST endpoints (`src/api/endpoints.py`)
  - Data models and schemas (`src/api/models.py`, `src/api/schemas.py`)
  - Job service for task management (`src/api/job_service.py`)
- Data Storage:
  - Weights & Biases Weave for log storage and production monitoring
  - Weights & Biases Models for experiment tracking
  - MongoDB for API data persistence (`src/api/db.py`)
  - Redis for task queue
- Task Processing:
  - Celery workers for background jobs (`src/tasks/tasks.py`)
  - Configurable concurrency and monitoring
- NeMo Microservices Integration:
  - Datastore client for dataset management
  - Model evaluation and customization interfaces
  - Configurable NeMo microservices endpoints
- Experiment Tracking:
  - Weights & Biases integration for tracking experiments and visualizing evaluation results (`src/lib/integration/wandb_callback.py`)
  - Model registry for versioning and tracking (`src/lib/nemo/model_manager.py`)
For details on the architecture of a Flywheel and the components of this Blueprint, view the Architecture Overview.
| Requirement Type | Details |
|---|---|
| Minimum GPU | Self-hosted LLM Judge: 6× NVIDIA H100 or A100 GPUs<br>Remote LLM Judge: 2× NVIDIA H100 or A100 GPUs |
| Cluster | Single-node NVIDIA GPU cluster on Linux with cluster-admin permissions |
| Disk Space | At least 200 GB free |
| Software | Python 3.11<br>Docker Engine<br>Docker Compose v2 |
| Services | Weights & Biases Weave<br>MongoDB 7.0<br>Redis 7.2<br>FastAPI (API server)<br>Celery (task processing) |
| Resource | Minimum Memory: 1 GB<br>Storage: Varies by log volume/model size<br>Network: Ports 8000 (API), 27017 (MongoDB), 6379 (Redis) |
| Development | Docker Compose for local dev with hot reloading<br>Supports macOS (Darwin) and Linux<br>Optional: GPU support for model inference |
| Production | Kubernetes cluster (recommended)<br>Resources scale with workload<br>Persistent volume support for data storage |
**Why only one Flywheel run at a time?** When the Flywheel kicks off a run it may need to spin up multiple NIMs and customization jobs, each of which can claim one or more GPUs. The reference implementation does not yet discover the number of free GPUs in the cluster, so it uses a simple guardrail: all invocations of `run_nim_workflow_dag` are serialized.

- The task is bound to a dedicated Celery queue (`parent_queue`). In the `docker-compose.yaml` there is a worker dedicated to this queue whose concurrency is set to `1`. There is a second worker bound to the default `celery` queue which can handle running other tasks (e.g. evals) in parallel.
- Inside the task we wait for the full DAG to complete via `async_result.get(...)` before returning.
- The call to create a job (i.e. `POST /jobs`) will not block, however. It will return immediately with a Job ID.
This ensures that only one Flywheel experiment can allocate GPUs at any given time, preventing accidental overallocation that would lead to failed NIM deployments or customizations.
Roadmap – Automatic GPU introspection and smarter scheduling are planned for a future version of the Blueprint so multiple Flywheel runs can execute in parallel when resources permit.
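To make the guardrail above concrete, here is a minimal, hedged sketch of the pattern. The module name, broker URL, and task registration shown are placeholders rather than the blueprint's actual layout; the point is simply that the long-running DAG task is routed to `parent_queue` and consumed by a single-concurrency worker, while everything else stays on the default `celery` queue.

```python
# Sketch of the serialization guardrail described above (names are placeholders).
from celery import Celery

app = Celery("flywheel", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")

# Route the parent DAG task to its own queue. A dedicated worker consumes it:
#   celery -A flywheel_app worker -Q parent_queue --concurrency=1
# while a second worker serves the default queue for evals and other tasks:
#   celery -A flywheel_app worker -Q celery
app.conf.task_routes = {"run_nim_workflow_dag": {"queue": "parent_queue"}}


@app.task(name="run_nim_workflow_dag", bind=True)
def run_nim_workflow_dag(self, job_id: str):
    # Spin up NIMs, schedule evaluations/customizations, and block on the DAG
    # (async_result.get(...)) so the single-concurrency worker effectively holds
    # the "one run at a time" lock until the whole run has finished.
    ...
```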
The blueprint provides comprehensive integration with Weights & Biases:
- Model Artifact Management
  - Automatic logging of fine-tuned models and LoRA adapters
  - Version tracking for all model artifacts
  - Metadata storage including model specs, configurations, and performance metrics
  - Linking of evaluation runs to model artifacts for complete lineage
- Dataset Management
  - Serialization of datasets from Weave traces
  - Version control for training, validation, and evaluation datasets
  - Automatic dataset splitting and formatting
  - Support for both raw and processed dataset versions
- Experiment Tracking
  - Detailed logging of evaluation metrics and scores
  - Customizable tags and metadata for experiment organization
  - Real-time progress monitoring and visualization
  - Integration with W&B's experiment comparison tools
- Production Monitoring
  - Comprehensive logging of production traffic
  - Performance metrics tracking
  - Automatic dataset creation from production logs
  - Support for custom metrics and visualizations
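For example, once a run has logged its artifacts, you can pull a fine-tuned adapter or a serialized dataset back down with the standard W&B public API. The entity, project, and artifact names below are placeholders; use the names shown in your W&B workspace:

```python
# Sketch: retrieve a logged model artifact and dataset version from W&B.
# Entity, project, and artifact names are placeholders.
import wandb

api = wandb.Api()

# Download a specific fine-tuned adapter version for inspection
model_art = api.artifact("my-team/data-flywheel/my-lora-adapter:v3", type="model")
local_dir = model_art.download()
print("model artifact downloaded to", local_dir)

# Inspect the evaluation dataset that was serialized from Weave traces
dataset_art = api.artifact("my-team/data-flywheel/eval-dataset:latest", type="dataset")
print(dataset_art.metadata)
```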
- Review the Architecture Overview
- Follow the Quickstart Guide to deploy this blueprint
- Explore the full API Specification to understand all available endpoints
- Read the Audience Guide to understand stakeholder responsibilities
- Review Limitations & Best Practices before promoting any model
- Review the Evaluation Types and Metrics to understand available evaluation types.
The following are some of the customizations that you can make after you complete the steps in the Quickstart Guide.
| Category | Description | Available Options |
|---|---|---|
| Environment Variables | Configure system using environment variables | • Required Variables: `NGC_API_KEY`, `HF_TOKEN`, `WANDB_API_KEY`<br>• Optional Variables: `WANDB_PROJECT`, `WANDB_ENTITY`, `MONGODB_URL`, `REDIS_URL`<br>• Configuration: Via `.env` file or system environment |
| Model Integration | Configure and deploy LLM models | • Currently Supported: Meta Llama 3.2 1B Instruct<br>• Context Length: Up to 32768 tokens<br>• Hardware Config: GPU support (configurable), PVC size (configurable)<br>• Version Control: Model tags supported |
| Evaluation Settings | Configure data splitting and evaluation parameters | • Data Split: Eval size (default: 20), validation ratio (0.1)<br>• Minimum Records: 50 records required<br>• Reproducibility: Optional random seed<br>• ICL Settings: Context length (max 32768), reserved tokens (4096), examples (min 1, max 3) |
| Fine-tuning Options | Customize model training | • Training Type: SFT (Supervised Fine-Tuning)<br>• Method: LoRA with configurable parameters<br>• Parameters: epochs (2), batch size (16), learning rate (0.0001)<br>• LoRA Config: adapter dimension (32), dropout (0.1) |
| Data Infrastructure | Configure data storage and processing | • Storage: Weights & Biases Weave for logs<br>• Queue: Redis for task processing<br>• Database: MongoDB for API data<br>• Processing: Celery workers with configurable concurrency |
| Experiment Tracking | Configure experiment tracking | • Weights & Biases: Project name, entity, API key<br>• Model Registry: Tracking model artifacts<br>• Metrics: Customizable metrics logging for evaluation results |
| Deployment Options | Infrastructure configuration | • Development: Docker Compose with hot reloading<br>• Services: API, Celery Worker, Redis, MongoDB<br>• Resource Config: Network mode, volume mounts, health checks<br>• Environment: Configurable URLs and API keys |
| W&B Configuration | Configure W&B integration | • Project Settings: Project name, entity, API key<br>• Artifact Management: Model and dataset versioning<br>• Experiment Tracking: Custom metrics and visualizations<br>• Production Monitoring: Traffic logging and analysis |
Refer to the Configuration Guide for more information.
- Install development dependencies with `uv sync --dev`. This command installs all dependencies needed to build the container and run the tests.
- Start required services with `./scripts/run.sh`. This starts the necessary services via Docker Compose that are required for testing.
- Run the tests:
  - For unit tests (requires MongoDB from Docker Compose): `uv run pytest`
  - For integration tests (with mocked NeMo microservices components): `uv run pytest -m integration`
- Clean up after development:
  - Stop all services: `./scripts/stop.sh`
  - (Optional) Clear all database volumes: `./scripts/clear_all_volumes.sh`
- If you modify the API, regenerate `openapi.json` with `uv run python scripts/generate_openapi.py`.
This NVIDIA AI BLUEPRINT is licensed under the Apache License, Version 2.0. This project will download and install additional third-party open source software projects and containers. Review the license terms of these open source projects before use.
The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/), except that models are governed by the AI Foundation Models Community License Agreement (found at NVIDIA Agreements | Enterprise Software | NVIDIA Community Model License) and NVIDIA dataset is governed by the NVIDIA Asset License Agreement found here.
For the Meta/llama-3.1-70b-instruct model, the Llama 3.1 Community License Agreement applies; for the nvidia/llama-3.2-nv-embedqa-1b-v2 model, the Llama 3.2 Community License Agreement applies; and for the nvidia/llama-3.2-nv-rerankqa-1b-v2 model, the Llama 3.2 Community License Agreement applies. Built with Llama.
The Data Flywheel Blueprint is shared as a reference and is provided "as is". Security in the production environment is the responsibility of the end users deploying it. When deploying in a production environment, have security experts review any potential risks and threats; define the trust boundaries; implement logging and monitoring capabilities; secure the communication channels; and integrate AuthN & AuthZ with appropriate controls.