Below is a structured, 30-hour (1 hour/day for ~1 month) curriculum designed to get you from basic ONNX exports to deploying a Transformer-based TTS model in the browser. Each “stage” corresponds to a set of consecutive daily tasks. The tasks are broken down so that you can typically complete each in about one hour of focused effort.
- Stage 1 (Days 1–6): ONNX Fundamentals with a Simple MLP
- Stage 2 (Days 7–12): Convolutional Model Export (ResNet) and Optimization
- Stage 3 (Days 13–18): Transformer Basics (DistilBERT) and Dynamic Axes
- Stage 4 (Days 19–24): TTS Pipeline Components (Tacotron2/Glow-TTS & Vocoder)
- Stage 5 (Days 25–30): Deploying TTS in the Browser (ONNX Runtime Web)
By the end, you will have all the key skills—exporting various model types (MLP, CNN, Transformer, TTS) to ONNX, verifying in Python, optimizing, and finally running inference in a JavaScript browser environment.
Stage 1 (Days 1–6): ONNX Fundamentals with a Simple MLP

Goal
- Understand ONNX basics: how to export a simple model, load it in Python (ONNX Runtime), and verify correctness.
Suggested Model
- A small feed-forward MLP in PyTorch (2–3 Linear layers, ReLU activations).
- Install & Setup:
  - Install PyTorch, `onnx`, `onnxruntime`, and a visualization tool like Netron.
  - Skim the PyTorch ONNX Export Tutorial to see the basic workflow.
- Implement/Load an MLP:
  - In PyTorch, code a simple MLP model (e.g., input=10, hidden=20, output=2).
  - Optionally train briefly on random data to confirm it runs (or just define the model architecture).
- Export to ONNX:
  - Use `torch.onnx.export(...)` with a dummy input tensor of the correct shape (see the sketch below).
  - Save the exported model as `mlp.onnx`.
  - Pay attention to `opset_version`, `input_names`, `output_names`, and `dynamic_axes` if you plan to handle variable batch sizes.
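A minimal export sketch, assuming the layer sizes suggested above; the tensor names and opset version here are arbitrary choices, not requirements:

```python
import torch
import torch.nn as nn

# A small MLP matching the suggested sizes (input=10, hidden=20, output=2).
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 2),
        )

    def forward(self, x):
        return self.net(x)

model = MLP().eval()
dummy_input = torch.randn(1, 10)  # batch of 1, feature dim 10

torch.onnx.export(
    model,
    dummy_input,
    "mlp.onnx",
    opset_version=17,                     # pick an opset your runtime supports
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"},  # allow variable batch size
                  "output": {0: "batch"}},
)
```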
- Verify in ONNX Runtime:
  - Load `mlp.onnx` in Python with `onnxruntime.InferenceSession(...)`.
  - Run inference on random inputs and compare the output to your original PyTorch model.
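One way to do the comparison, reusing `model` and the `"input"` name from the export sketch above:

```python
import numpy as np
import onnxruntime as ort
import torch

# Same random batch through both frameworks.
x = torch.randn(4, 10)

with torch.no_grad():
    torch_out = model(x).numpy()  # `model` is the MLP from the export sketch

session = ort.InferenceSession("mlp.onnx")
onnx_out = session.run(None, {"input": x.numpy()})[0]  # "input" matches input_names at export

np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-4, atol=1e-5)
print("PyTorch and ONNX Runtime outputs match within tolerance.")
```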
- Debug & Validate:
  - Use `onnx.checker.check_model(...)` to ensure validity.
  - Visualize the ONNX graph in Netron to see how your layers have been exported.
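A quick validity check before opening Netron:

```python
import onnx

model_proto = onnx.load("mlp.onnx")
onnx.checker.check_model(model_proto)                   # raises if the graph is invalid
print(onnx.helper.printable_graph(model_proto.graph))   # rough textual view; Netron gives a nicer one
```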
- Recap & Document:
  - Summarize any issues you encountered (e.g., shape mismatches).
  - Confirm you understand each step: define → export → load → verify.
End-of-Stage 1 Outcome
- You can confidently export a basic feed-forward network to ONNX and verify it in Python.
Stage 2 (Days 7–12): Convolutional Model Export (ResNet) and Optimization

Goal
- Learn to handle more complex ops (convolutions, batch norm, etc.) and explore basic ONNX optimization.
Suggested Model
- ResNet18 from TorchVision (pretrained on ImageNet).
- Compare your exported model to the ONNX Model Zoo’s ResNet for reference.
- Load Pretrained ResNet:
  - Load it with `resnet18 = torchvision.models.resnet18(pretrained=True)` in PyTorch.
  - Test a quick inference in Python to confirm it works.
- Export to ONNX:
  - Similar steps as before: `torch.onnx.export(...)` (sketch below).
  - Use dynamic axes for the batch dimension if desired.
  - Save as `resnet18.onnx`.
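A sketch of the export, assuming a recent TorchVision (which prefers the `weights=` argument over the deprecated `pretrained=True`):

```python
import torch
import torchvision

# Newer TorchVision prefers `weights=`; `pretrained=True` still works but is deprecated.
resnet18 = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.DEFAULT
).eval()

dummy = torch.randn(1, 3, 224, 224)  # standard ImageNet-sized input

torch.onnx.export(
    resnet18,
    dummy,
    "resnet18.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
```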
- Compare to Official ONNX Model:
  - Download the official `resnet18` (or `resnet50`) from the ONNX Model Zoo.
  - In Netron, compare the structure/ops.
  - Verify outputs with the same test image to check for close numerical matches.
- Shape Inference & Simplification:
  - Install/use onnx-simplifier to reduce graph complexity.
  - `onnxruntime` can also do some optimizations; explore `onnx.shape_inference.infer_shapes(...)`.
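A sketch of both steps, assuming onnx-simplifier is installed (the pip package provides the `onnxsim` module):

```python
import onnx
from onnx import shape_inference
from onnxsim import simplify  # pip install onnx-simplifier

model = onnx.load("resnet18.onnx")

# Propagate tensor shapes through the graph (handy when inspecting in Netron).
inferred = shape_inference.infer_shapes(model)

# onnx-simplifier folds constants and removes redundant nodes where it can.
simplified, ok = simplify(model)
assert ok, "Simplified model failed the validation check"
onnx.save(simplified, "resnet18_simplified.onnx")
```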
- Optional: Quantization:
  - Investigate dynamic or static quantization.
  - Compare the performance difference in ONNX Runtime between the original and a quantized model.
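Dynamic quantization is the quickest to try; for convolution-heavy models like ResNet, static quantization (with calibration data) usually gives the larger speedup:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights stored as int8, activations quantized at runtime.
quantize_dynamic(
    model_input="resnet18_simplified.onnx",
    model_output="resnet18_int8.onnx",
    weight_type=QuantType.QInt8,
)
```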
- Performance & Summary:
  - Measure inference speed in Python for the original vs. simplified vs. quantized (if done).
  - Document your findings.
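A simple timing loop for the comparison; the file names assume the earlier sketches, so adjust to whichever variants you actually produced:

```python
import time
import numpy as np
import onnxruntime as ort

def benchmark(path, runs=50):
    session = ort.InferenceSession(path)
    x = np.random.randn(1, 3, 224, 224).astype(np.float32)
    input_name = session.get_inputs()[0].name
    session.run(None, {input_name: x})  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {input_name: x})
    return (time.perf_counter() - start) / runs

for path in ["resnet18.onnx", "resnet18_simplified.onnx", "resnet18_int8.onnx"]:
    print(path, f"{benchmark(path) * 1000:.1f} ms / inference")
```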
End-of-Stage 2 Outcome
- You can export a pretrained ResNet to ONNX, compare to a reference, simplify it, and potentially improve runtime performance.
Stage 3 (Days 13–18): Transformer Basics (DistilBERT) and Dynamic Axes

Goal
- Get comfortable exporting Transformer architectures.
- Handle dynamic sequence lengths and advanced ops like Multi-Head Attention.
Suggested Model
- DistilBERT from Hugging Face Transformers.
- Smaller and simpler than full BERT, but representative of typical Transformer ops.
- Set Up Hugging Face Transformers:
  - Install `transformers` via pip.
  - `from transformers import DistilBertModel, DistilBertTokenizer`
  - Load the pretrained `distilbert-base-uncased` model and tokenizer.
- Test Inference in Python:
  - Tokenize a sample sentence (e.g. `"Hello world!"`).
  - Pass it through DistilBERT in PyTorch.
  - Inspect output shapes (last hidden states, etc.).
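A minimal sketch of the setup and test-inference steps above:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()

inputs = tokenizer("Hello world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```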
- Export to ONNX:
  - Use the `transformers.onnx` CLI (`python -m transformers.onnx`) or manually call `torch.onnx.export(...)`.
  - Ensure you specify dynamic axes for the token dimension so it can handle variable sentence lengths.
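If you go the manual route, a sketch might look like the following; the CLI handles much of this configuration for you, and the tensor names here are just conventional choices:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()
sample = tokenizer("Hello world!", return_tensors="pt")

torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "distilbert.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
)
```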
- Verify with ONNX Runtime:
  - Compare the DistilBERT ONNX inference outputs with the PyTorch outputs for the same inputs.
  - Check for numerical closeness (small floating-point differences are normal).
- Optimize:
  - Use `onnxruntime.transformers` to fuse attention and layer-norm subgraphs.
  - Measure speedups with `onnxruntime.InferenceSession(..., providers=["CPUExecutionProvider"])` or GPU if available.
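A sketch using the `onnxruntime.transformers` optimizer; the `bert` model type also covers DistilBERT, and the head count and hidden size below are the DistilBERT-base values:

```python
from onnxruntime.transformers import optimizer

# Fuse attention / layer-norm subgraphs into faster fused operators.
optimized = optimizer.optimize_model(
    "distilbert.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
optimized.save_model_to_file("distilbert_optimized.onnx")
```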
- Wrap Up & Document:
  - Summarize the export process.
  - Note any important flags for Transformers (e.g., opset version, sequence length constraints).
End-of-Stage 3 Outcome
- You have a working DistilBERT ONNX model with dynamic axes, tested in Python, and optionally optimized.
Stage 4 (Days 19–24): TTS Pipeline Components (Tacotron2/Glow-TTS & Vocoder)

Goal
- Learn how TTS is often split into two or more models (text → mel spectrogram → waveform).
- Export and chain TTS models in ONNX.
Suggested Models
- Tacotron 2 or Glow-TTS for acoustic model (mel-spectrogram generation).
- WaveGlow or HiFi-GAN for the vocoder (mel-spectrogram → waveform).
You can find pretrained checkpoints and partial ONNX export references in the model authors' repositories (e.g., NVIDIA's DeepLearningExamples on GitHub).
- Choose & Download Pretrained Models:
  - For example, a pretrained Tacotron 2 model and a WaveGlow model from NVIDIA's GitHub.
  - Familiarize yourself with the TTS pipeline: text input → phoneme/text encoder → mel spectrogram → audio.
- Acoustic Model Export:
  - If using Tacotron 2, adapt a known ONNX export script or write your own `torch.onnx.export(...)`.
  - Pay attention to dynamic shapes (the length of text tokens, output mel frames).
- Vocoder Export:
  - Export WaveGlow or HiFi-GAN to ONNX.
  - Again, manage dynamic shapes if needed (the number of mel frames can vary).
- Chain in Python:
  - Run text → mel with your ONNX acoustic model.
  - Pass the resulting mel spectrograms into your ONNX vocoder to get waveforms (see the sketch below).
  - Save the waveforms to a `.wav` file to confirm correctness.
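A heavily simplified chaining sketch. The tensor names (`text_ids`, `mel`), file names, and single-session acoustic model are assumptions: real Tacotron 2 exports are usually split into encoder/decoder/postnet sub-graphs, so check your exported files in Netron and adapt accordingly.

```python
import numpy as np
import onnxruntime as ort
from scipy.io import wavfile

acoustic = ort.InferenceSession("tacotron2.onnx")  # or glow_tts.onnx
vocoder = ort.InferenceSession("hifigan.onnx")     # or waveglow.onnx

# Placeholder token IDs -- in practice, encode text with the model's own symbol table.
token_ids = np.array([[12, 7, 42, 3]], dtype=np.int64)

mel = acoustic.run(None, {"text_ids": token_ids})[0]  # assumed shape: (1, n_mels, n_frames)
audio = vocoder.run(None, {"mel": mel})[0]            # assumed shape: (1, n_samples)

# 22.05 kHz is the typical sample rate for these pretrained models.
wavfile.write("output.wav", 22050, audio.squeeze().astype(np.float32))
```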
- Compare with Original Models:
  - Use the same text input on the original PyTorch models.
  - Compare the mel spectrogram and final audio for any major discrepancies.
- Refine & Document:
  - Note any performance issues, shape alignment pitfalls, or memory usage considerations.
  - If the pipeline is stable, you have your TTS modules in ONNX!
End-of-Stage 4 Outcome
- You have a working two-step TTS pipeline (acoustic + vocoder) fully exported to ONNX and tested in Python.
Stage 5 (Days 25–30): Deploying TTS in the Browser (ONNX Runtime Web)

Goal
- Learn to run ONNX models entirely in JavaScript/TypeScript via WebAssembly/WebGL.
- Integrate TTS inference into a simple web app that can play audio.
Key Tools
- ONNX Runtime Web
- Web Audio API for playback.
- Set Up a Simple Web Project:
  - Initialize an npm project.
  - Install `onnxruntime-web` (its WebAssembly backend runs everywhere; WebGL is available as an alternative execution provider).
  - Start with a plain HTML + JS or a small React/Vue app.
- Test a Simple ONNX Model:
  - First, try loading the `resnet18.onnx` from Stage 2 in the browser to confirm your environment is configured.
  - Use an image input (or random data) to verify it returns an inference result.
- Load Acoustic & Vocoder Models:
  - Copy your TTS `.onnx` files from Stage 4 into the web app's static/public folder.
  - Load them with ONNX Runtime Web (`session = await InferenceSession.create('tacotron2.onnx')`, etc.).
- Implement the Text-to-Mel Step:
  - Tokenize/encode text in JavaScript (a simple approach or a precompiled JSON mapping) to pass into the acoustic model.
  - Receive the mel spectrogram as an output `Float32Array`.
- Waveform Generation & Audio Playback:
  - Pass the mel spectrogram to the vocoder session.
  - Convert the output waveform to a playable audio buffer.
  - Use the Web Audio API to play the generated audio in the browser (see the sketch below).
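A browser-side sketch of the same chain, in JavaScript to match the ONNX Runtime Web snippets above. The tensor names, file names, and 22050 Hz sample rate are assumptions carried over from the Python sketch:

```javascript
import * as ort from "onnxruntime-web";

async function speak(tokenIds) {
  const acoustic = await ort.InferenceSession.create("tacotron2.onnx");
  const vocoder = await ort.InferenceSession.create("hifigan.onnx");

  // int64 inputs must be passed as BigInt64Array in onnxruntime-web.
  const ids = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(BigInt)),
    [1, tokenIds.length]
  );

  const { mel } = await acoustic.run({ text_ids: ids });  // output name is an assumption
  const { audio } = await vocoder.run({ mel });           // Float32Array waveform

  // Play via the Web Audio API (assuming a mono, 22050 Hz waveform).
  const ctx = new AudioContext();
  const buffer = ctx.createBuffer(1, audio.data.length, 22050);
  buffer.copyToChannel(Float32Array.from(audio.data), 0);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}
```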
- Optimize & Finalize:
  - Measure performance. If it's slow, consider smaller models, quantization, or WebGPU.
  - Confirm end-to-end TTS (text → audio) runs within a reasonable time.
  - (Optional) Add a simple UI (text input box and a "Speak" button).
End-of-Stage 5 Outcome
- You have a functional in-browser TTS demo using ONNX Runtime Web. Users type in text, your acoustic + vocoder models generate audio, and the browser plays it back.
- Model-Specific Guides
  - The ONNX Model Zoo has many reference models and example code to compare against.
  - The Hugging Face Transformers ONNX docs offer additional tips on exporting advanced Transformers.
- Common Pitfalls
  - Operator support in the browser can be limited; check whether your TTS model uses any ops unsupported in WASM or WebGL.
  - Dynamic axes: define them carefully when exporting; TTS often needs variable sequence lengths.
  - Large graphs can be slow in JavaScript; investigate quantization or smaller architectures if real-time is desired.
- Time Management
  - Each day's task should be doable in roughly one hour, but if you find yourself needing more time for debugging, feel free to adapt.
  - Document each step so you can look back on your progress.
Following this daily plan (1 hour/day for about a month) will steadily build your expertise. You’ll start with simple ONNX exports, then tackle increasingly complex models, culminating in a Transformer-based TTS pipeline deployed in the browser. Good luck!