A curated roadmap, based on my 5 years of experience, for going from zero to a skilled AI Speech Engineer. 🚀👨💻
This roadmap covers everything from fundamentals to cutting-edge research trends in the speech domain.
Phase | Duration | Focus Areas |
---|---|---|
🧠 Foundations | 3 months | Math, Python, Machine Learning, Deep Learning, Signal Processing |
💼 Tools & Frameworks | 3 months | Libraries, Audio Tools, Hugging Face |
🌱 Core Technologies | 12 months | ASR, TTS, Speaker Verification & Diarization |
🔬 Research Trends | Continuous | Audio-Language Models |
- 📺 1. What is a neural network?
- 📺 2. Gradient descent, how neural networks learn
- 📺 3. Backpropagation, intuitively
- 📺 4. Backpropagation calculus
- PyTorch - Framework for training models
- librosa - Audio preprocessing (STFT, MFCCs, etc.)
- torchaudio - Audio loading, transforms, and model wrappers
- ffmpeg, sox, pydub - Audio conversion, slicing, format handling
- noisereduce - Simple noise reduction from raw audio
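
To make the STFT named in the preprocessing item above a bit more concrete, here is a minimal sketch in plain Python (standard library only). It is not how librosa or torchaudio implement it; the function names `hann` and `stft_magnitude`, the frame length of 64, the hop of 32, and the 8 kHz test tone are all assumptions chosen for illustration. In practice you would call the library routines instead.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (illustrative helper)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Slice the signal into overlapping windowed frames and take a naive DFT.

    Returns a list of frames; each frame holds the magnitudes of the
    non-negative frequency bins 0..frame_len // 2.
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Assumed test input: a 1 kHz sine sampled at 8 kHz, 512 samples long.
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = stft_magnitude(signal)
# Strongest bin in the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), len(spec[0]), peak_bin, peak_hz)
```

Each bin spans `sample_rate / frame_len` Hz, so a longer frame gives finer frequency resolution at the cost of coarser time resolution. That trade-off is the main knob to understand before moving on to mel spectrograms and MFCCs.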
- Audacity - Free, powerful software for editing and visualizing audio
- Audacity Tutorial
- Hugging Face Audio - Learn to tackle a range of audio-related tasks and gain experience with speech datasets.
- SpeechBrain ASR
- SpecAugment
- Generation of large-scale simulated utterances in virtual rooms...
- Illustrated Wav2Vec2
- Sequence Modeling With CTC
- Wav2Vec2
- Whisper
- Fast Conformer
- My graduation thesis (Vietnamese) (2021)
- HMM-based Vietnamese TTS
- WaveNet: A Generative Model for Raw Audio (2016)
- Tacotron: Towards End-to-End Speech Synthesis (2017)
- WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
- FastSpeech 1: Fast, Robust and Controllable Text to Speech (2019)
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (2021)
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (2020)
- VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (2021)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech (2022)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022)
- Speaker Verification Introduction
- X-vector Paper
- I-vector Paper
- ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation...
- VoxCeleb: a large-scale speaker identification dataset
- ResNeXt and Res2Net Structures for Speaker Verification
- Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification
- CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking
- ReDimNet: Reshape Dimensions Network for Speaker Recognition
- 3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus...
- ERes2NetV2: Boosting Short-Duration Speaker...
- Speaker Diarization: An Introductory Overview
- Speaker Diarization: From Traditional Methods to the Modern Models
- pyannote.audio: neural building blocks for speaker diarization
- A Review of Speaker Diarization: Recent Advances with Deep Learning
- Comparing state-of-the-art speaker diarization frameworks: Pyannote vs NeMo
- Multi-scale Speaker Diarization with Dynamic Scale Weighting
- DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
- Sortformer: Seamless Integration of Speaker Diarization and ASR...
- Streaming Sortformer: Speaker Cache-Based Online ...
- On The Landscape of Spoken Language Models: A Comprehensive Survey
- Recent Advances in Speech Language Models: A Survey
- Audio-Language Models for Audio-Centric Tasks: A survey
- CosyVoice: A Scalable Multilingual Zero-shot Text to Speech...
- F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
- FunAudioLLM: Voice Understanding and Generation Foundation Models...
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale...
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable...