This project fine-tunes the StyleTTS2 model on the LibriTTS dataset to produce high-quality, expressive, and controllable speech synthesis. It supports multispeaker speech generation and style control, and it can be extended for applications such as voice cloning, TTS APIs, and conversational agents.
Key features:

- Fine-tuning of StyleTTS2 on LibriTTS with second-stage training
- Multispeaker support
- Style embeddings via a diffusion model
- Integration of pretrained ASR (text aligner) and F0 (pitch extractor) models
- Mixed-precision (fp16) training via Hugging Face Accelerate
- Dockerized training pipeline
- Checkpoint management with AWS S3
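Checkpoint management here can be as simple as mirroring the model directory to S3 with the AWS CLI. A minimal sketch, assuming a bucket of your own (the bucket name and prefix are placeholders):

```bash
# Mirror local checkpoints to S3; only .pth files are uploaded
# (bucket name and prefix are placeholders -- adjust to your setup)
aws s3 sync Models/LibriTTS s3://<your-bucket>/styletts2/checkpoints \
    --exclude "*" --include "*.pth"
```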
Prerequisites:

- AWS EC2 instance with a GPU (e.g., `g4dn.xlarge`)
- NVIDIA Docker runtime (`--gpus all`)
- Docker image built locally or pulled from ECR
- AWS CLI configured with access to an S3 bucket
- Checkpoints from the base model (e.g., `epochs_2nd_00020.pth`)
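If you pull the image from ECR rather than building it locally, the standard login-then-pull flow looks like the following; the region, account ID, and repository name are placeholders for your own values:

```bash
# Authenticate Docker against ECR (region, account ID, and repo are placeholders)
aws ecr get-login-password --region us-east-1 \
    | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com

# Pull the training image
docker pull <account-id>.dkr.ecr.us-east-1.amazonaws.com/<your-repo>:latest
```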
```bash
# Run the Docker container (replace with your own image name)
docker run --gpus all -d --name styletts2-container <your-docker-image>

# Access the container
docker exec -it styletts2-container bash
```
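Before launching training, it is worth checking from inside the container that the GPU is actually visible; `nvidia-smi` should list the device, and PyTorch should report CUDA as available:

```bash
# Inside the container: confirm the GPU is visible to the driver and to PyTorch
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
```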
Customize `Configs/config_ft.yml`:

```yaml
log_dir: "Models/LibriTTS"
epochs: 35
batch_size: 2
pretrained_model: "Models/LibriTTS/epochs_2nd_00020.pth"
load_only_params: true
...
```
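The `pretrained_model` checkpoint must exist inside the container before launching. If you keep the base checkpoint in S3, a copy along these lines (bucket and key are placeholders) puts it in place:

```bash
# Fetch the base second-stage checkpoint from S3 (bucket and key are placeholders)
aws s3 cp s3://<your-bucket>/styletts2/base/epochs_2nd_00020.pth \
    Models/LibriTTS/epochs_2nd_00020.pth
```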
Then start fine-tuning:

```bash
accelerate launch --mixed_precision=fp16 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
```
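Fine-tuning takes a while, so one common pattern is to run the launch command under `nohup` and follow the log file; this is just one way to keep the run alive after the shell detaches:

```bash
# Keep the run alive after the shell detaches and follow its log
nohup accelerate launch --mixed_precision=fp16 train_finetune_accelerate.py \
    --config_path ./Configs/config_ft.yml > train.log 2>&1 &
tail -f train.log
```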