TorToiSe TTS

An unofficial PyTorch re-implementation of TorToise TTS.

Almost all of the documentation and usage is carried over from my VALL-E implementation, as documentation for this implementation is lacking; I whipped it up over the course of two days using knowledge I hadn't touched in a year.

Requirements

A working PyTorch environment.

  • python3 -m venv venv && source ./venv/bin/activate is sufficient.

Install

Simply run pip install git+https://git.ecker.tech/mrq/tortoise-tts@new or pip install git+https://github.com/e-c-k-e-r/tortoise-tts.
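For example, a fresh setup from scratch might look like the following (a sketch that simply combines the venv and pip commands above; use whichever remote you prefer):

    python3 -m venv venv && source ./venv/bin/activate
    pip install git+https://git.ecker.tech/mrq/tortoise-tts@new
    # or, equivalently:
    # pip install git+https://github.com/e-c-k-e-r/tortoise-tts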

Usage

Inferencing

Using the default settings: python3 -m tortoise_tts --yaml="./data/config.yaml" "Read verse out loud for pleasure." "./path/to/a.wav"

To inference using the included Web UI: python3 -m tortoise_tts.webui --yaml="./data/config.yaml"

  • Pass --listen 0.0.0.0:7860 if you're accessing the web UI from outside of localhost (or pass the host machine's local IP instead)

A LoRA can be loaded by appending --lora=./path/to/your/lora.sft to either of the above commands.
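Putting the above together, a typical session might look like the following (a sketch using only the flags documented above; the LoRA and output paths are placeholders):

    # one-shot inference with a LoRA appended to the command
    python3 -m tortoise_tts --yaml="./data/config.yaml" "Read verse out loud for pleasure." "./path/to/a.wav" --lora=./path/to/your/lora.sft

    # web UI reachable from other machines on the network
    python3 -m tortoise_tts.webui --yaml="./data/config.yaml" --listen 0.0.0.0:7860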

Training / Finetuning

Training is as simple as copying the reference YAML from ./data/config.yaml to any training directory of your choice (for example: ./training/ or ./training/lora-finetune/).
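For example (a sketch assuming a LoRA finetune directory; the directory name is arbitrary):

    mkdir -p ./training/lora-finetune
    cp ./data/config.yaml ./training/lora-finetune/config.yaml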

Dataset

A pre-processed dataset is required. Refer to the VALL-E implementation for more details. But to reiterate (a consolidated sketch of these commands follows the list):

  1. Populate your source voices under ./voices/{group name}/{speaker name}/.

  2. Run python3 -m tortoise_tts.emb.transcribe. This will generate a transcription with timestamps for your dataset.

  3. Run python3 -m tortoise_tts.emb.process. This will phonemize the transcriptions and quantize the audio.

  4. Wherever you copied the ./data/config.yaml, populate cfg.dataset.training with strings of the form {group name}/{speaker name}.

  5. Either copy, move, or symlink the resultant ./training/24KHz-mel/ folder into the directory containing your copied config.yaml, naming it data.

  6. Run python3 -m tortoise_tts.data --yaml="./path/to/your/training/config.yaml" --action=metadata to generate additional metadata, as the dataloader code is slop and needs to be updated.
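The command-line side of the above, taken together, looks roughly like this (a sketch; the training path is a placeholder, and steps 1, 4, and 5 are manual file placement / YAML edits not shown here):

    # 2. transcribe the voices under ./voices/{group name}/{speaker name}/ with timestamps
    python3 -m tortoise_tts.emb.transcribe
    # 3. phonemize the transcriptions and quantize the audio
    python3 -m tortoise_tts.emb.process
    # 6. generate additional metadata for the dataloader
    python3 -m tortoise_tts.data --yaml="./path/to/your/training/config.yaml" --action=metadata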

Trainer

To start the trainer, run python3 -m tortoise_tts.train --yaml="./path/to/your/training/config.yaml".

  • Type save to save at any time. Type quit to save and quit. Type eval to run evaluation / validation of the model.

For training a LoRA, uncomment the loras block in your training YAML.

For loading an existing finetuned model, create a folder with this structure, and load its accompanying YAML:

./some/arbitrary/path/:
    ckpt:
        autoregressive:
            fp32.pth # finetuned weights
    config.yaml

For LoRAs, replace the above fp32.pth with lora.pth.
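For example, the following lays that structure out on disk (a sketch; the destination path is arbitrary and the source of the finetuned weights is a placeholder):

    mkdir -p ./some/arbitrary/path/ckpt/autoregressive
    cp /path/to/finetuned/weights.pth ./some/arbitrary/path/ckpt/autoregressive/fp32.pth   # lora.pth instead for a LoRA
    cp /path/to/accompanying/config.yaml ./some/arbitrary/path/config.yaml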

To-Do

  • Validate that everything still works, because dependencies break things over time
  • Re-backport all the creature comforts from VALL-E
  • Reimplement original inferencing through TorToiSe (as done with api.py)
    • Reimplement candidate selection with the CLVP
    • Reimplement redaction with the Wav2Vec2
  • Implement training support (without DLAS)
    • Feature parity with the VALL-E training setup, including preparing a dataset ahead of time
  • Automagic offloading to CPU for unused models (for training and inferencing)
  • Automagic handling of the original weights into compatible weights
  • Reimplement added features from my original fork:
    • "Better" conditioning latents calculating
    • Use of KV-cache for the AR
    • Re-enable DDIM sampler
  • Extend the original inference routine with additional features:
    • non-float32 / mixed precision for the entire stack
      • Parts of the stack will whine about mismatching dtypes...
    • BitsAndBytes support
      • Provided Linears technically aren't used because GPT2 uses Conv1D instead...
    • LoRAs
    • Web UI
      • Feature parity with ai-voice-cloning
        • Although I feel a lot of its features are the wrong way to go about it.
    • Additional samplers for the autoregressive model (such as mirostat / dynamic temperature)
    • Additional samplers for the diffusion model (beyond the already included DDIM)
    • BigVGAN in place of the original vocoder
      • HiFiGAN integration as well
    • XFormers / flash_attention_2 for the autoregressive model
      • Beyond HF's internal implementation of handling alternative attention
      • Both the AR and diffusion models also do their own attention...
    • Saner way of loading finetuned models / LoRAs
    • Some vector embedding store to find the "best" utterance to pick
  • Documentation
    • this also includes a correct explanation of the entire stack (rather than the poor one I left in ai-voice-cloning)

Why?

To:

  • atone for the mess I've made by originally forking TorToiSe TTS with a bunch of slopcode, and the nightmare that ai-voice-cloning turned out to be.
  • unify the trainer and the inference-er.
  • implement additional features with much more ease, as I'm very familiar with my framework.
  • disillusion myself that it won't get better than TorToiSe TTS:
    • while it's faster than VALL-E, the quality leaves a lot to be desired (although this is simply due to the overall architecture).

License

Unless otherwise credited/noted in this README or within the designated Python file, this repository is licensed under AGPLv3.
