
Audiobook Maker v3

This application utilizes open-source deep-learning text-to-speech and speech-to-speech models to create audiobooks. The main goal of the project is to be able to seamlessly create high-quality audiobooks by using these advancements in machine learning/AI.

It's designed for Windows, but PySide6 should be able to run on Linux.


Features

✔️ Multi-speaker/engine generation, letting you choose who speaks each sentence

✔️ Audio playback of individually generated sentences, or play everything back to listen as it generates

✔️ Save in place to continue generating later (pick up from where you stopped)

✔️ Bulk sentence regeneration and editing, to regenerate audio for a sentence or change which speaker and/or engine is used for it

✔️ Reloading previous audiobooks and exporting audiobooks

✔️ Sentence remapping in case you need to update the original text file that was used for generation

✔️ Integration with popular open-source models like TortoiseTTS, RVC, StyleTTS, F5TTS, XTTS (to be added) and GPT-SoVITS

Windows Package Installation

Available for YouTube Channel Members at the Supporter (Package) level: https://www.youtube.com/channel/UCwNdsF7ZXOlrTKhSoGJPnlQ/join or via purchase here: https://buymeacoffee.com/jarodsjourney/extras

Pre-requisites

  1. Download the zip file provided to you on the members community tab.
  2. Unzip the folder
  3. To set up StyleTTS 2, double-click and run finish_styletts_install.bat
  4. Run the start.bat file

And that's it! (maybe)

For F5-TTS, an additional download will occur the first time you use it, since the pretrained base model is licensed CC-BY-NC-4.0 and cannot be bundled.

Manual Installation (Windows 10/11)

Pre-requisites

GUI Installation

  1. Clone the repository and cd into it.
    git clone https://github.com/JarodMica/audiobook_maker.git
    cd audiobook_maker
    
  2. Create a venv with Python 3.11 and then activate it. If you can't activate the venv due to restricted permissions, see: https://superuser.com/questions/106360/how-to-enable-execution-of-powershell-scripts
    py -3.11 -m venv venv
    .\venv\Scripts\activate
    
  3. Install the basic requirements to get the GUI to open
    pip install -r .\requirements.txt
    
  4. Pull submodules
    git submodule init
    git submodule update
    
  5. Launch the interface
    python .\src\controller.py
    
  6. (Optional) I recommend creating a batch script to launch the GUI instead of doing it manually each time. Open Notepad, paste the code block below into it, and save it as start.bat. Make sure file extensions are visible so you don't end up with start.bat.txt
    call venv\Scripts\activate
    python src\controller.py
    

Congrats, the GUI can now be launched! You may see errors in the terminal such as Tortoise not installed or RVC not installed

If you use it like this, you will only be able to use pyttsx3. To install additional engines, refer to the sections below; I recommend installing all of them.
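For reference, pyttsx3 is a basic offline TTS library that works without any extra setup. Here's a minimal sketch of using it standalone (the sentence and output filename are just examples, not part of the audiobook maker):

import pyttsx3

# Initialize the default system TTS engine (SAPI5 voices on Windows).
engine = pyttsx3.init()

# Speak a sentence out loud...
engine.say("Hello from the audiobook maker.")
engine.runAndWait()

# ...or render the same sentence to an audio file instead.
engine.save_to_file("Hello from the audiobook maker.", "example.wav")
engine.runAndWait()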

Text-to-Speech Engines

TortoiseTTS Installation

  1. Make sure your venv is still activated (if not, activate it), and pull the repo first if you are updating an older install:
    .\venv\Scripts\activate
    
  2. Change directory to the tortoise submodule, then pull its submodules:
    cd .\modules\tortoise_tts_api\
    git submodule init
    git submodule update
    
  3. Install the submodules:
    pip install modules\tortoise_tts
    pip install modules\dlas
    
  4. Install the tortoise tts api repo, then cd back to root:
    pip install .
    cd ..\..
    
  5. Make sure the requirements are at the correct versions:
    pip install -r requirements.txt
    
  6. Ensure you have PyTorch installed with CUDA enabled (see Check Torch Install)

StyleTTS 2 Installation

  1. Make sure your venv is still activated (if not, activate it), and pull the repo first if you are updating an older install:

    .\venv\Scripts\activate
    
  2. Change directory to the styletts submodule, then pull its submodules:

    cd .\modules\styletts-api\
    git submodule init
    git submodule update
    
  3. Install the submodules:

    pip install modules\StyleTTS2
    
  4. Install the styletts api repo, then cd back to root:

    pip install .
    cd ..\..
    
  5. Install monotonic_align with the precompiled wheel that I've built here; put it in the repo root and run the command below. This will NOT work if you are using a different version of Python:

    pip install monotonic_align-1.2-cp311-cp311-win_amd64.whl
    
    • Alternatively, if you are running a different Python version, you will need the Microsoft C++ Build Tools to install it yourself: https://visualstudio.microsoft.com/downloads/?q=build+tools
      pip install git+https://github.com/resemble-ai/monotonic_align.git@78b985be210a03d08bc3acc01c4df0442105366f
      
  6. Get the eSpeak-NG files and the base StyleTTS 2 model by running finish_styletts_install.bat:

    .\finish_styletts_install.bat
    
    • Alternatively, install eSpeak-NG onto your computer. Head over to https://github.com/espeak-ng/espeak-ng/releases and select espeak-ng-X64.msi from the assets dropdown. Download it, run it, and follow the prompts to set it up on your device. As of this write-up, it's at the bottom of the 1.51 release on the GitHub releases page
      • You will also need to add the following environment variables (a per-process alternative is sketched at the end of this section):
      PHONEMIZER_ESPEAK_LIBRARY="c:\Program Files\eSpeak NG\libespeak-ng.dll"
      PHONEMIZER_ESPEAK_PATH="c:\Program Files\eSpeak NG"
      
  7. Make sure the requirements are at the correct versions:

    pip install -r requirements.txt
    
  8. Ensure you have PyTorch installed with CUDA enabled (see Check Torch Install)
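If you installed eSpeak NG manually, here's a minimal sketch of setting those variables per-process instead of system-wide, assuming the default install path; the phonemize call at the end is only there to verify that phonemizer (used by StyleTTS 2) picks them up:

import os

# Must be set before phonemizer is imported; default eSpeak NG install path assumed.
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
os.environ["PHONEMIZER_ESPEAK_PATH"] = r"C:\Program Files\eSpeak NG"

from phonemizer import phonemize

# Should print phonemes for the sentence if eSpeak NG is wired up correctly.
print(phonemize("Hello world", language="en-us", backend="espeak"))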

F5-TTS Installation

  1. Make sure your venv is still activated (if not, activate it), and pull the repo first if you are updating an older install:
    .\venv\Scripts\activate
    
  2. Install the F5-TTS submodule as a package:
    pip install .\modules\F5-TTS
    
  3. Make sure the requirements are at the correct versions:
    pip install -r requirements.txt
    
  4. Ensure you have PyTorch installed with CUDA enabled (see Check Torch Install)

GPT-SoVITS Installation

  1. Make sure your venv is still activated (if not, activate it), and pull the repo first if you are updating an older install:
    .\venv\Scripts\activate
    
  2. Install the GPT-SoVITS-Package submodule:
    pip install .\modules\GPT-SoVITS-Package\
    
  3. Install nltk requirements:
    python install_gpt_sovits_nltk.py
    
  4. Make sure the requirements are at the correct versions:
    pip install -r requirements.txt
    
  5. GPT-SoVITS base models will be downloaded automatically when you first start a generation. Any time there is an update to the remote HF repo, new files will be downloaded. This behavior can be disabled by setting auto_download_gpt_sovits in config\setting.yaml to False instead of True (see the snippet after this list).
  6. Ensure you have PyTorch installed with CUDA enabled (see Check Torch Install)
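For reference, a sketch of what the relevant line in config\setting.yaml looks like (key name as stated above; other settings omitted):

# config\setting.yaml (other settings omitted)
auto_download_gpt_sovits: False  # set back to True to re-enable automatic downloads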

Speech-to-Speech Engines

RVC Installation

  1. Make sure your venv is still activated (if not, activate it), and pull the repo first if you are updating an older install:

    .\venv\Scripts\activate
    
  2. Install fairseq as a wheel file. Download it from this link here https://huggingface.co/Jmica/rvc/resolve/main/fairseq-0.12.4-cp311-cp311-win_amd64.whl?download=true and place it in the audiobook_maker folder:

    pip install .\fairseq-0.12.4-cp311-cp311-win_amd64.whl
    

    It's done this way due to issues with fairseq on Python 3.11 and above, so I've compiled a wheel file for you to use. You can delete it afterwards if you want.

  3. Install the rvc-python library:

    pip install .\modules\rvc-python\
    
  4. Make sure the requirements are at the correct versions:

    pip install -r requirements.txt
    
  5. Ensure you have PyTorch installed with CUDA enabled (see Check Torch Install)

Check Torch Install

Sometimes, torch may be re-installed by other dependencies, so we want to be sure we're on the right version.

Check torch version:

pip show torch

As long as pip reports Version: 2.7.0+cu128, you should be fine. If not, follow the steps below.

Blackwell GPUs (NVIDIA 50 series) need PyTorch 2.7.0 or higher with CUDA 12.8 or above:

pip uninstall torch -y
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

Torch is a pretty large download, so it may take a bit of time. Once it's installed here, following the rest of the install should be fine. However, newer versions of torch may sometimes be pulled in and replace the one we just installed, so you may need to uninstall and reinstall after each engine to make sure you have the correct version. After the first install the download is cached, so you won't have to wait each time afterwards.
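To confirm torch can actually see your GPU after (re)installing, here's a quick sanity-check sketch you can run inside the venv:

import torch

# The version string should end in +cu128 for the CUDA 12.8 build.
print(torch.__version__)

# True means torch can use the GPU; print the device name as an extra check.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))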

Updating the Package

If there are updates to the Audiobook Maker, you may need to pull new files from the source repo in order to gain access to new functionality.

  1. Open up a terminal in the Audiobook Maker folder (if one isn't open already) and run:
    git pull
    git submodule update
    

If you run into issues where you can't pull the updates, you may have made edits to the code base. In this case, you will need to stash your changes so that you can pull. I won't go over how to reapply custom mods, as that dives into git conflicts, etc.

git stash
git pull
git submodule update

Usage

To be written

Acknowledgements

This has been put together using a variety of open-source models and libraries. Wouldn't have been possible without them.

TTS Engines:

  • TortoiseTTS
  • StyleTTS 2
  • F5-TTS
  • GPT-SoVITS
  • pyttsx3

S2S Engines:

  • RVC

Licensing

Each engine used here is MIT or Apache-2.0. However, base pretrained models may have their own licenses or use limitations, so please be aware of that depending on your use case. I am not a lawyer, so I will just state what the licenses are.

StyleTTS 2

The pretrained model states:

Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.

F5 TTS

The pretrained base was trained on the Emilia dataset, so it is non-commercial (CC-BY-NC-4.0).