WaveOut wrapper around the Bestspeech/Keynote Gold TTS engine, allowing synthesized audio to be retrieved in memory.

Mohamed00/b32tts_wrapper

 
 

b32_wrapper

What is this?

In late 2024, someone who goes by rommix0 introduced the blind community to a Windows x86 version of the ancient yet sought-after Keynote Gold speech synthesis engine.

Unfortunately, this version of the synthesizer came with one major flaw: its only output method was the Win32 waveOut API. This means that you couldn't use it to, for example, synthesize speech to a wave file or add effects to the output.

This project works around the aforementioned issue by hooking the waveOut functions, capturing the audio sent to them, and building it into a single, easily accessible memory buffer.

Bestspeech outputs 16-bit, 11025 Hz, mono PCM audio.
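As a rough illustration of what that format implies, the sketch below computes the duration of a block of raw PCM bytes and wraps headerless PCM in a WAV container using Python's standard wave module. The function and constant names are hypothetical and are not part of this project.

```python
import io
import wave

SAMPLE_RATE = 11025   # Hz, as stated above
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def pcm_duration_seconds(num_bytes: int) -> float:
    """Duration of raw 16-bit mono PCM at 11025 Hz."""
    return num_bytes / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap headerless PCM in an in-memory WAV container."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    return out.getvalue()
```

One second of audio in this format occupies 22050 bytes, which is a handy number for sizing buffers.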

How to build?

This project uses the SCons build system. If you have SCons and a C++ compiler installed and on your PATH, you can just run scons -s in the root of this project to generate a .dll wrapper, a test program and a statically linked high level command line utility. Prebuilt binaries are provided on the releases page, or you can fork the repository and run the build action.

How to use?

Command line utility

If you just want to speak text out loud or to a file, you are likely going to be much more interested in the high level command line utility created from this wrapper instead of the wrapper's API itself. After downloading a binary release or building the project from source, you will see the file bin/b32_spk.exe. You can run b32_spk -h for the most up-to-date help on the program.

In short, the accepted switches are -v to select a voice, -t to set text to speak, -f to set an output filename and -r to set the speech rate.

For example,

b32_spk -ftts.wav -vHary -t"Hi there, my name is Hary!"

You can set the voice to a single question mark (?) to list available voices. For example, b32_spk -v?
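If you want to drive the utility from a script, a minimal sketch might look like the following. It assumes each value is attached directly to its switch, as in the example above, and that the executable lives at bin/b32_spk.exe. The helper names are hypothetical, and the accepted format of the -r value is not documented here, so check b32_spk -h before relying on it.

```python
import subprocess


def build_b32_spk_args(text, voice=None, out_file=None, rate=None,
                       exe="bin/b32_spk.exe"):
    """Build an argument list with each value attached to its switch."""
    args = [exe]
    if out_file is not None:
        args.append(f"-f{out_file}")
    if voice is not None:
        args.append(f"-v{voice}")
    if rate is not None:
        args.append(f"-r{rate}")   # value format is an assumption
    args.append(f"-t{text}")
    return args


def speak_to_file(text, voice, out_file):
    """Run the utility; needs Windows and a built or downloaded b32_spk.exe."""
    subprocess.run(build_b32_spk_args(text, voice, out_file), check=True)
```

Passing the arguments as a list avoids shell quoting problems with text that contains spaces or punctuation.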

NVDA addon

This repository comes bundled with an NVDA addon that uses this wrapper. You can find a prebuilt version of the addon on this repository's releases page. To make it from source, build b32_wrapper.dll, copy bin/b32_tts.dll and bin/b32_wrapper.dll to the nvda/synth_drivers folder, then zip the nvda directory up into a .nvda-addon package and install it. The addon gives control of many parameters, from basic things like pitch and rate to fun things like head size, excitation and unvoiced volume. It also comes bundled with Rommix's voices, as the wrapper itself does. The addon implements these custom voices itself rather than using the wrapper's functionality, so that the NVDA parameter sliders update in real time when a new voice is selected.
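The packaging steps above could be scripted roughly as follows. This is only a sketch: the function name and the output filename b32.nvda-addon are hypothetical, and it assumes the archive should contain the contents of the nvda directory at its top level rather than the nvda folder itself.

```python
import shutil
import zipfile
from pathlib import Path


def package_addon(repo_root, output="b32.nvda-addon"):
    """Copy the DLLs into nvda/synth_drivers, then zip the nvda directory."""
    root = Path(repo_root)
    nvda_dir = root / "nvda"
    drivers = nvda_dir / "synth_drivers"
    drivers.mkdir(parents=True, exist_ok=True)
    for name in ("b32_tts.dll", "b32_wrapper.dll"):
        shutil.copy2(root / "bin" / name, drivers / name)

    out_path = root / output   # written outside nvda/ so it is not zipped into itself
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(nvda_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to nvda/ (assumed archive layout).
                zf.write(path, path.relative_to(nvda_dir).as_posix())
    return out_path
```

Run it from a checkout after building, for example package_addon("."), and install the resulting file as you would any other addon.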

API

The main thing to note is that the DLL generated by this project, b32_wrapper.dll, should be loaded and used in place of b32_tts.dll, which this wrapper handles internally.

The interface is only composed of a few functions, not all of which you will need:

const char** bst_voices(int* count = nullptr);

You can call this function to get a list of supported voice names. These are actually entirely defined by this wrapper as a bonus, using values provided by @rommix0's BST.h file. The result is a null-terminated array of null-terminated C strings.
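For callers going through a foreign function interface, reading that array might look like the sketch below, written with Python's ctypes. It assumes the functions are exported with unmangled C names and the default C calling convention, which this README does not state. The helper names are hypothetical.

```python
import ctypes


def read_string_array(ptr):
    """Decode a NULL-terminated array of C strings into a Python list."""
    names = []
    i = 0
    while True:
        item = ptr[i]        # c_char_p elements come back as bytes, or None for NULL
        if item is None:     # terminating NULL entry
            break
        names.append(item.decode("ascii", errors="replace"))
        i += 1
    return names


def list_voices(lib):
    """Call bst_voices on an already loaded b32_wrapper.dll (a ctypes library object)."""
    lib.bst_voices.restype = ctypes.POINTER(ctypes.c_char_p)
    lib.bst_voices.argtypes = [ctypes.POINTER(ctypes.c_int)]
    # count defaults to nullptr in the C++ signature, so NULL is passed explicitly here.
    return read_string_array(lib.bst_voices(None))
```

The returned position of a name in this list is presumably the index you would pass as the voice argument when speaking, though that mapping is an assumption.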

bst_state* bst_init(const char* module_path = "b32_tts.dll");

bst_state* bst_init_w(const wchar_t* module_path = L"b32_tts.dll");

Call one of these functions to initialize an instance of the TTS engine and the associated wrapper. The module_path argument can be set to any valid relative or absolute path to b32_tts.dll, which is loaded internally when the function is called.

void bst_free(bst_state* s);

Call this function when you are done using the engine.

char* bst_speak(bst_state* s, long* size, const char* text, int voice = 0, int rate = 0, int gain = 0, bool pcm_header = true);

The main attraction, as it were: this function converts a string of text to a byte array of PCM audio data using the Bestspeech engine. The size argument should point to a long that will be filled with the number of bytes returned. The voice argument can be >=0 to use a builtin voice (call bst_voices for a list), or a negative value to pass the text string to the synthesizer verbatim. Rate can be >0 to speed up or <0 to slow down. I believe gain is in dB. If pcm_header is set to false, the 44-byte RIFF header will be omitted from the output. Free return values with bst_speech_free.
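A minimal sketch of calling bst_speak through Python's ctypes is shown below. It assumes unmangled C exports and the default C calling convention, treats bst_state* as an opaque pointer, and encodes text as ASCII; none of these are confirmed by this README. Default argument values exist only in the C++ declaration, so every argument is passed explicitly. The helper names are hypothetical.

```python
import ctypes


def declare_bst_speak(lib):
    """Attach ctypes prototypes matching the signatures listed in this section."""
    lib.bst_speak.restype = ctypes.c_void_p        # keep the raw address so it can be freed
    lib.bst_speak.argtypes = [
        ctypes.c_void_p,                           # bst_state* s (opaque)
        ctypes.POINTER(ctypes.c_long),             # long* size
        ctypes.c_char_p,                           # const char* text
        ctypes.c_int, ctypes.c_int, ctypes.c_int,  # voice, rate, gain
        ctypes.c_bool,                             # pcm_header
    ]
    lib.bst_speech_free.restype = None
    lib.bst_speech_free.argtypes = [ctypes.c_void_p]


def speak_to_bytes(lib, state, text, voice=0, rate=0, gain=0, pcm_header=True):
    """Synthesize text, copy the result into Python bytes, then free the native buffer."""
    size = ctypes.c_long(0)
    ptr = lib.bst_speak(state, ctypes.byref(size),
                        text.encode("ascii", "replace"),   # encoding is an assumption
                        voice, rate, gain, pcm_header)
    if not ptr:
        return b""
    try:
        return ctypes.string_at(ptr, size.value)
    finally:
        lib.bst_speech_free(ptr)
```

Copying into a bytes object before freeing means the caller never holds a pointer into memory owned by the wrapper.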

void bst_speak_async(bst_state* s, bst_async_callback callback, void* callback_user, const char* text, int voice = 0, int rate = 0, int gain = 0);

This version of the speak function allows your application to receive audio data as it is being synthesized, both for faster response times and for fewer memory management headaches. Callbacks take the signature bool bst_async_callback(char* text, long size, void* user) and only need to follow a couple of rules. Most importantly, you must process the provided data block BEFORE the callback function returns, or else its state in memory is undefined. Copy it if you need to. The callback can return false to abort synthesis or true to continue. The size argument is in bytes. This version of the function does not support PCM headers, as it is designed for situations where its output is meant to be streamed directly to an audio interface.
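The callback rules can be illustrated with a small ctypes sketch: each block is copied before the callback returns, and the return value decides whether synthesis continues. The default C calling convention is assumed, and make_collector and max_bytes are hypothetical names used only for this example.

```python
import ctypes

# Mirrors: bool bst_async_callback(char* text, long size, void* user)
BST_ASYNC_CALLBACK = ctypes.CFUNCTYPE(
    ctypes.c_bool, ctypes.POINTER(ctypes.c_char), ctypes.c_long, ctypes.c_void_p
)


def make_collector(chunks, max_bytes=None):
    """Return a callback that copies each block into `chunks` before returning."""
    def on_audio(data, size, user):
        # Copy immediately; the block's contents are undefined after we return.
        chunks.append(ctypes.string_at(data, size))
        total = sum(len(c) for c in chunks)
        # Returning False asks the wrapper to abort synthesis.
        return max_bytes is None or total < max_bytes
    return BST_ASYNC_CALLBACK(on_audio)
```

Keep a reference to the object returned by make_collector for as long as native code may call it, since ctypes does not keep callback objects alive on its own.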

void bst_speech_free(char* data);

Call this to free any data returned by bst_speak. Do not use this within the context of bst_speak_async.

Notes

  • This has not been tested in a multithreaded context. It's probably best to create a new bst_state for each thread.
  • If you want to build the wrapper statically, make sure to define b32w_export to an empty value on the compiler command line so that the dllexport attribute won't be applied to the wrapper's functions, and make sure to link with bin/MinHook.lib in that case.
  • I did try to be careful here to make sure that the waveOut hooks won't interfere with any other genuine usage of waveOut in programs that use this wrapper. This is accomplished in two ways: first, by leaving all API hooks disabled until the moment of synthesis and disabling them again the moment synthesis is complete; and second, with a thread_local check that should pass through any hooked waveOut calls on threads other than the one invoking the TTS engine.
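The hook safeguards described in the last note can be pictured with the following conceptual sketch. It is not the wrapper's actual code: it uses Python's threading.local as a stand-in for thread_local, collapses both safeguards into a single per-thread flag, and all names are hypothetical.

```python
import threading
from contextlib import contextmanager

_capture = threading.local()   # per-thread state, analogous to thread_local in C++


@contextmanager
def capturing():
    """Enable capture on the current thread only while synthesis is running."""
    _capture.buffer = bytearray()
    try:
        yield _capture.buffer
    finally:
        del _capture.buffer    # back to pass-through as soon as synthesis ends


def hooked_write(data, original_write):
    """Stand-in for a hooked waveOut call."""
    buffer = getattr(_capture, "buffer", None)
    if buffer is None:
        # Another thread, or no synthesis in progress: behave like the real API.
        return original_write(data)
    buffer.extend(data)        # synthesizing thread: capture instead of playing
    return len(data)
```

A call made on a different thread during synthesis still reaches original_write, which is the behaviour the note is aiming for.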
