Skip to content

args, server: Simplify and fix bugs #86

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

danielzgtg
Copy link
Collaborator

User-facing improvements:

  • Bare ./tts-cli -p no longer crashes
  • Merged --use-espeak into just omitting --phonemizer-path
  • Errors are now text/plain instead of JSON
    • We don't need to write so much error handling code
    • OpenAI compatibility isn't important for errors
  • Server startup page now only disappears after everything loads
  • Fixed buffer reuse race condition between audio generation and HTTP output
  • conditional_prompt is now a JSON key, no longer an HTTP endpoint
    • This lifts the limitation of 1 worker thread

Internal improvements:

  • struct arg is now pointer-free and like a regular std::variant
    • args["my-arg"] (+ cast rarely) should feel like Python
  • Deduplicated args.add, updated .md, unified descriptions
  • Converted many copies to pass-by-reference
    • Less related files are left for the code style PR
  • Cached the only 2 JSON success responses
  • The main thread is now simply the listener thread, not a worker
  • Timeouts are now also enforced on the HTTP threads
  • Removed simple_response_map to avoid collisions and thread contention
    • simple_text_prompt_task is pooled and no longer leaked every request
  • This is the first time we're freeing runner memory

This PR was split from the future of #58, and depends on and includes the PR #85.

It's been a few weeks and no opposition surfaced. This change allows
new code to be written faster and less cluttered.
kokoro_model was being freed twice, firstly by
kokoro_duration_runner while in use by kokoro_runner, which then had a
use-after-free when it tried to properly free it.
kokoro_context was also double-freeing some backend data that
its base class runner_context would later free again.
There was a mismatched new/free in prepare_post_load.

After removing the double-free, I chose to add unique_ptr to let this
part hold on to RAII and ownership. reference_wrapper is instead used in
another PR to indicate non-ownership. Nearby pointers have not been
upgraded to unique_ptr, because they involve backends, the situation of
which has been difficult to untangle.

Before, there was a SIGSEGV on post-args-refactor server shutdown.
After, there the server exits cleanly on Ctrl+C.
This isn't visible on the main branch, which just leaks the memory.
@danielzgtg danielzgtg force-pushed the refactor/args branch 2 times, most recently from ac0db97 to 16471e5 Compare June 3, 2025 14:41
@mmwillet
Copy link
Owner

mmwillet commented Jun 3, 2025

@danielzgtg Thanks! this looks good. I won't have time to review this tonight, but will review it tomorrow. Please wait on review before merging (seeing as this is a big one).

@mmwillet mmwillet self-requested a review June 3, 2025 17:06
User-facing improvements
* Bare `./tts-cli -p` no longer crashes
* Merged --use-espeak into just omitting --phonemizer-path
* Errors are now text/plain instead of JSON
  * We don't need to write so much error handling code
  * OpenAI compatibility isn't important for errors
* Server startup page now only disappears after everything loads
* Fixed buffer reuse race condition between audio generation and HTTP output
* conditional_prompt is now a JSON key, no longer an HTTP endpoint
  * This lifts the limitation of 1 worker thread

Internal improvements
* struct arg is now pointer-free and like a regular std::variant
  * `args["my-arg"]` (+ cast rarely) should feel like Python
* Deduplicated args.add, updated .md, unified descriptions
* Converted many copies to pass-by-reference
  * Less related files are left for the code style PR
* Cached the only 2 JSON success responses
* The main thread is now simply the listener thread, not a worker
* Timeouts are now also enforced on the HTTP threads
* Removed simple_response_map to avoid collisions and thread contention
  * simple_text_prompt_task is pooled and no longer leaked every request
* This is the first time we're freeing runner memory
}()};

constexpr str get_espeak_id_from_kokoro_voice(str voice) {
return KOKORO_LANG_TO_ESPEAK_ID[voice[0]] ? KOKORO_LANG_TO_ESPEAK_ID[voice[0]] : "gmw/en-US";
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lack of cast to unsigned may permit an out-of-bounds read with Unicode, which none of the voices are named using, so is minor. It'll be fixed in the next PR.

Copy link
Owner

@mmwillet mmwillet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! Thanks for doing this. This is really excellent engineering work!

I've made some comments on a few minor issues that I've found while reviewing, but the only issue that is blocking is the change to no longer run the first worker on the main thread in server.cpp. See my comments below and this PR for more information.

The term Nit: prior to a pull request comment refers to the informal english verb nitpick which means to find or point out minor faults in a fussy or pedantic way. As such, when I use the prefix Nit: I am expressing a minor note that is often preferential and unnecessary. Feel free to ignore any Nit: comment that you disagree with. It will not bother me in the slightest.

using namespace std;
using namespace std::string_view_literals;
typedef std::string_view sv;
typedef const char * str;
Copy link
Owner

@mmwillet mmwillet Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (see the review comment for an explanation of what I mean when I prefix a comment with Nit:): I suspect it is more explicit and clear to just use const char * instead of str.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will revert both aliases. I'll see if I can move them to my IDE's live templates, so that it doesn't feel like 3x the amount of typing again every time I use a string.

std::string voice = "";
bool sample = true;
std::string espeak_voice_id = "";
bool use_cross_attn{true}; // TODO split out this load-time option from the rest of the generate-time configuration
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this isn't a load time option because we compile the graph at inference. With that said, I am probably the only one interested in using this option. There is little to no function for it in most use cases. My vote would be we simply remove it.

cout << " (-" << abbreviation << ")";
}
if (*description) {
cout << (required ? ":\n (REQUIRED) " : ":\n (OPTIONAL) ") << description << ".\n";
Copy link
Owner

@mmwillet mmwillet Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it is strange to have the last period added here given that all other punctuation is added by the description.

void update_conditional_prompt(tts_runner * runner, str file_path, str prompt, bool cpu_only) {
const auto parler{dynamic_cast<parler_tts_runner *>(runner)};
if (!parler) {
fprintf(stderr, "Wrong model for conditional prompt\n");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be more explicit here.

Suggested change
fprintf(stderr, "Wrong model for conditional prompt\n");
fprintf(stderr, "The '%s' architecture does not support conditional generation.\n", SUPPORTED_ARCHITECTURES[runner->arch]);

void process_task(simple_text_prompt_task & task) {
tts_runner * runner = &*runners[task.model];
if (*task.conditional_prompt) {
TTS_ASSERT(*q.get().text_encoder_path);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't possible because of our request validation right (i.e. line 405)?

Additionally, I don't think we generally want to use TTS_ASSERT inside the worker where it can be avoided, especially in response to request options, as that would give the end user a way of terminating a worker.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove as unnecessary. Although I interpret TTS_ASSERT as more like std::unreachable than input validation.

@mmwillet mmwillet self-requested a review June 5, 2025 17:40
Copy link
Owner

@mmwillet mmwillet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per comments, I am fine with this being merged, assuming we fast follow with an improvement to the conditional prompt behavior.

mmwillet added 2 commits June 5, 2025 13:48
add back missing header from resolving merge conflicts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants