An expressive pseudo Speech-to-Speech system 🗣️ for HRI experiments 🤖, a part of Do You Feel Me?
Paige Tuttösí, Shivam Mehta, Zachary Syvenky, Bermet Burkanova, Gustav Eje Henter, and Angelica Lim
This is the official code implementation of EmojiVoice for RO-MAN 2025.
We have created a wrapper for Matcha-TTS to aid HRI researchers in training custom lightweight, expressive voices.
We have added:
- Training files setup: examples, raw data, and 3 checkpoints (with and without optimizers)
- Additional information on the amount of data needed to fine-tune
- Scripts to record the data
- Wrappers to parse emojis in text to prompt the voices at generation time (see the sketch after this list)
- A conversational agent chaining ASR -> LLM -> EmojiVoice
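As a rough illustration of how the emoji parsing works, here is a minimal sketch (the mapping and function names here are hypothetical; the real wrapper and its emoji-to-speaker mapping live in feel_me.py):

```python
import re

# Hypothetical emoji-to-speaker mapping; the real one is configured
# in feel_me.py under TTS PARAMETERS.
EMOJI_TO_SPEAKER = {"🙂": 4, "😭": 8, "😡": 9}
DEFAULT_SPEAKER = 4

def parse_emoji_text(text):
    """Split LLM output into (speaker_id, sentence) chunks, where an
    emoji switches the voice used for the text that follows it."""
    segments, speaker = [], DEFAULT_SPEAKER
    for token in re.split("([🙂😭😡])", text):
        if token in EMOJI_TO_SPEAKER:
            speaker = EMOJI_TO_SPEAKER[token]
        elif token.strip():
            segments.append((speaker, token.strip()))
    return segments

print(parse_emoji_text("🙂 Hello there! 😭 I lost my keys."))
# -> [(4, 'Hello there!'), (8, 'I lost my keys.')]
```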
Read the paper here
See our demo page here
EmojiVoice now supports multilingual synthesis for:
- French
- German
- Japanese - with an updated phonemizer
Your updates! Please reach out and make PRs for any issues or needed updates.
Also contact us if you are interested in other languages.
The system is structured as follows:
ASR -> LLM -> TTS
- ASR: a modified version of WhisperLive
- LLM: an Ollama and LangChain chatbot implementation of Llama 3
- TTS: fine-tuned Matcha-TTS
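Conceptually, the loop in feel_me.py chains these three stages. The sketch below is a standalone toy version, where transcribe, chat, and speak are hypothetical stubs standing in for WhisperLive, the Ollama/LangChain chatbot, and Matcha-TTS:

```python
# Toy sketch of the ASR -> LLM -> TTS loop; the real loop is in feel_me.py.
def transcribe() -> str:
    return input("You: ")  # stand-in for streaming Whisper ASR

def chat(text: str) -> str:
    return f"🙂 You said: {text}"  # stand-in for the Llama 3 chatbot

def speak(text: str, speaker: int = 4) -> None:
    print(f"[voice {speaker}] {text}")  # stand-in for Matcha-TTS playback

while True:
    user_text = transcribe()
    if "end session" in user_text.lower():
        break
    speak(chat(user_text))
```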
We currently have 3 available emoji checkpoints:
- Paige - Female, intense emotions
- Olivia - Female, subtle emotions
- Zach - Male
Current checkpoints and data can be found here.
We have left an empty folder (Matcha-TTS/models) where we suggest storing them; they must be stored there to directly run our case studies.
To see per-model (WhisperLive and Matcha-TTS) information and make edits within the pipeline, see the internal READMEs in the respective folders.
Clone this repo
git clone git@github.com:rosielab/do_you_feel_me.git
Create a conda environment or virtualenv and install the requirements:
conda create -n emojivoice python=3.11 -y
conda activate emojivoice
Note: this repo has been tested with Python 3.11.9
cd emojivoice/Matcha-TTS
pip install -e .
Example implementations for case studies can be found in case_studies
Example implementations with the Pepper robot can be found in hri-demo
You will need to pull the Llama 3 model. This model is best for English; you may need to change it for other languages or use cases.
In particular, this model does not seem very good at Japanese, and if you are using Japanese we suggest trying another model.
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3
If Ollama is not already running, you may need to run this before ollama run llama3:
ollama serve
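To sanity-check that Ollama is reachable before launching the full pipeline, a minimal LangChain call looks roughly like this (assuming the langchain-ollama package; the chatbot code in this repo may use a different LangChain interface):

```python
# Quick Ollama smoke test via LangChain.
# Assumes `pip install langchain-ollama`, `ollama serve` running,
# and the llama3 model already pulled.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3", temperature=0.7)
response = llm.invoke("Reply with one short sentence and one emoji.")
print(response.content)
```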
You will need espeak to run Matcha-TTS:
sudo apt-get install espeak-ng
You will find the code for the conversational agent in feel_me.py.
At the top you will find many possible customizations (see below), as well as some variables to set for your environment:
specifically, the path to your model checkpoints, the language (the Whisper model will also need to be changed), and the
emoji-to-speaker mapping. These are under TTS PARAMETERS.
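For illustration, that block might look something like this (variable names and the checkpoint filename here are hypothetical; check the actual names at the top of feel_me.py):

```python
# Hypothetical TTS PARAMETERS block; the real names are in feel_me.py.
MATCHA_CHECKPOINT = "Matcha-TTS/models/emoji-paige.ckpt"  # path to your checkpoint
LANGUAGE = "en"             # must match the checkpoint's language
WHISPER_MODEL = "small.en"  # change alongside LANGUAGE
EMOJI_TO_SPEAKER = {        # emoji -> speaker number (numbering assumed)
    "😎": 0, "🤔": 1, "😍": 2, "🤣": 3, "🙂": 4, "😮": 5,
    "🙄": 6, "😅": 7, "😭": 8, "😡": 9, "😁": 10,
}
```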
Then run:
python feel_me.py
You can end the session by saying 'end session'
It is possible to customize the pipeline. You can perform the following modifications:
- Modify the LLM prompt and emojis
- Change to a different LLM available from Ollama
- Change the Whisper model
- Change the temperature of the TTS and LLM
- Use a different Matcha-TTS checkpoint
- Modify the speaking rate
- Change the number of steps in the ODE solver for the TTS
- Change the TTS vocoder
All of these changes can be found at the top of feel_me.py
Currently the system contains 11 emoji voices: 😎🤔😍🤣🙂😮🙄😅😭😡😁
If you wish to change the personality of the chatbot or the emojis used by the chatbot, edit the PROMPT parameter.
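For example, a hypothetical PROMPT (the actual default in feel_me.py differs) could look like:

```python
# Hypothetical example of a PROMPT value; see feel_me.py for the real one.
PROMPT = (
    "You are a friendly robot assistant. Keep replies short and "
    "conversational. Start every sentence with exactly one of these "
    "emojis, matching the emotion of the sentence: 😎🤔😍🤣🙂😮🙄😅😭😡😁"
)
```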
If you wish to use a different voice or add new emojis, you can quickly and easily fine-tune Matcha-TTS to create your own voice.
Matcha-TTS can be fine-tuned for your own emojis with as little as 2 minutes of data per emoji. The new checkpoint can be trained directly from the base Matcha-TTS checkpoint (see the README for links) or from our provided checkpoints.
You can use our script record_audio.py to easily record your data and get_duration.ipynb to check the duration of all of your recordings. If fine-tuning from a checkpoint, the sampling rate of the audio files must be 22050 Hz.
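If you prefer a quick standalone check over the notebook, the following stdlib-only snippet (assuming your recordings sit in a recordings/ folder) totals the duration of your .wav files:

```python
# Sum the duration of all .wav recordings (standard library only).
import wave
from pathlib import Path

total_seconds = 0.0
for wav_path in Path("recordings").glob("*.wav"):
    with wave.open(str(wav_path), "rb") as w:
        total_seconds += w.getnframes() / w.getframerate()
print(f"Total: {total_seconds / 60:.1f} min")  # aim for ~2 min per emoji
```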
To record audio, create an <emoji_name>.txt file where each line is a script to read, then set the emoji and emoji name (file name) with the EMOJI_MAPPING parameter in record_audio.py.
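For example, a hypothetical EMOJI_MAPPING (check record_audio.py for the exact structure) might map each emoji to its script file and output name:

```python
# Hypothetical EMOJI_MAPPING; see record_audio.py for the real structure.
EMOJI_MAPPING = {
    "😡": "angry",  # reads scripts from angry.txt, names recordings accordingly
    "😭": "sad",    # reads scripts from sad.txt
}
```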
When fine-tuning you will be overwriting the current voices. In general, we have produced better-quality voices when the voice selected to overwrite is similar to the target voice, e.g. same accent and gender. To easily hear all the voices along with their speaker numbers, use this Hugging Face space.
Follow the information in the README for fine-tuning on the VCTK checkpoint, where each speaker number is an emoji number. You may see our data and transcription setup in emojis-hri-clean.zip here as an example.
With the multilingual update we have trained a cleaner and more robust English baseline; we suggest fine-tuning from LibriTTS-R-emoji-base-training.ckpt.
We also provide base voices for other languages; however, we do not guarantee how successfully they can be fine-tuned.
FOR MULTILINGUAL FINE-TUNING, THE CLEANERS MUST BE SET IN THE CONFIGS (see your corresponding cleaner in cleaners).
You want very clean, high-quality audio for the best results.
First create your own experiment and data configs following the examples, mapping to your transcription
file location. The two primary configs to create (and in which to check the paths to the data) are one in data and
one in experiments. The paths here should point to where your train and validation files are stored,
and your train and validation files should point to your audio file locations. You can test that all of these files point the right way before training by running matcha-data-stats -i ljspeech.yaml, as per the Matcha repo training steps.
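As an illustration, multi-speaker filelist lines typically take the form <path to wav>|<speaker number>|<transcript> (this follows the Matcha-TTS/VCTK convention; the paths and sentences below are made up, so confirm the format against the example configs and emojis-hri-clean.zip):

```
data/emoji/angry_001.wav|9|I can't believe you did that!
data/emoji/happy_001.wav|4|What a lovely day outside.
```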
Then follow the original Matcha-TTS instructions.
To train from a checkpoint run:
python matcha/train.py experiment=<YOUR EXPERIMENT> ckpt_path=<PATH TO CHECKPOINT>
You can train off of the Matcha base release checkpoints or the EmojiVoice checkpoints.
To run multi-speaker synthesis:
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --spk <SPEAKER NUMBER> --vocoder hifigan_univ_v1 --speaking_rate <SPEECH RATE>
If you are having issues, CUDA can sometimes make the error messages convoluted; run training in CPU mode (set the accelerator to cpu and remove devices) to get clearer error outputs.
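Depending on your config layout, that is usually a Hydra override along these lines (assuming the lightning-hydra-template-style trainer configs that Matcha-TTS is built on):
python matcha/train.py experiment=<YOUR EXPERIMENT> ckpt_path=<PATH TO CHECKPOINT> trainer=cpu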
- Create an environment (suggested but optional)
conda create -n emojivoice python=3.11 -y
conda activate emojivoice
- Install Matcha-TTS from source
cd emojivoice/Matcha-TTS
pip install -e .
- Run CLI
We have added a play-only option, which is used in the EmojiVoice experiment setups. Here the audio is played and no .wav file is saved.
The default language is English; please ensure you provide the correct language to match your checkpoint.
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --play
For a language other than English:
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --play --language fr
To save the audio file:
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT>
- To synthesise from a file, run:
matcha-tts --file <PATH TO FILE> --checkpoint_path <PATH TO CHECKPOINT> --play
- To batch synthesise from a file, run:
matcha-tts --file <PATH TO FILE> --checkpoint_path <PATH TO CHECKPOINT> --batched --play
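Here <PATH TO FILE> is a plain-text file; we assume one utterance per line, as in the upstream Matcha-TTS CLI, for example:

```
Hello, how are you today?
That is wonderful news!
```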
Additional arguments
- Speaking rate
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --speaking_rate 1.0 --play
- Sampling temperature
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --temperature 0.667 --play
- Euler ODE solver steps
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT> --steps 10 --play