Indri

Indri is a series of multilingual audio models that can do TTS, ASR, and audio continuation. It currently supports these languages:

  1. English
  2. Hindi

This repo hosts the inference code for Indri models.

Samples

Text Sample
मित्रों, हम आज एक नया छोटा और शक्तिशाली मॉडल रिलीज कर रहे हैं। Sample
भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं। Sample
Hello दोस्तों, future of speech technology mein अपका स्वागत है Sample
In this model zoo, a new model called Indri has appeared. Sample

Key features

  1. Extremely small: based on the GPT-2 small architecture. The methodology can be extended to any autoregressive transformer-based architecture.
  2. Ultra-fast: with our self-hosted service option, the 124M model reaches speeds of up to 400 tokens/s (about 4 s of audio generated per second) with under 20 ms time to first token on an NVIDIA RTX 6000 Ada GPU.
  3. On an RTX 6000 Ada, it can support a batch size of ~1000 sequences at the full context length of 1024 tokens.
  4. Supports voice cloning with short prompts (<5 s).
  5. Supports code-mixed text input in two languages: English and Hindi.

Details

  1. Model Type: GPT-2 based language model
  2. Size: 124M parameters
  3. Language Support: English, Hindi
  4. License: This model is not for commercial use; it is a research showcase only.

Here's a brief overview of how the model works:

  1. Converts input text into tokens.
  2. Runs autoregressive decoding on the GPT-2 based transformer and generates audio tokens.
  3. Decodes the audio tokens back into a waveform using Kyutai/mimi (sketched below).
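
For intuition, here is a minimal sketch of those three stages using the Hugging Face transformers API; it is not the actual Indri implementation. The mapping from the GPT-2 vocabulary back to mimi codebook indices is Indri-specific and only indicated in comments, so random placeholder codes stand in for the output of stage 2.

import torch
from transformers import MimiModel

# Stage 1: input text is tokenized into the model's vocabulary (text tokens).
# Stage 2: the GPT-2 based transformer autoregressively generates audio tokens,
#          which correspond to entries in mimi's residual codebooks.
# Stage 3: mimi decodes those discrete audio tokens back into a 24 kHz waveform.
# Only stage 3 is shown concretely; random codes stand in for stage 2's output.
mimi = MimiModel.from_pretrained('kyutai/mimi')
frames = 100  # ~8 s of audio at mimi's 12.5 Hz frame rate
codes = torch.randint(
    0, mimi.config.codebook_size,
    (1, mimi.config.num_quantizers, frames),  # (batch, codebooks, frames)
)
waveform = mimi.decode(codes).audio_values    # shape: (batch, 1, samples) at 24 kHz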

Please read our blog here for more technical details on how it was built.

How to Get Started with the Model

🤗 pipelines

Pipelines are the easiest way to get started with the model. Use the code below:

import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-124m-tts'
task = 'indri-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # Update this based on your hardware
    trust_remote_code=True
)

# Generate speech for a list of input texts with the chosen speaker
output = pipe(['Hi, my name is Indri and I like to talk.'], speaker='[spkr_63]')

# Save the first generated waveform to a WAV file at 24 kHz
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)

Available speakers

Speaker ID Speaker name
[spkr_63] 🇬🇧 👨 book reader
[spkr_67] 🇺🇸 👨 influencer
[spkr_68] 🇮🇳 👨 book reader
[spkr_69] 🇮🇳 👨 book reader
[spkr_70] 🇮🇳 👨 motivational speaker
[spkr_62] 🇮🇳 👨 book reader heavy
[spkr_53] 🇮🇳 👩 recipe reciter
[spkr_60] 🇮🇳 👩 book reader
[spkr_74] 🇺🇸 👨 book reader
[spkr_75] 🇮🇳 👨 entrepreneur
[spkr_76] 🇬🇧 👨 nature lover
[spkr_77] 🇮🇳 👨 influencer
[spkr_66] 🇮🇳 👨 politician
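
To use a different voice, pass any of the speaker IDs above to the same pipeline call. For example, reusing the pipe object from the snippet above with the Hindi "recipe reciter" voice:

# Same pipeline as above, switching to a different speaker ID from the table
output = pipe(['मित्रों, हम आज एक नया छोटा और शक्तिशाली मॉडल रिलीज कर रहे हैं।'], speaker='[spkr_53]')
torchaudio.save('output_hindi.wav', output[0]['audio'][0], sample_rate=24000)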

Self-hosted service

git clone https://github.com/cmeraki/indri.git
cd indri
pip install -r requirements.txt

# Install ffmpeg (for Mac/Windows, refer here: https://www.ffmpeg.org/download.html)
sudo apt update -y
sudo apt upgrade -y
sudo apt install ffmpeg -y

python -m server --model_path 11mlabs/indri-0.1-124m-tts --device cuda:0 --port 8000

Defaults:

  • device: cuda:0
  • port: 8000

Open http://localhost:8000/docs to see the API documentation and test the service.
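
As an illustration only, a client call might look like the sketch below. The exact route names and payload fields are defined by the server and shown on its /docs page, so the /tts endpoint and its parameters here are assumptions, not the documented API.

import requests

# Hypothetical request: check http://localhost:8000/docs for the real route and schema.
resp = requests.get(
    'http://localhost:8000/tts',  # assumed endpoint name
    params={
        'text': 'Hi, my name is Indri and I like to talk.',  # assumed parameter name
        'speaker': '[spkr_63]',                               # assumed parameter name
    },
)
resp.raise_for_status()
with open('output.wav', 'wb') as f:
    f.write(resp.content)  # assumes the server returns raw audio bytes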

To run the GGUF quantized models, follow the instructions here.

Citation

If you use this model in your research, please cite:

@misc{indri-multimodal-alm,
  author       = {11mlabs},
  title        = {Indri: Multimodal audio language model},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/indri-voice/indri}},
  email        = {apurvagup@gmail.com, romit.73@gmail.com}
}

BibTeX

  1. nanoGPT

  2. Kyutai/mimi

@techreport{kyutai2024moshi,
      title={Moshi: a speech-text foundation model for real-time dialogue},
      author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
      Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
      year={2024},
      eprint={2410.00037},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.00037},
}
  3. Whisper

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
  4. silero-vad

@misc{SileroVAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}
