TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis

Yu Zhang, Wenxiang Guo, Changhao Pan, Dongyu Yao, Zhiyuan Zhu, Ziyue Jiang, Yuhan Wang, Tao Jin, Zhou Zhao | Zhejiang University

PyTorch implementation of TCSinger 2 (ACL 2025): Customizable Multilingual Zero-shot Singing Voice Synthesis.


Visit our demo page for audio samples.

News

  • 2025.07: We released the code of TCSinger 2!
  • 2025.07: We released the code of STARS!
  • 2025.05: TCSinger 2 is accepted by ACL 2025!

Key Features

  • We present TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.
  • We introduce the Blurred Boundary Content Encoder for robust modeling and smooth transitions of phoneme and note boundaries.
  • We design the Custom Audio Encoder, which uses contrastive learning to extract styles from various prompts, while the Flow-based Custom Transformer with Cus-MOE and F0 enhances synthesis quality and style modeling.
  • Experimental results show that TCSinger 2 outperforms baseline models in subjective and objective metrics across multiple tasks: zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.

Quick Start

We provide an example of how to train your own model and run inference with TCSinger 2.

To try it on your own dataset, clone this repo to a local machine with an NVIDIA GPU and CUDA/cuDNN installed, then follow the instructions below.

Dependencies

A suitable conda environment named tcsinger2 can be created and activated with:

conda create -n tcsinger2 python=3.10
conda activate tcsinger2
conda install --yes --file requirements.txt
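A quick sanity check (assuming requirements.txt installs PyTorch) to confirm the environment can see your GPUs:

# Verify the install; these are standard PyTorch APIs.
import torch
print(torch.__version__)            # installed PyTorch version
print(torch.cuda.is_available())    # True if CUDA is usable
print(torch.cuda.device_count())    # number of visible GPUs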

Multi-GPU

By default, this implementation uses as many GPUs in parallel as torch.cuda.device_count() returns. You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
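For example, to restrict a run to the first two GPUs (standard PyTorch/CUDA behavior, not specific to this repo), set the variable before any CUDA call:

# Must be set before CUDA is initialized, e.g., at the top of the entry script.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # now reports 2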

Train your own model

Data Preparation

  1. Collect your own singing dataset (e.g., GTSinger), and feel free to add extra data annotated with alignment tools such as STARS.
  2. Place metadata.json (fields: ph, word, item_name, ph_durs, wav_fn, singer, ep_pitches, ep_notedurs, ep_types, emotion, singing_method, technique) and phone_set.json (the complete phoneme list) in the desired folder, and update the paths in preprocess/preprocess.py. (A reference metadata.json is provided in GTSinger; an illustrative entry is also sketched after this list.) Write the singer attribute as a description of the performer's gender and vocal range, and write the technique attribute either as a concise list of skills or as a natural-language description that conveys their order.
  3. Extract F0 for each .wav file and save it as *_f0.npy, e.g., with RMVPE (a fallback sketch follows this list).
  4. Download HiFi-GAN as the vocoder into useful_ckpts/hifigan and FLAN-T5 into useful_ckpts/flan-t5-large.
  5. Preprocess the dataset:
export PYTHONPATH=.
python preprocess/preprocess.py

Tip: You may also convert your dataset directly to a .csv instead of using metadata.json.

  6. Compute mel-spectrograms:
python preprocess/mel_spec_48k.py --tsv_path data/new/data.tsv --num_gpus 1 --max_duration 20
  7. Post-process:
python preprocess/postprocess_data.py
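For illustration, a single metadata.json entry might look like the sketch below. All field values are hypothetical and the field semantics are inferred; consult the reference metadata.json in GTSinger for the authoritative format.

# Hypothetical metadata.json entry, written as a Python dict for readability.
# Values and field semantics are illustrative assumptions, not taken from GTSinger.
example_entry = {
    "item_name": "singer01#song01#0000",
    "wav_fn": "data/wavs/singer01/song01_0000.wav",
    "ph": ["sh", "ang", "h", "ai"],                # phoneme sequence
    "word": ["上", "上", "海", "海"],              # word aligned per phoneme
    "ph_durs": [0.08, 0.31, 0.10, 0.42],           # phoneme durations in seconds
    "ep_pitches": [62, 62, 64, 64],                # MIDI pitch per phoneme
    "ep_notedurs": [0.39, 0.39, 0.52, 0.52],       # note durations in seconds
    "ep_types": [1, 1, 1, 1],                      # note-type tags
    "singer": "a female singer with an alto vocal range",
    "emotion": "happy",
    "singing_method": "pop",
    "technique": "starts with breathy voice, then shifts to mixed voice",
}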
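If RMVPE is unavailable, a minimal fallback using pyworld (an assumption; any reliable F0 estimator that writes *_f0.npy works) could look like:

# F0-extraction sketch with pyworld as a stand-in for RMVPE.
import numpy as np
import soundfile as sf
import pyworld as pw

wav, sr = sf.read("data/wavs/example.wav")     # hypothetical path
if wav.ndim > 1:
    wav = wav.mean(axis=1)                     # downmix stereo to mono
wav = wav.astype(np.float64)                   # pyworld expects float64
f0, t = pw.harvest(wav, sr)                    # coarse frame-level F0 in Hz
f0 = pw.stonemask(wav, f0, t, sr)              # refine the estimate
np.save("data/wavs/example_f0.npy", f0)        # save as *_f0.npy next to the wav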

Training TCSinger 2

  1. Train the VAE module and duration predictor
python main.py --base configs/ae_singing.yaml -t --gpus 0,1,2,3,4,5,6,7
  2. Train the main TCSinger 2 model
python main.py --base configs/tcsinger2.yaml -t --gpus 0,1,2,3,4,5,6,7

Notes

  • Adjust the compression ratio in the config files (and related scripts).
  • Change the padding length in the dataloader as needed.
  • To train the Custom Audio Encoder, format the data as in ldm/data/joinaudiodataset_con.py, set the trained VAE path in ae_con.yaml, and proceed with training (a sketch of a generic contrastive objective follows this list).
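For reference, a generic InfoNCE-style objective illustrates the kind of contrastive learning involved; this is a sketch of the general technique, not the repo's actual Custom Audio Encoder loss:

# Generic InfoNCE loss: paired embeddings (e.g., audio and its style prompt)
# attract, while all other pairs in the batch act as negatives. A sketch only.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """z_a, z_b: (batch, dim) embeddings of paired views."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Diagonal entries are positives; off-diagonal entries are in-batch negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))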

Inference with TCSinger 2

python scripts/test_sing.py

Replace the checkpoint path and the CFG coefficient as required (a sketch of what the CFG coefficient does follows below). For speech inputs, modify the VAE accordingly.
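For context, a CFG (classifier-free guidance) coefficient typically blends conditional and unconditional model predictions, with larger values following the prompt more closely. A generic sketch, not this repo's exact sampler:

# Generic classifier-free guidance step; cfg_scale = 1.0 reduces to the
# purely conditional prediction, larger values extrapolate past it.
import torch

def cfg_mix(pred_cond: torch.Tensor, pred_uncond: torch.Tensor,
            cfg_scale: float) -> torch.Tensor:
    # Move from the unconditional prediction toward (and beyond) the conditional one.
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)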

Acknowledgements

This implementation uses parts of the code from the following GitHub repos: Make-An-Audio-3, TCSinger, and Lumina-T2X, as described in our code.

Citations

If you find this code useful in your research, please cite our work:

@article{zhang2025tcsinger,
  title={TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis},
  author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Yao, Dongyu and Zhu, Zhiyuan and Jiang, Ziyue and Wang, Yuhan and Jin, Tao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2505.14910},
  year={2025}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's singing voice without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may put you in violation of copyright laws.
