
Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis

Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling

Abstract: This paper proposes IDEA-TTS, an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method that can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To effectively disentangle the environment, speaker, and text factors, we propose an incremental disentanglement process, where an environment estimator is designed to first decompose the environmental spectrogram into an environment mask and an enhanced spectrogram. The environment mask is then processed by an environment encoder to extract environment embeddings, while the enhanced spectrogram facilitates the subsequent disentanglement of the speaker and text factors with the condition of the speaker embeddings, which are extracted from the environmental speech using a pretrained environment-robust speaker encoder. Finally, both the speaker and environment embeddings are conditioned into the decoder for environment-aware speech generation. Experimental results demonstrate that IDEA-TTS achieves superior performance in the environment-aware TTS task, excelling in speech quality, speaker similarity, and environmental similarity. Additionally, IDEA-TTS is also capable of the acoustic environment conversion task and achieves state-of-the-art performance.

We provide our implementation as open source in this repository. Audio samples can be found at the demo website.
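
To make the pipeline described in the abstract more concrete, below is a minimal PyTorch-style sketch of the incremental disentanglement flow. All module names, layer choices, and dimensions here are illustrative assumptions and do not reflect the actual implementation in this repository.

# Minimal sketch of the incremental disentanglement flow from the abstract.
# Every module below is a placeholder, not the code used in IDEA-TTS.
import torch
import torch.nn as nn


class EnvironmentEstimator(nn.Module):
    """Decomposes an environmental spectrogram into an environment mask and an
    enhanced spectrogram (one plausible, multiplicative formulation)."""

    def __init__(self, n_bins=513, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_bins, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_bins, kernel_size=5, padding=2), nn.Sigmoid(),
        )

    def forward(self, env_spec):                   # env_spec: (B, n_bins, T)
        mask = self.net(env_spec)                  # environment mask
        enhanced = env_spec * (1.0 - mask)         # enhanced spectrogram
        return mask, enhanced


class EnvironmentEncoder(nn.Module):
    """Maps the environment mask to a fixed-size environment embedding."""

    def __init__(self, n_bins=513, emb_dim=256):
        super().__init__()
        self.proj = nn.Conv1d(n_bins, emb_dim, kernel_size=3, padding=1)

    def forward(self, mask):                       # mask: (B, n_bins, T)
        return self.proj(mask).mean(dim=-1)        # (B, emb_dim)


def incremental_disentanglement(env_spec, env_estimator, env_encoder, speaker_encoder):
    """Shows only the order of operations: environment factors are separated
    first; the enhanced spectrogram then supports speaker/text disentanglement
    conditioned on speaker embeddings from a pretrained, environment-robust
    speaker encoder; the decoder is conditioned on both embeddings."""
    mask, enhanced = env_estimator(env_spec)
    env_emb = env_encoder(mask)
    spk_emb = speaker_encoder(env_spec)            # pretrained speaker encoder
    # In IDEA-TTS, `enhanced` (conditioned on spk_emb) feeds the VITS-style
    # speaker/text disentanglement, and the decoder takes spk_emb and env_emb.
    return enhanced, spk_emb, env_emb


if __name__ == "__main__":
    spec = torch.randn(2, 513, 100)                 # dummy spectrogram
    dummy_spk = lambda s: s.mean(dim=-1)[:, :256]   # stand-in speaker encoder
    outs = incremental_disentanglement(spec, EnvironmentEstimator(),
                                       EnvironmentEncoder(), dummy_spk)
    print([o.shape for o in outs])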

Pre-requisites

Unfortunately, I was not able to merge the virtual environments of YourTTS and VITS, so two separate conda environments are required below. If you have a solution, please let me know.

  1. Clone this repository
  2. Install Python requirements.
    1. First install the requirements of YourTTS, which is used to extract speaker embeddings:
      $ conda create -n yourtts python=3.9
      $ conda activate yourtts
      $ git clone https://github.com/coqui-ai/TTS
      $ cd TTS
      $ pip install -e .
      
    2. Then install the requirements of IDEA-TTS from requirements.txt:
      $ conda create -n ideatts python=3.9
      $ conda activate ideatts
      $ pip install -r requirements.txt
      
  3. Download datasets
    1. Download and extract the DDS dataset.
    2. Trim the silence from the dataset; the trimmed files will be saved to DDS/VCTK_16k_trimmed and DDS/DAPS_16k_trimmed (see the trimming sketch after this list):
      $ python trim_silence.py
      
  4. Extract speaker embeddings:
    1. Download the checkpoint files of the speaker encoder and pretrained IDEA-TTS, and move them to the checkpoint dir.
    2. Preprocess the dataset to extract speaker embeddings (a hedged sketch of this step appears after this list):
      $ conda activate yourtts
      $ python preprocess_se.py
      
  5. Build Monotonic Alignment Search
    $ cd monotonic_align
    $ python setup.py build_ext --inplace
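
As referenced in step 3.2 above, silence trimming is handled by trim_silence.py. The following is only a rough sketch of what such trimming could look like with librosa; the input directory, output layout, and top_db threshold are assumptions, not the script's actual settings.

# Hypothetical sketch of silence trimming; trim_silence.py may differ in
# thresholds, paths, and naming.
import glob
import os

import librosa
import soundfile as sf

IN_DIR = "DDS/VCTK_16k"             # assumed input directory
OUT_DIR = "DDS/VCTK_16k_trimmed"    # output directory named in the README

for wav_path in glob.glob(os.path.join(IN_DIR, "**", "*.wav"), recursive=True):
    audio, sr = librosa.load(wav_path, sr=16000)
    trimmed, _ = librosa.effects.trim(audio, top_db=30)   # threshold is a guess
    out_path = os.path.join(OUT_DIR, os.path.relpath(wav_path, IN_DIR))
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    sf.write(out_path, trimmed, sr)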
    

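Step 4.2 runs preprocess_se.py in the yourtts environment to extract speaker embeddings. As a rough, non-authoritative illustration of how Coqui TTS's speaker encoder can be driven for this kind of d-vector extraction, here is a sketch; the checkpoint/config file names, output path, and saved format are assumptions, and preprocess_se.py itself may work differently.

# Hedged sketch of d-vector extraction with Coqui TTS's SpeakerManager; the
# file names and output format below are assumptions.
import glob

import torch
from TTS.tts.utils.speakers import SpeakerManager

manager = SpeakerManager(
    encoder_model_path="checkpoints/model_se.pth.tar",   # assumed checkpoint name
    encoder_config_path="checkpoints/config_se.json",    # assumed config name
    use_cuda=torch.cuda.is_available(),
)

embeddings = {}
for wav_path in glob.glob("DDS/VCTK_16k_trimmed/**/*.wav", recursive=True):
    # compute_embedding_from_clip returns a d-vector for one utterance
    embeddings[wav_path] = manager.compute_embedding_from_clip(wav_path)

torch.save(embeddings, "checkpoints/speaker_embeddings.pt")  # assumed output path
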
Training

$ conda activate ideatts
$ CUDA_VISIBLE_DEVICES=0 python train.py -c [config file path] -m [checkpoint file path]

The checkpoint files will be saved under [checkpoint file path]. Here's an example:

$ CUDA_VISIBLE_DEVICES=0 python train.py -c config.json -m checkpoints/IDEA-TTS

Inference

  1. Inference for the environment-aware TTS task
    $ conda activate yourtts
    $ CUDA_VISIBLE_DEVICES=0 python inference_tts.py --checkpoint_file [checkpoint file path] --output_wavs_dir [output dir path]
    
    You can use the pretrained weights we provide. Generated wav files are saved in the [output dir path]. Here is an example:
    $ CUDA_VISIBLE_DEVICES=0 python inference_tts.py --checkpoint_file checkpoints/ckpt_tts/model.pth  --output_wavs_dir generated_files/EA-TTS
    
  2. Inference for the acoustic environment conversion task
    $ conda activate yourtts
    $ CUDA_VISIBLE_DEVICES=0 python inference_ec.py --checkpoint_file [checkpoint file path] --input_text_file [text file path] --output_wavs_dir [output dir path]
    
    Here is an example:
    $ CUDA_VISIBLE_DEVICES=0 python inference_ec.py --checkpoint_file checkpoints/ckpt_tts/model.pth --input_text_file filelists_ec/env2clean.txt --output_wavs_dir generated_files/Env2Clean
    

Acknowledgements

We referred to VITS and YourTTS in our implementation.

Citation

@inproceedings{lu2025incremental,
  title={Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis},
  author={Lu, Ye-Xin and Du, Hui-Peng and Sheng, Zheng-Yan and Ai, Yang and Ling, Zhen-Hua},
  booktitle={Proc. ICASSP},
  year={2025}
}
