[Bug] Short Reference Audio Error #349

UtkuBulkan · 2025-03-25T22:30:21Z

Describe the bug

When I create latents for XTTSv2 with get_conditioning_latents and the reference audio is short, I get the exception shown under Logs. How can I fix this? Is there a workaround?

To Reproduce

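from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# model_directory must end with a path separator because it is concatenated directly below
audio_part = ["short_reference_audio.wav"]
config = XttsConfig()
config.load_json(f"{model_directory}config.json")
model = Xtts.init_from_config(config)
model.to("cpu")
model.load_checkpoint(config, checkpoint_dir=f"{model_directory}", use_deepspeed=False)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=audio_part)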

Expected behavior

It should not throw an exception and create the correct latents.

Logs

`WARNING:root:Traceback: Traceback (most recent call last):
  File "/Users/utku/source/videoo/videoo-voice-cloner/voicecloner.py", line 918, in produce_voice_dubbing
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=audio_part)
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 357, in get_conditioning_latents
    gpt_cond_latents = self.get_gpt_cond_latents(
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 283, in get_gpt_cond_latents
    cond_latent = torch.stack(style_embs).mean(dim=0)
RuntimeError: stack expects a non-empty TensorList`

Environment

{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": null
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.5.1",
        "TTS": "0.25.3",
        "numpy": "1.26.4"
    },
    "System": {
        "OS": "Darwin",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "arm",
        "python": "3.9.20",
        "version": "Darwin Kernel Version 24.2.0: Fri Dec  6 18:40:14 PST 2024; root:xnu-11215.61.5~2/RELEASE_ARM64_T8103"
    }
}

Additional context

No response


eginhard · 2025-03-28T12:47:43Z

> if the reference audio is short

Could you be more specific about the length?

UtkuBulkan · 2025-03-28T12:50:33Z

Yes, audio of 0.5 seconds or shorter reproduces the issue.

UtkuBulkan · 2025-03-30T21:58:37Z

Any idea how I might overcome this, @eginhard?

eginhard · 2025-03-31T09:54:42Z

You could pad your audio with silence until it reaches the minimum that works. I guess it would be possible for Coqui to do this internally or at least provide a more helpful error message about the file being too short, but on the other hand 0.5 seconds is extremely short and means that there is at most a single word in the audio, so the result won't be very good anyway.

UtkuBulkan added the "bug" (Something isn't working) label on Mar 25, 2025