[Bug] Short Reference Audio Error #349

Open
UtkuBulkan opened this issue Mar 25, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@UtkuBulkan

Describe the bug

When I use the code below to create conditioning latents for XTTSv2 with get_conditioning_latents, a short reference audio file triggers the exception shown under Logs. How can I fix this? Is there a workaround?

To Reproduce

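from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# model_directory is assumed to be the XTTSv2 model folder path, ending with a path separator
audio_part = ["short_reference_audio.wav"]  # the short reference clip
config = XttsConfig()
config.load_json(f"{model_directory}config.json")
model = Xtts.init_from_config(config)
model.to("cpu")
model.load_checkpoint(config, checkpoint_dir=model_directory, use_deepspeed=False)
# This call raises the RuntimeError shown under Logs when the clip is too short
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=audio_part)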

Expected behavior

It should create the correct latents without throwing an exception.

Logs

`WARNING:root:Traceback: Traceback (most recent call last):
  File "/Users/utku/source/videoo/videoo-voice-cloner/voicecloner.py", line 918, in produce_voice_dubbing
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=audio_part)
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 357, in get_conditioning_latents
    gpt_cond_latents = self.get_gpt_cond_latents(
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/utku/.pyenv/versions/3.9.20/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 283, in get_gpt_cond_latents
    cond_latent = torch.stack(style_embs).mean(dim=0)
RuntimeError: stack expects a non-empty TensorList`

Environment

{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": null
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.5.1",
        "TTS": "0.25.3",
        "numpy": "1.26.4"
    },
    "System": {
        "OS": "Darwin",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "arm",
        "python": "3.9.20",
        "version": "Darwin Kernel Version 24.2.0: Fri Dec  6 18:40:14 PST 2024; root:xnu-11215.61.5~2/RELEASE_ARM64_T8103"
    }
}

Additional context

No response

@UtkuBulkan added the bug label Mar 25, 2025
@eginhard
Member

> if the reference audio is short

Could you be more specific about the length?

@UtkuBulkan
Author

Yes, reference audio of 0.5 seconds or less reproduces the issue.
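For anyone hitting the same error, a caller-side check can fail early with a clearer message than the torch.stack exception. This is only a sketch: the 0.5 second threshold is the value observed in this thread, not a documented XTTS limit, and the helper names wav_duration_seconds and check_reference_audio are hypothetical. It uses the standard-library wave module, so it only handles uncompressed PCM WAV files.

```python
import wave

# Assumption: threshold taken from the observation in this thread,
# not from any documented XTTS limit.
MIN_REFERENCE_SECONDS = 0.5


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_audio(paths, min_seconds=MIN_REFERENCE_SECONDS):
    """Raise ValueError listing every file whose duration is min_seconds or less."""
    durations = [(p, wav_duration_seconds(p)) for p in paths]
    too_short = [(p, d) for p, d in durations if d <= min_seconds]
    if too_short:
        details = ", ".join(f"{p} ({d:.2f} s)" for p, d in too_short)
        raise ValueError(f"Reference audio too short for conditioning: {details}")
```

Calling check_reference_audio(audio_part) before model.get_conditioning_latents(audio_path=audio_part) would turn the failure into a ValueError that names the offending file.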

@UtkuBulkan
Author

Any idea how I might overcome this, @eginhard?

@eginhard
Member

You could pad your audio with silence until it reaches the minimum length that works. I guess Coqui could do this internally, or at least provide a more helpful error message about the file being too short. On the other hand, 0.5 seconds is extremely short and means that there is at most a single word in the audio, so the result won't be very good anyway.
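A minimal sketch of the padding workaround, using only the standard-library wave module (so uncompressed PCM WAV input is assumed). The function name pad_wav_with_silence and the default min_seconds=1.0 are hypothetical: the thread does not state the minimum length that works, so that value needs to be found by trial.

```python
import wave


def pad_wav_with_silence(src_path, dst_path, min_seconds=1.0):
    """Copy a PCM WAV file, appending trailing silence until it lasts
    at least min_seconds. Returns the duration of the written file in seconds.

    min_seconds=1.0 is a placeholder, not a known XTTS minimum.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)

    # Count the frames actually read rather than trusting the header.
    frame_size = params.sampwidth * params.nchannels
    n_frames = len(frames) // frame_size
    target_frames = int(round(min_seconds * params.framerate))
    missing = max(0, target_frames - n_frames)

    # 8-bit WAV samples are unsigned (silence is 0x80); wider PCM is signed (silence is 0).
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"
    padding = silent_byte * (missing * frame_size)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames + padding)

    return (n_frames + missing) / params.framerate
```

The padded file can then be passed as audio_path=[dst_path] to get_conditioning_latents. Padding only avoids the exception; it adds no speaker information, so the quality caveat above still applies.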
