AudioLLM is a research project that combines a large language model (LLaMA) with an audio encoder (Whisper) to create a multimodal system capable of understanding and responding to both text and audio inputs.
- Combines LLaMA language model with Whisper audio encoder
- Audio projection layer that maps Whisper features to LLaMA embedding space
- Parameter-efficient training with only projector and LoRA layers trainable
- Supports mixed precision training
- Comprehensive training, evaluation and inference pipelines
AudioLLM works by:
- Processing audio inputs through the Whisper encoder
- Projecting audio features into LLaMA's embedding space using a learnable projection layer (sketched below)
- Integrating audio embeddings with text using the special tokens `<audio>` and `</audio>`
- Fine-tuning only the projection layer and LoRA adapters, keeping both base models frozen
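The projection step is the heart of the adapter, so a concrete sketch helps. The module below is a minimal illustration, not the repository's exact implementation: the two-layer MLP design is an assumption, and the default dimensions reflect whisper-large-v3's 1280-dimensional encoder output and Llama-3.2-3B's 3072-dimensional embeddings.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Illustrative projector mapping Whisper encoder features into
    LLaMA's embedding space (a sketch, not the repo's exact module)."""

    def __init__(self, whisper_dim: int = 1280, llama_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, whisper_dim) from the Whisper encoder.
        # Output: (batch, frames, llama_dim), ready to be spliced between the
        # <audio> and </audio> token embeddings in the input sequence.
        return self.proj(audio_features)
```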
```bash
# Clone the repository
git clone https://github.com/cdreetz/audio-llm.git
cd audio-llm

# Create a virtual environment
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
```bash
cd src
./run_librispeech.sh
```
```bash
python download_huggingface.py --max-wer 5.0
# audio files are saved in the ./data/huggingface dir
# examples are saved in instruction_examples.json
```
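The `--max-wer 5.0` flag implies filtering out samples whose transcripts are unreliable. As a rough illustration of that kind of quality gate (assuming the `jiwer` package and hypothetical `reference`/`hypothesis` field names, not the script's actual internals):

```python
import jiwer

def filter_by_wer(examples: list[dict], max_wer: float = 5.0) -> list[dict]:
    """Keep examples whose word error rate (percent) is at or below max_wer.
    The "reference" and "hypothesis" keys are hypothetical; adapt them to
    the actual schema used by download_huggingface.py."""
    kept = []
    for ex in examples:
        wer_percent = jiwer.wer(ex["reference"], ex["hypothesis"]) * 100
        if wer_percent <= max_wer:
            kept.append(ex)
    return kept
```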
```bash
python train.py \
    --data_path /path/to/dataset.json \
    --audio_dir /path/to/audio/files \
    --llama_path meta-llama/Llama-3.2-3B-Instruct \
    --whisper_path openai/whisper-large-v3-turbo \
    --output_dir ./checkpoints \
    --batch_size 8 \
    --num_epochs 5 \
    --learning_rate 5e-5 \
    --fp16 \
    --use_wandb
```
See `train.py` for additional training options.
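Because only the projector and LoRA adapters are trainable, the setup boils down to freezing both base models and attaching adapters. Here is a hedged sketch using Hugging Face's `peft`; the rank, alpha, and target modules are illustrative choices, not the repository's actual configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, WhisperModel

# Load the base models named in the training command above.
llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
whisper = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo")

# Freeze both base models; only the projector and LoRA weights will train.
for param in llama.parameters():
    param.requires_grad = False
for param in whisper.parameters():
    param.requires_grad = False

# Attach LoRA adapters to the attention projections (illustrative hyperparameters).
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llama = get_peft_model(llama, lora_config)
llama.print_trainable_parameters()  # confirms only LoRA weights remain trainable
```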
Run inference with:

```bash
python inference.py
# TODO: inference options not yet documented
```
The dataset should be a JSON file with entries in the following format:
```json
[
  {
    "text": "Describe the audio: <audio>",
    "audio_paths": "sample1.wav",
    "response": "This is a recording of a piano playing."
  },
  {
    "text": "What can you hear in this recording? <audio>",
    "audio_paths": "sample2.wav",
    "response": "This recording contains a person speaking in English."
  }
]
```
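Loading and sanity-checking this format takes only a few lines. The helper below is illustrative (not part of the repository); it resolves each `audio_paths` entry against the directory passed to `train.py` via `--audio_dir`:

```python
import json
from pathlib import Path

def load_dataset(data_path: str, audio_dir: str) -> list[dict]:
    """Load instruction examples and resolve audio paths (illustrative helper)."""
    with open(data_path) as f:
        examples = json.load(f)
    audio_root = Path(audio_dir)
    for ex in examples:
        # Every prompt must carry the <audio> placeholder for the adapter.
        assert "<audio>" in ex["text"], f"missing <audio> token in: {ex['text']}"
        ex["audio_paths"] = str(audio_root / ex["audio_paths"])
    return examples

examples = load_dataset("dataset.json", "./data/huggingface")
print(examples[0]["text"], "->", examples[0]["response"])
```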
If you use this code in your research, please cite:
```bibtex
@misc{audioLLM2025,
  title        = {AudioLLM: Multimodal Adapter for Audio Understanding},
  year         = {2025},
  howpublished = {\url{https://github.com/cdreetz/audio-llm}},
}
```
Contributions are welcome! Please feel free to submit a Pull Request.