AudioLLM is a research project that combines a large language model (LLaMA) with an audio encoder (Whisper) to create a multimodal system capable of understanding and responding to both text and audio inputs.
- Combines LLaMA language model with Whisper audio encoder
- Audio projection layer that maps Whisper features to LLaMA embedding space
- Parameter-efficient training with only projector and LoRA layers trainable
- Supports mixed precision training
- Comprehensive training, evaluation and inference pipelines
AudioLLM works by:
- Processing audio inputs through the Whisper encoder
- Projecting audio features into LLaMA's embedding space using a learnable projection layer (sketched below)
- Integrating audio embeddings with text using the special tokens `<audio>` and `</audio>`
- Fine-tuning only the projection layer and LoRA adapters, keeping both base models frozen
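The projection step is the heart of the adapter, so a concrete sketch helps. The module below is a minimal illustration, not the repository's exact implementation: the two-layer MLP design is an assumption, and the default dimensions reflect whisper-large-v3's 1280-dimensional encoder output and Llama-3.2-3B's 3072-dimensional embeddings.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Illustrative projector mapping Whisper encoder features into
    LLaMA's embedding space (a sketch, not the repo's exact module)."""

    def __init__(self, whisper_dim: int = 1280, llama_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, whisper_dim) from the Whisper encoder.
        # Output: (batch, frames, llama_dim), ready to be spliced between the
        # <audio> and </audio> token embeddings in the input sequence.
        return self.proj(audio_features)
```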
```bash
# Clone the repository
git clone https://github.com/cdreetz/audio-llm.git
cd audio-llm

# Create a virtual environment
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
```bash
cd src
./run_librispeech.sh
```
```bash
python download_huggingface.py --max-wer 5.0
# audio files are saved in the ./data/huggingface dir
# examples are saved in instruction_examples.json
```
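The `--max-wer 5.0` flag implies filtering out samples whose transcripts are unreliable. As a rough illustration of that kind of quality gate (assuming the `jiwer` package and hypothetical `reference`/`hypothesis` field names, not the script's actual internals):

```python
import jiwer

def filter_by_wer(examples: list[dict], max_wer: float = 5.0) -> list[dict]:
    """Keep examples whose word error rate (percent) is at or below max_wer.
    The "reference" and "hypothesis" keys are hypothetical; adapt them to
    the actual schema used by download_huggingface.py."""
    kept = []
    for ex in examples:
        wer_percent = jiwer.wer(ex["reference"], ex["hypothesis"]) * 100
        if wer_percent <= max_wer:
            kept.append(ex)
    return kept
```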
```bash
python train.py \
    --data_path /path/to/dataset.json \
    --audio_dir /path/to/audio/files \
    --llama_path meta-llama/Llama-3.2-3B-Instruct \
    --whisper_path openai/whisper-large-v3-turbo \
    --output_dir ./checkpoints \
    --batch_size 8 \
    --num_epochs 5 \
    --learning_rate 5e-5 \
    --fp16 \
    --use_wandb
```
See `train.py` for additional training options.
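Because only the projector and LoRA adapters are trainable, the setup boils down to freezing both base models and attaching adapters. Here is a hedged sketch using Hugging Face's `peft`; the rank, alpha, and target modules are illustrative choices, not the repository's actual configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, WhisperModel

# Load the base models named in the training command above.
llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
whisper = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo")

# Freeze both base models; only the projector and LoRA weights will train.
for param in llama.parameters():
    param.requires_grad = False
for param in whisper.parameters():
    param.requires_grad = False

# Attach LoRA adapters to the attention projections (illustrative hyperparameters).
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llama = get_peft_model(llama, lora_config)
llama.print_trainable_parameters()  # confirms only LoRA weights remain trainable
```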
Run inference with:

```bash
python inference.py
# TODO: inference options not yet documented
```
The dataset should be a JSON file with entries in the following format:
```json
[
  {
    "text": "Describe the audio: <audio>",
    "audio_paths": "sample1.wav",
    "response": "This is a recording of a piano playing."
  },
  {
    "text": "What can you hear in this recording? <audio>",
    "audio_paths": "sample2.wav",
    "response": "This recording contains a person speaking in English."
  }
]
```
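Loading and sanity-checking this format takes only a few lines. The helper below is illustrative (not part of the repository); it resolves each `audio_paths` entry against the directory passed to `train.py` via `--audio_dir`:

```python
import json
from pathlib import Path

def load_dataset(data_path: str, audio_dir: str) -> list[dict]:
    """Load instruction examples and resolve audio paths (illustrative helper)."""
    with open(data_path) as f:
        examples = json.load(f)
    audio_root = Path(audio_dir)
    for ex in examples:
        # Every prompt must carry the <audio> placeholder for the adapter.
        assert "<audio>" in ex["text"], f"missing <audio> token in: {ex['text']}"
        ex["audio_paths"] = str(audio_root / ex["audio_paths"])
    return examples

examples = load_dataset("dataset.json", "./data/huggingface")
print(examples[0]["text"], "->", examples[0]["response"])
```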
If you use this code in your research, please cite:
```bibtex
@misc{audioLLM2025,
  title        = {AudioLLM: Multimodal Adapter for Audio Understanding},
  year         = {2025},
  howpublished = {\url{https://github.com/cdreetz/audio-llm}},
}
```
Contributions are welcome! Please feel free to submit a Pull Request.