Official PyTorch implementation of "REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding" [ICCV 2025 under review].
This repository contains the official implementation and dataset of the following paper:
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Abstract: Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Meanwhile, current MLLMs that use latent embeddings for visual task decoding generally show limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units such as box, keypoint, depth, and mask. The combination of different visual prompts and visual units yields a wide variety of task types, significantly expanding the applicability of REF-VLM. Both qualitative and quantitative experiments demonstrate that REF-VLM outperforms other MLLMs across a variety of standard benchmarks.
- Release the training and inference code.
- Release the checkpoints.
- Release the VT-Instruct dataset.
- Release the demo.
- This project is built on XTuner. Please refer to the official documentation of XTuner and the toolkits listed below for installation guidance (see the example commands after this list).
- Dataset loading is based on detectron2.
- MMDetection
- COCO 2018 Panoptic Segmentation Task API (panopticapi)
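A rough installation sketch for the toolkits above (package sources are their standard channels; exact versions and CUDA builds are not pinned by this repository, so adjust as needed):

pip install -U xtuner accelerate
pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install -U openmim && mim install mmdet
pip install 'git+https://github.com/cocodataset/panopticapi.git'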
accelerate config
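accelerate config walks through an interactive prompt for the distributed-training setup. If you prefer to skip the prompts, accelerate also provides a non-interactive default configuration (a convenience alternative, not a requirement of this repo):

accelerate config default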
Coming soon.
REF-VLM/
├── checkpoints
├── vicuna_7b
├── stage1
├── instances.json
├── refs(unc).p
├── stage2
└── hf_model
Coming soon.
To launch a Gradio web demo, use the following command. Note that the model runs inference in torch.float16, which requires a GPU with at least 16 GB of memory.
python demo/app.py --config /path/to/config
After preparing the data, you can train the model with the following commands, one per training stage:
NPROC_PER_NODE=8 xtuner train configs/train_stage1.py --deepspeed deepspeed_zero2
NPROC_PER_NODE=8 xtuner train configs/train_stage2.py --deepspeed deepspeed_zero2
NPROC_PER_NODE=8 xtuner train configs/train_stage3_keypoint.py --deepspeed deepspeed_zero2
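After training, the checkpoint saved by XTuner can be exported to a HuggingFace-format model with XTuner's standard conversion command; the config and paths below are placeholders, since this repository does not yet document the conversion step:

xtuner convert pth_to_hf configs/train_stage2.py /path/to/saved_checkpoint.pth /path/to/hf_model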