CLIP-DisDiff: CLIP-Guided Disentanglement of Diffusion Models

License: MIT

This repository extends DisDiff [paper] [code]—a method for disentangling diffusion probabilistic models—by integrating OpenAI's CLIP for multimodal guidance. CLIP-DisDiff leverages both text and image encoders from CLIP to achieve more flexible and controllable latent factor disentanglement in diffusion models.


Overview

What is DisDiff?

DisDiff disentangles the gradient fields of a pretrained diffusion probabilistic model (e.g., DDPM) into separate sub-gradients, each corresponding to a latent factor (or concept). Once disentangled, you can individually manipulate these factors for more targeted image generation.
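
Conceptually, each latent factor contributes its own sub-gradient on top of the pretrained model's prediction. The toy sketch below only illustrates that decomposition; the function and argument names are placeholders, not DisDiff's actual implementation or objective.

# Toy illustration of the DisDiff idea (not the paper's code or exact objective):
# the pretrained DPM produces an unconditional noise estimate, and a learned
# decoder adds one sub-gradient per latent factor z_k on top of it.
def per_factor_predictions(pretrained_eps, decoder, x_t, t, factors):
    eps = pretrained_eps(x_t, t)          # frozen, unconditional estimate
    return [eps + decoder(x_t, t, z_k)    # estimate conditioned on factor k only
            for z_k in factors]           # one prediction per latent factor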

How Does CLIP-DisDiff Differ?

CLIP-DisDiff replaces DisDiff's image-based encoder with CLIP's text encoder (and, optionally, its image encoder) to guide the disentanglement process. Specifically, as sketched in the example after this list:

  1. Text Attributes
    Instead of splitting a single image into multiple latent vectors, we directly feed text attributes (e.g., color, style) into the CLIP text encoder. Each attribute is encoded into a latent vector, naturally mapping "one text attribute → one latent vector."

  2. Decoder with Conditioned Gradients
    Each latent vector from CLIP is passed into the decoder (G_ψ), which computes the conditioned gradients for diffusion steps. This results in separate image predictions reflecting each text attribute.

  3. Disentangling Loss
    As in DisDiff, a disentangling loss encourages independence among the latent factors. In addition, CLIP's image encoder can optionally be used to align the generated images with the text embeddings more explicitly.
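
The sketch below illustrates the "one text attribute → one latent vector" mapping from item 1 and the optional image-text alignment from item 3, using the openai/CLIP package directly. The attribute strings and variable names are placeholders; in this repository the conditioning is handled by the FrozenCLIPTextEmbedder configured in the YAML files shown later.

# Minimal sketch (not this repo's training code): encode each text attribute into
# its own latent vector with CLIP, and optionally score how well a generated image
# aligns with those attributes using CLIP's image encoder.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

attributes = ["red hair", "wearing glasses", "smiling"]   # placeholder attributes
tokens = clip.tokenize(attributes).to(device)

with torch.no_grad():
    z = model.encode_text(tokens)          # one latent per attribute, shape (3, 768)
z = z / z.norm(dim=-1, keepdim=True)

# Optional alignment (item 3), assuming `generated_pil` is a PIL.Image from the model:
# image = preprocess(generated_pil).unsqueeze(0).to(device)
# with torch.no_grad():
#     img_feat = model.encode_image(image)
#     img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
#     alignment = img_feat @ z.T            # cosine similarity per attribute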


Figure 1: High-level architecture of CLIP-DisDiff.


Requirements

A sample Conda environment file is included (environment.yaml). To create and activate the environment:

conda env create -f environment.yaml
conda activate clip-disdiff

Additionally, install the following:

pip install torch torchvision
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
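
If the packages installed correctly, a quick sanity check such as the following should list the available CLIP models and load ViT-L/14, the variant used in the configs below (weights are downloaded on first use):

import clip

print(clip.available_models())                    # should include 'ViT-L/14'
model, preprocess = clip.load("ViT-L/14", device="cpu")
print("CLIP loaded:", type(model).__name__)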

Installation

  1. Clone this repository:
git clone https://github.com/yu12ki04/CLIP-DisDiff.git
cd CLIP-DisDiff
  2. Install in editable mode:
pip install -e .

Usage

1. Train a Base Autoencoder (Optional)

If you plan to train a latent diffusion model (LDM) from scratch, you may need a base VAE or VQ-VAE. For instance:

python main.py --base configs/autoencoder/example_vq_4_16.yaml -t --gpus 0,

Adjust the config file as needed.

2. Train a Latent Diffusion Model (LDM)

After obtaining a pretrained autoencoder, specify its checkpoint path in the LDM config and run:

python main.py --base configs/latent-diffusion/example_vq_4_16.yaml -t --gpus 0,

3. Train CLIP-DisDiff

Specify both the pretrained LDM/VAE checkpoints and the CLIP encoder parameters in your config (e.g., configs/latent-diffusion/example_vq_4_16_clip-dis.yaml). Then:

python main.py \
  --base configs/latent-diffusion/example_vq_4_16_clip-dis.yaml \
  -t --gpus 0, \
  -l exp_clip_disdiff \
  -n experiment_name \
  -s 0
  • -l sets the log directory
  • -n sets the experiment name
  • -s sets the random seed (optional)

4. Evaluation

As in DisDiff, evaluation can be performed with:

python run_para_metrics.py -l exp_clip_disdiff -p 10

Adjust arguments as needed (e.g., log directory, number of processes).

Model Configuration Example

Below is a simplified example of a config snippet integrating CLIP:

model:
  base_learning_rate: 1e-4
  params:
    # Checkpoint paths
    ckpt_path: "/path/to/pretrained_ldm"
    first_stage_config:
      ckpt_path: "/path/to/pretrained_vae"
    
    cond_stage_config:
      # CLIP text encoder
      text_encoder:
        target: ldm.modules.encoders.modules.FrozenCLIPTextEmbedder
        params:
          version: "ViT-L/14"

      # CLIP image encoder
      image_encoder:
        target: ldm.modules.encoders.modules.FrozenClipImageEmbedder
        params:
          model: "ViT-L/14"

    # Disentangling-specific settings
    ...
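
For reference, configs in this layout are typically instantiated with OmegaConf and the instantiate_from_config helper inherited from the latent-diffusion codebase. The sketch below loads just the text-encoder sub-config by hand; the config path and key layout are assumptions based on the snippet above.

# Sketch only: manually instantiate the CLIP text encoder described in the YAML above.
# Assumes the upstream ldm.util.instantiate_from_config helper and the config path below.
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

cfg = OmegaConf.load("configs/latent-diffusion/example_vq_4_16_clip-dis.yaml")
text_cfg = cfg.model.params.cond_stage_config.text_encoder
text_encoder = instantiate_from_config(text_cfg)   # -> FrozenCLIPTextEmbedder("ViT-L/14")

# Usage (assumes the embedder's default device, typically CUDA):
# z = text_encoder(["red hair", "wearing glasses"])  # one embedding per attribute string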

Citation

If you use this code in your research, please cite the original DisDiff paper, along with CLIP and Latent Diffusion Models:

@inproceedings{yang2023disdiff,
  title={DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models},
  author={Yang, Tao and Wang, Yuwang and Lu, Yan and Zheng, Nanning},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

@misc{radford2021clip,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  howpublished={arXiv preprint arXiv:2103.00020},
  year={2021}
}

@inproceedings{rombach2022high,
  title={High-Resolution Image Synthesis with Latent Diffusion Models},
  author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Acknowledgements

This project builds upon:

  • DisDiff (ThomasMrY/DisDiff), the original method for disentangling diffusion probabilistic models
  • OpenAI CLIP (openai/CLIP), used for text and image encoding
  • Latent Diffusion Models (CompVis/latent-diffusion), from which the ldm codebase is derived

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or feedback, please open an issue in this repository.

Repository

The code is available at https://github.com/yu12ki04/CLIP-DisDiff.
