This repository extends DisDiff [paper] [code]—a method for disentangling diffusion probabilistic models—by integrating OpenAI's CLIP for multimodal guidance. CLIP-DisDiff leverages both text and image encoders from CLIP to achieve more flexible and controllable latent factor disentanglement in diffusion models.
DisDiff disentangles the gradient fields of a pretrained diffusion probabilistic model (e.g., DDPM) into separate sub-gradients, each corresponding to a latent factor (or concept). Once disentangled, you can individually manipulate these factors for more targeted image generation.
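For intuition only, here is a conceptual sketch (not the repository's actual implementation) of composing a conditioned noise prediction from the pretrained model's output plus per-factor sub-gradient terms, and of applying a single factor in isolation:

```python
# Conceptual illustration only (not the repository's actual code): compose a
# conditioned noise prediction from the pretrained DPM's output plus per-factor
# sub-gradient terms, and apply a single factor in isolation.
import torch

def compose_eps(eps_uncond, factor_grads, active=None):
    # eps_uncond:   (B, C, H, W) prediction of the pretrained diffusion model
    # factor_grads: list of (B, C, H, W) sub-gradient fields, one per latent factor
    # active:       indices of factors to apply; None applies all of them
    idx = range(len(factor_grads)) if active is None else active
    return eps_uncond + sum(factor_grads[i] for i in idx)

eps = torch.randn(1, 3, 64, 64)
grads = [torch.randn(1, 3, 64, 64) for _ in range(3)]  # e.g., 3 latent factors
all_factors = compose_eps(eps, grads)                  # condition on all factors
single_factor = compose_eps(eps, grads, active=[0])    # manipulate one factor only
print(all_factors.shape, single_factor.shape)
```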
CLIP-DisDiff replaces the original image-based encoder with CLIP's text encoder (and optionally the image encoder) to guide the disentanglement process. Specifically:
- **Text Attributes**: Instead of splitting a single image into multiple latent vectors, we directly feed text attributes (e.g., color, style) into the CLIP text encoder. Each attribute is encoded into a latent vector, naturally mapping "one text attribute → one latent vector" (see the encoding sketch below).
- **Decoder with Conditioned Gradients**: Each latent vector from CLIP is passed into the decoder (G_ψ), which computes the conditioned gradients for the diffusion steps, yielding separate image predictions that reflect each text attribute.
- **Disentangling Loss**: As in DisDiff, a disentangling loss encourages independence among the latent factors. Here, CLIP's image encoder can optionally be used as well, to align text embeddings with the generated images more explicitly.
Figure 1: High-level architecture of CLIP-DisDiff.
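As a minimal illustration of the "one text attribute → one latent vector" mapping, the snippet below encodes a few hypothetical attribute strings with the official CLIP package; the attribute names and the random stand-in image for the optional alignment check are illustrative, not the repository's actual training code:

```python
# Minimal sketch of the "one text attribute -> one latent vector" mapping using
# the official CLIP package. The attribute strings are illustrative only, and
# the random image in the alignment check stands in for a generated sample.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

attributes = ["red color", "oil painting style", "smiling face"]  # hypothetical
tokens = clip.tokenize(attributes).to(device)

with torch.no_grad():
    z = model.encode_text(tokens)          # (3, 768) for ViT-L/14
    z = z / z.norm(dim=-1, keepdim=True)   # one latent vector per attribute

    # Optional: align text embeddings with an image via CLIP's image encoder,
    # as mentioned for the optional alignment term in the disentangling loss.
    image = torch.randn(1, 3, 224, 224, device=device).to(model.dtype)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    similarity = z.float() @ img_feat.float().T  # one score per attribute

print(z.shape, similarity.shape)
```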
A sample Conda environment file is included (`environment.yaml`). To create and activate the environment:
```bash
conda env create -f environment.yaml
conda activate clip-disdiff
```
Additionally, install the following:
```bash
pip install torch torchvision
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
```
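To quickly verify that the CLIP package is installed correctly:

```python
# Sanity check: the CLIP package imports and can list its pretrained models.
import clip
print(clip.available_models())  # should include "ViT-L/14"
```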
- Clone this repository:

  ```bash
  git clone https://github.com/yu12ki04/CLIP-DisDiff.git
  cd CLIP-DisDiff
  ```

- Install in editable mode:

  ```bash
  pip install -e .
  ```
If you plan to train a latent diffusion model (LDM) from scratch, you may need a base VAE or VQ-VAE. For instance:
```bash
python main.py --base configs/autoencoder/example_vq_4_16.yaml -t --gpus 0,
```
Adjust the config file as needed.
After obtaining a pretrained autoencoder, specify its checkpoint path in the LDM config and run:
```bash
python main.py --base configs/latent-diffusion/example_vq_4_16.yaml -t --gpus 0,
```
Specify both the pretrained LDM/VAE checkpoints and the CLIP encoder parameters in your config (e.g., configs/latent-diffusion/example_vq_4_16_clip-dis.yaml). Then:
```bash
python main.py \
  --base configs/latent-diffusion/example_vq_4_16_clip-dis.yaml \
  -t --gpus 0, \
  -l exp_clip_disdiff \
  -n experiment_name \
  -s 0
```
- `-n` sets the experiment name
- `-s` sets a random seed (optional)
Similar to DisDiff, evaluation can be performed with:
```bash
python run_para_metrics.py -l exp_clip_disdiff -p 10
```
Adjust arguments as needed (e.g., log directory, number of processes).
Below is a simplified example of a config snippet integrating CLIP:
```yaml
model:
  base_learning_rate: 1e-4
  params:
    # Checkpoint paths
    ckpt_path: "/path/to/pretrained_ldm"
    first_stage_config:
      ckpt_path: "/path/to/pretrained_vae"
    # CLIP text encoder
    cond_stage_config:
      text_encoder:
        target: ldm.modules.encoders.modules.FrozenCLIPTextEmbedder
        params:
          version: "ViT-L/14"
      # CLIP image encoder
      image_encoder:
        target: ldm.modules.encoders.modules.FrozenClipImageEmbedder
        params:
          model: "ViT-L/14"
    # Disentangling-specific settings
    ...
```
The model can be configured through the YAML config files. Key parameters for CLIP integration:
```yaml
model:
  params:
    cond_stage_config:
      text_encoder:
        target: ldm.modules.encoders.modules.FrozenCLIPTextEmbedder
        params:
          version: 'ViT-L/14'
      image_encoder:
        target: ldm.modules.encoders.modules.FrozenClipImageEmbedder
        params:
          model: 'ViT-L/14'
```
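For reference, here is a small sketch of how the CLIP-related config values might be inspected or overridden programmatically, assuming the OmegaConf-based configs used by the latent-diffusion codebase this project builds on (the output path is hypothetical):

```python
# Sketch: inspect or override the CLIP-related config programmatically. Assumes
# OmegaConf-based configs as in the latent-diffusion codebase; the output path
# below is hypothetical.
from omegaconf import OmegaConf

config = OmegaConf.load("configs/latent-diffusion/example_vq_4_16_clip-dis.yaml")
print(OmegaConf.to_yaml(config.model.params.cond_stage_config))

# Example: switch to a smaller CLIP text encoder before launching training.
config.model.params.cond_stage_config.text_encoder.params.version = "ViT-B/32"
OmegaConf.save(config, "configs/latent-diffusion/my_experiment.yaml")
```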
If you use this code in your research, please cite the following works:
```bibtex
@inproceedings{yang2023disdiff,
  title={DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models},
  author={Yang, Tao and Wang, Yuwang and Lu, Yan and Zheng, Nanning},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

@misc{radford2021clip,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  howpublished={arXiv preprint arXiv:2103.00020},
  year={2021}
}

@inproceedings{rombach2022high,
  title={High-Resolution Image Synthesis with Latent Diffusion Models},
  author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
```
This project builds upon:
- DisDiff (Yang et al., NeurIPS 2023)
- OpenAI CLIP (Radford et al., 2021)
- Latent Diffusion Models (Rombach et al., CVPR 2022)
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or feedback, please:
- Open an issue in this repository
- Contact: your-email@example.com
The code is available at: https://github.com/yu12ki04/CLIP-DisDiff