The code can be run in an environment with Python 3.10, PyTorch 2.1.2, and CUDA 11.8. (It should also work with other versions, but we have not tested them.)
We recommend using Miniconda to set up an environment:
conda create --name inverse_painting python=3.10
conda activate inverse_painting
Install the required packages:
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
Install LLaVA:
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
cd ..
Install xformers:
pip3 install -U xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118
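To quickly check that the environment is set up correctly, you can run a small sanity check like the following (this snippet is just for verification and is not part of the codebase):
# sanity_check.py -- verify the installed stack
import torch
import torchvision
import xformers

print("torch:", torch.__version__)                # expected 2.1.2+cu118
print("torchvision:", torchvision.__version__)    # expected 0.16.2+cu118
print("xformers:", xformers.__version__)          # expected 0.0.23.post1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))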
We provide demo code to run our pretrained models on any target landscape painting.
Download pretrained models either from Huggingface or Google Drive, and then put them into the root folder. We recommend using the following commands for downloading from Huggingface:
git lfs install
git clone https://huggingface.co/boweiche/inverse_painting
After downloading, the pretrained models should be organized as follows:
./checkpoints/
|-- renderer
|-- RP
|-- TP_llava
|-- TP_llava_annotator # optional, only required for training.
./base_ckpt/
|-- clip-vit-base-patch32
|-- realisticVisionV51_v51VAE
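Before running the demo, you can optionally verify the folder layout with a small sketch like this (hypothetical helper, not part of the codebase):
# check_ckpts.py -- hypothetical sanity check for the expected folder layout
from pathlib import Path

expected = [
    "checkpoints/renderer",
    "checkpoints/RP",
    "checkpoints/TP_llava",
    "base_ckpt/clip-vit-base-patch32",
    "base_ckpt/realisticVisionV51_v51VAE",
]
for p in expected:
    status = "ok" if Path(p).is_dir() else "MISSING"
    print(f"{status:7s} {p}")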
For the demo, we provide several target paintings in ./data/demo. You can run the demo code using the following command:
python demo.py
The generated results will be saved in results.
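If the demo saves the painting process as a sequence of frames (an assumption; adjust the paths and file pattern to the actual output format), you can assemble them into a GIF for quick inspection, for example:
# make_gif.py -- hypothetical helper; assumes per-step frames under results/
from pathlib import Path
from PIL import Image

frame_paths = sorted(Path("results").glob("**/*.jpg"))   # adjust pattern to the actual output
frames = [Image.open(p).convert("RGB") for p in frame_paths]
if frames:
    frames[0].save("process.gif", save_all=True,
                   append_images=frames[1:], duration=200, loop=0)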
The text generator, mask generator, and renderer are trained separately. You can train these models simultaneously because the GT text and mask instructions, rather than the predicted ones, are used to train the mask generator and renderer.
We provide an example of sample data in data/sample_data. Below is the data structure before running the data pre-processing.
./data/sample_data/train
|-- rgb/ # folders of training samples
|-- example/ # name of the sample
|-- {ind}_{time}.jpg # name of each frame; {ind} is the frame index and {time} is the timestamp of the frame within the video.
|-- last_aligned_frame_inv.json # a JSON file that defines which frame is the target painting.
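For reference, here is a minimal sketch of how the {ind}_{time}.jpg naming can be parsed and sorted (a hypothetical helper; it assumes {time} is a number of seconds, which may differ from the actual format):
# Hypothetical example: parse and sort frames named {ind}_{time}.jpg
from pathlib import Path

def parse_frame(path: Path):
    ind, time = path.stem.split("_", 1)   # e.g. "0003_12.5" -> ("0003", "12.5")
    return int(ind), float(time)

frames = sorted(Path("data/sample_data/train/rgb/example").glob("*.jpg"),
                key=lambda p: parse_frame(p)[0])
for p in frames:
    ind, t = parse_frame(p)
    print(f"frame {ind} at t={t}s: {p.name}")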
The following steps help you pre-process this sample data for training, including preparing the GT text and mask instructions.
You can refer to the code in data_processing. We also provide our processed data in data/sample_data_processed for your reference.
This step prepares the ground truth (GT) text instructions by feeding both the GT current canvas and the GT next canvas into the LLaVA model.
In this codebase, rather than using the off-the-shelf pretrained LLaVA model, we use our fine-tuned version of LLaVA for more accurate GT text generation. If you have not downloaded TP_llava_annotator in the previous step, you can download it from Google Drive and put it into the checkpoints folder. This model has been fine-tuned on the GT current image, GT next image, and GT text from our dataset, with any inaccurate GT text manually corrected.
You can now run the following commands to prepare the GT text instructions.
cd data_processing/run_llava
python main.py --save_vis --model_path ../../checkpoints/TP_llava_annotator # you can remove --save_vis if you don't want the visualization
python make_list.py # prepare the data format for the training of text generator
cd ../../
The generated text will be saved in data/sample_data/train/text and data/sample_data/train/text_vis (if you use --save_vis). The training data in LLaVA format is in data/sample_data/train/llava_image and data/sample_data/train/llava_json.json.
This step prepares the GT mask instructions by computing the LPIPS difference between the GT current and next canvas.
You can now run the following commands to prepare the GT mask instructions.
cd data_processing/run_lpips
python main.py --save_vis # you can remove --save_vis if you don't want the visualization
cd ../../
The generated mask will be saved in data/sample_data/train/lpips and data/sample_data/train/lpips_vis (if you use --save_vis).
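Conceptually, the mask is derived from a per-pixel LPIPS map between the current and next canvas. Below is an illustrative sketch using the lpips package; the preprocessing and threshold are assumptions and not the exact values used by main.py:
# Illustrative sketch of an LPIPS-based change mask (not the exact main.py logic)
import lpips
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),   # map to [-1, 1]
])

cur = to_tensor(Image.open("current.jpg").convert("RGB")).unsqueeze(0)
nxt = to_tensor(Image.open("next.jpg").convert("RGB")).unsqueeze(0)

loss_fn = lpips.LPIPS(net="alex", spatial=True)   # spatial=True -> per-pixel map
with torch.no_grad():
    diff = loss_fn(cur, nxt)                      # shape (1, 1, H, W)
mask = (diff > 0.1).float()                       # illustrative threshold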
The training code is in training_scripts. The following three models can be trained in any order because they do not depend on each other.
cd training_scripts
bash train_text_generator.sh # This trains a LoRA of LLaVA, saved in `./checkpoints/llava-v1.5-7b-task-lora`. It will finish very quickly because the sample dataset is very small.
bash merge_ckpt.sh # After training, merge the LoRA with the base model; the merged model is saved in `./checkpoints/llava-v1.5-7b-task-lora_final`
cd ..
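For intuition, merging a LoRA into its base model generally looks like the following with the peft API; merge_ckpt.sh wraps LLaVA's own merge script, and the base-model path below is a placeholder:
# Generic illustration of LoRA merging with peft (not the exact merge_ckpt.sh logic)
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-llava")   # placeholder path
lora = PeftModel.from_pretrained(base, "./checkpoints/llava-v1.5-7b-task-lora")
merged = lora.merge_and_unload()      # fold the LoRA deltas into the base weights
merged.save_pretrained("./checkpoints/llava-v1.5-7b-task-lora_final")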
cd training_scripts
torchrun --nnodes=1 --nproc_per_node=1 --master_port=25678 train_mask_generator.py --config ../configs/train/train_mask_gen.yaml
cd ..
This trains a U-Net with cross-attention layers, saved in ./outputs/mask_gen.
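For intuition, a toy diffusers UNet2DConditionModel with cross-attention conditioning looks like this; the dimensions are arbitrary and much smaller than the actual model, which is configured by train_mask_gen.yaml:
# Toy illustration of a UNet with cross-attention conditioning (not the actual config)
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=64, in_channels=4, out_channels=4,
    block_out_channels=(32, 64), layers_per_block=1,
    down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D"),
    cross_attention_dim=32,
)
latents = torch.randn(1, 4, 64, 64)
cond = torch.randn(1, 8, 32)          # e.g. instruction embeddings fed via cross-attention
out = unet(latents, timestep=10, encoder_hidden_states=cond).sample
print(out.shape)                      # torch.Size([1, 4, 64, 64])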
cd training_scripts
torchrun --nnodes=1 --nproc_per_node=1 --master_port=12678 train_renderer.py --config ../configs/train/train_renderer.yaml
cd ..
The output will be saved in ./outputs/renderer.
This codebase is adapted from diffusers, Open-AnimateAnyone, and LLaVA.
We tested this codebase on a single NVIDIA A40 GPU. The results produced by this code might differ slightly when running on a different machine.
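If you want runs on your own machine to be as repeatable as possible, seeding the relevant generators helps, although bitwise-identical results across machines are still not guaranteed:
# Reduce run-to-run variation (does not guarantee identical results across machines)
import random
import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False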
If you find our work useful for your research, please consider citing the paper:
@inproceedings{chen2024inverse,
title={Inverse Painting: Reconstructing The Painting Process},
author={Chen, Bowei and Wang, Yifan and Curless, Brian and Kemelmacher-Shlizerman, Ira and Seitz, Steven M},
booktitle={SIGGRAPH Asia 2024 Conference Papers},
year={2024}
}