1. We propose a tuning-free image customization framework that enables content manipulation in the given region(s) of an image according to user-provided example images and text descriptions.
2. We propose a self-attention blending strategy for content customization, which addresses the unintended changes to non-target areas seen in previous image editing methods and achieves precise editing of specific subject attributes.
3. Our method outperforms previous approaches in human and quantitative evaluations, providing an efficient solution for numerous practical applications such as image synthesis, design, and creative photography.
The pipeline of our method. Given an image $I$ to be edited and the target region(s) $R$ that need editing, our goal is to synthesize an image $I_e$ that not only contains the subject from the reference image(s) $I_r$ but also satisfies the text description $T$, in a tuning-free manner. The text $T$ controls the attributes of the customized subject in $R$. This is a challenging task due to the following issues: (1) maintaining consistency in the non-target region between $I$ and $I_e$; (2) ensuring semantic coherence between the generated subject and the reference subject in the target region; (3) accurately controlling the attributes of the generated subject according to the text description without altering other content; and (4) seamlessly integrating the generated subject in $R$ with the non-target content of $I_e$.
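To make the region-restricted blending idea concrete, below is a minimal, hypothetical sketch of how per-token features could be mixed inside $R$ during diffusion denoising. This is not the released implementation; the function and tensor names (`blend_self_attention`, `feat_edit`, `feat_ref`, `region_mask`) are illustrative assumptions: inside the target region the tokens come from the reference branch, while outside it the original image's tokens are kept so the non-target area stays unchanged.

```python
# Hypothetical sketch of region-masked feature blending, NOT the authors' code.
import torch


def blend_self_attention(feat_edit: torch.Tensor,
                         feat_ref: torch.Tensor,
                         region_mask: torch.Tensor) -> torch.Tensor:
    """Blend per-token features between the edited and reference branches.

    feat_edit:   (B, N, C) tokens from the image being edited
    feat_ref:    (B, N, C) tokens carrying the reference subject
    region_mask: (B, N, 1) 1 for tokens inside the target region R, 0 elsewhere
    """
    # Inside R, take features from the reference branch; outside R, keep the
    # original image's features so the non-target area remains untouched.
    return region_mask * feat_ref + (1.0 - region_mask) * feat_edit


if __name__ == "__main__":
    B, N, C = 1, 64 * 64, 320
    feat_edit = torch.randn(B, N, C)
    feat_ref = torch.randn(B, N, C)
    mask = torch.zeros(B, N, 1)
    mask[:, : N // 4] = 1.0      # pretend the first quarter of tokens lie in R
    blended = blend_self_attention(feat_edit, feat_ref, mask)
    print(blended.shape)         # torch.Size([1, 4096, 320])
```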
Qualitative comparison with existing state-of-the-art methods. PBE and AnyDoor are guided only by images, while BLD uses text as the only guidance. To further evaluate our method, we also set up two groups of two-step baselines: image stitching and harmonization followed by text-guided image editing (DCCF + IP2P, MasaCtrl), and editing first followed by harmonization (IP2P + DCCF). These methods can each handle only text or image guidance, and only global or local editing. Our method outperforms all of them and overcomes their limitations, achieving text- and image-guided local editing and generation.
Some creative applications. As shown in the first row, given an indoor scene and a collection of materials, our method can edit the interior decorations and furnishings using reference subjects from the material library. Our method can also be applied to cross-domain graphic design, as shown in the second row, where cartoon characters are generated directly in real-world scenes.
We show additional visual comparison results. Across these examples, our method overcomes the limitations of the compared approaches and achieves stronger generative performance.
Refer to the PDF paper linked above for more details on the qualitative, quantitative, and ablation studies.
@inproceedings{li2024tuning,
title={Tuning-Free Image Customization with Image and Text Guidance},
author={Li, Pengzhi and Nie, Qiang and Chen, Ying and Jiang, Xi and Wu, Kai and Lin, Yuhuan and Liu, Yong and Peng, Jinlong and Wang, Chengjie and Zheng, Feng},
booktitle={European Conference on Computer Vision},
year={2024}
}