HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads

1Institute of Computing Technology, Chinese Academy of Sciences
2Institute of Automation, Chinese Academy of Sciences
3Beihang University
4University of Konstanz
5National Cheng-Kung University

Results of HeadRouter demonstrate accurate text-guided semantic representation while preserving consistency with the source image across diverse editing tasks.

Abstract

Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures, which can utilize self/cross-attention maps for semantic editing, MM-DiTs lack explicit and consistent incorporation of text guidance, resulting in semantic misalignment between the edited results and the text. In this study, we reveal the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module that refines text/image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter's strong performance in terms of editing fidelity and image quality.
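As a concrete illustration of the head-sensitivity observation, the minimal PyTorch sketch below (not the authors' released code; all tensor shapes, names, and token layouts are assumptions) measures how strongly each head of a joint self-attention map lets a given text token influence the image tokens:

import torch

def head_sensitivity(attn, text_idx, num_text_tokens):
    # attn: joint self-attention weights, shape (heads, seq, seq), where the
    # sequence concatenates text tokens [0:num_text_tokens] and image tokens
    # [num_text_tokens:]. text_idx is the text token describing the edited
    # concept. Returns, per head, the average attention mass that image-token
    # queries place on that text-token key.
    img_to_text = attn[:, num_text_tokens:, text_idx]   # (heads, num_image_tokens)
    return img_to_text.mean(dim=-1)                     # (heads,)

# Toy example: 24 heads, 77 text tokens + 1024 image tokens.
heads, n_txt, n_img = 24, 77, 1024
attn = torch.rand(heads, n_txt + n_img, n_txt + n_img).softmax(dim=-1)
scores = head_sensitivity(attn, text_idx=5, num_text_tokens=n_txt)
print(scores.argsort(descending=True)[:4])  # heads most sensitive to the concept

Heads with high scores are the ones through which this particular semantic flows; routing text guidance toward them is the idea the method builds on.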

Method


Our method is based on two key insights: (a) different image semantics are adaptively distributed across different attention heads in MM-DiTs, and (b) critical regions can be identified and extracted in the joint self-attention map where text tokens influence image tokens. In light of these observations, we present HeadRouter, a training-free image editing framework for MM-DiTs. We propose an instance-adaptive attention head router (IARouter) that adaptively activates attention heads according to their semantic sensitivity, enabling a more accurate expression of the specific semantics being edited. We further propose a dual-token refinement module (DTR) that combines self-enhancement of image tokens with text-token rectification to strengthen text-guided editing features in key regions and in deep joint self-attention blocks.
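The sketch below illustrates the two modules under assumed shapes, omitting the per-timestep scheduling and mask-extraction details; it is an illustration of the mechanism, not the released HeadRouter implementation. IARouter-style routing scales each head's output by a normalized sensitivity score (e.g., from head_sensitivity() above), and a DTR-style step self-enhances image tokens in the edited region while rectifying the text token toward the target embedding:

import torch

def route_heads(head_outputs, sensitivity, temperature=0.5):
    # head_outputs: (heads, tokens, dim) per-head attention outputs.
    # sensitivity:  (heads,) scores. Softmax turns scores into per-head
    # gates; rescaling keeps the mean gate near 1, so heads sensitive to
    # the edited semantic are amplified and insensitive heads attenuated.
    gates = torch.softmax(sensitivity / temperature, dim=0)   # (heads,)
    gates = gates * gates.numel()                             # mean gate ~= 1
    return head_outputs * gates[:, None, None]

def refine_tokens(img_tokens, edit_mask, txt_token, target_txt, alpha=0.3):
    # img_tokens: (tokens, dim); edit_mask: (tokens,) in [0, 1] marking the
    # edited region; txt_token / target_txt: (dim,) current and target text
    # embeddings. Image tokens inside the mask are self-enhanced, and the
    # text token is nudged (rectified) toward the target semantics.
    img = img_tokens + alpha * edit_mask[:, None] * img_tokens
    txt = (1 - alpha) * txt_token + alpha * target_txt
    return img, txt

In a full pipeline, such steps would run inside each joint self-attention block of the denoising loop: sensitivity scores are recomputed for each input instance (hence "instance-adaptive"), and the refined tokens feed the subsequent block.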

Image Results

Comparisons with baselines


More of our results





BibTeX

@article{xu2024headrouter,
  title={HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads},
  author={Xu, Yu and Tang, Fan and Cao, Juan and Zhang, Yuxin and Kong, Xiaoyu and Li, Jintao and Deussen, Oliver and Lee, Tong-Yee},
  journal={arXiv preprint arXiv:2411.15034},
  year={2024}
}
      