Unified image generation and editing models suffer from severe task interference in dense diffusion transformer architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing vs. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic: they route on local token features and are unaware of global task intent. This task-agnostic routing prevents meaningful specialization and fails to resolve the underlying task interference.
In this paper, we propose a novel framework that injects semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme that creates structured task descriptors (e.g., scope, type, preservation). We then design a Predictive Alignment Regularization that aligns the router's internal decisions with the task's high-level semantics, evolving the gating network from a task-agnostic executor into a task-aware dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear, semantically correlated specializations.
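To make the annotation scheme concrete, the sketch below shows what a structured task descriptor might look like. The three fields (scope, type, preservation) follow the examples above, but the exact schema and field vocabulary are illustrative assumptions, not the paper's specification.

```python
# Hypothetical hierarchical task descriptors; field names and values
# are illustrative assumptions based on the scope/type/preservation examples.
local_edit_task = {
    "scope": "local",              # local region vs. whole image
    "type": "editing",             # the macroscopic task category
    "preservation": "background",  # content that must remain unchanged
}

subject_gen_task = {
    "scope": "global",
    "type": "subject-driven-generation",
    "preservation": "subject-identity",
}
```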
Our unified framework employs a Multimodal Diffusion Transformer (MM-DiT) with MoE layers for efficient, dynamic task handling. On top of the hierarchical task semantic annotation, we design a semantic-aligned router that guides expert specialization by aligning its routing decisions with these explicit task semantics in an interpretable manner.
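The following is a minimal sketch of a sparse MoE feed-forward sublayer as it might sit inside an MM-DiT block. The expert count, top-k value, hidden dimensions, and expert architecture are all assumptions for illustration; the source only states that MoE layers replace the dense feed-forward path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Sketch of a sparse MoE feed-forward sublayer with top-k routing.
    Dimensions, expert count, and top_k are illustrative assumptions."""

    def __init__(self, dim=1024, hidden=4096, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network (router): scores each token against every expert.
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, dim)
        logits = self.gate(x)                           # (tokens, n_experts)
        weights = F.softmax(logits, dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)   # route each token to k experts
        topw = topw / topw.sum(dim=-1, keepdim=True)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e                  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += topw[mask, k:k + 1] * expert(x[mask])
        # The gate distribution is returned so it can feed the
        # predictive-alignment loss sketched below.
        return out, weights
```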
We design a novel semantic-aligned gating network that forces the model's internal routing strategy, encoded as a routing signature "g", to predict the task's macroscopic semantics, encoded as a semantic embedding "s". This predictive alignment acts as a bridge between local routing decisions and global task intent.
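One way this predictive alignment could be realized is sketched below: the routing signature g (here, gate probabilities mean-pooled over tokens) is mapped by a small head to a prediction of the task embedding s, and a cosine loss penalizes the mismatch. The pooling choice, the linear predictor, and the cosine objective are our assumptions; the source only specifies that g must predict s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveAlignment(nn.Module):
    """Sketch of predictive alignment regularization: the routing
    signature g predicts the task's semantic embedding s. Mean pooling,
    a linear head, and a cosine loss are illustrative assumptions."""

    def __init__(self, n_experts=8, sem_dim=256):
        super().__init__()
        self.head = nn.Linear(n_experts, sem_dim)

    def forward(self, gate_weights, s):
        # gate_weights: (tokens, n_experts) softmax outputs from the router;
        # s: (sem_dim,) embedding of the task's structured descriptor.
        g = gate_weights.mean(dim=0)      # routing signature g
        s_hat = self.head(g)              # predicted task semantics
        # Alignment loss: 1 - cosine similarity between prediction and s.
        return 1.0 - F.cosine_similarity(s_hat, s, dim=-1)
```

In training, this term would presumably be added to the diffusion objective with a weighting coefficient (e.g., L_total = L_diff + lambda * L_align); the weighting scheme is an assumption, as the source does not state it.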
@article{xu2025tagmoe,
  title={TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts},
  year={2025}
}