DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models


KAIST AI    NAVER WEBTOON AI    Yonsei University
Work done while interning at NAVER WEBTOON AI   Corresponding author

Figure 1. Generated images with multimodal conditions. By incorporating various types of input modalities (1st row), DiffBlender successfully synthesizes high-fidelity and diverse samples, aligned with user preferences (2nd row).

Abstract

In this study, we aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual description, such as sketch, box, color palette, and style embedding, within a single model. We thus design a multimodal T2I diffusion model, coined DiffBlender, by separating the channels of conditions into three types, i.e., image forms, spatial tokens, and non-spatial tokens. The unique architecture of DiffBlender facilitates adding new input modalities, pioneering a scalable framework for conditional image generation. Notably, we achieve this without altering the parameters of the existing generative model, Stable Diffusion, updating only partial components. Our study establishes new benchmarks in multimodal generation through quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender faithfully blends all the provided information and showcase its various applications in detailed image synthesis.

Model Architecture

  • We design DiffBlender to facilitate the convenient addition of different modalities by categorizing them into three types: image-form conditions, spatial tokens, and non-spatial tokens.
  • The design extends intuitively to additional modalities while keeping the training cost low, since only small hypernetworks are updated; a conceptual sketch of this conditioning scheme follows Figure 2 below.

Figure 2. Overview of DiffBlender architecture. (a) illustrates the four types of conditions employed in DiffBlender and indicates where each piece of information enters the UNet layers. (b) zooms into the purple region of (a) to detail DiffBlender's conditioning process. The lock-marked layers are kept fixed at the original Stable Diffusion (SD) parameters, while the remaining modules, small hypernetworks, are the learnable parameters of DiffBlender.
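
To make this concrete, below is a minimal, self-contained sketch of the conditioning idea: a frozen layer standing in for a Stable Diffusion block is augmented with small trainable branches for image-form conditions, spatial tokens, and non-spatial tokens. All module names, shapes, and fusion choices here are illustrative assumptions, not DiffBlender's actual implementation.

import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    def __init__(self, channels=320, token_dim=768):
        super().__init__()
        # Frozen layer standing in for an original Stable Diffusion block.
        self.frozen = nn.Conv2d(channels, channels, 3, padding=1)
        for p in self.frozen.parameters():
            p.requires_grad = False
        # Trainable hypernetwork branches (the only updated parameters here).
        self.image_form = nn.Conv2d(2, channels, 3, padding=1)  # e.g., sketch + depth maps
        self.spatial_tokens = nn.Linear(token_dim, channels)    # e.g., box/keypoint tokens
        self.non_spatial = nn.Linear(token_dim, channels)       # e.g., color/style embeddings

    def forward(self, h, image_cond, spatial_tok, global_tok):
        h = self.frozen(h)
        h = h + self.image_form(image_cond)                                # pixel-aligned residual
        h = h + self.spatial_tokens(spatial_tok).mean(1)[..., None, None]  # pooled token bias
        h = h + self.non_spatial(global_tok)[..., None, None]              # global modulation
        return h

block = ConditionedBlock()
h = torch.randn(1, 320, 64, 64)
out = block(h, torch.randn(1, 2, 64, 64), torch.randn(1, 4, 768), torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 320, 64, 64])

In the actual model, only such small added modules are trained, while the pretrained SD weights remain untouched.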

Multimodal Text-to-Image Generation


Versatile applications of DiffBlender

DiffBlender enables flexible manipulation of conditions, providing customized generation aligned with user preferences. Note that all results are produced by our single model in one pass, not by applying conditions sequentially.
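
As a rough illustration of what this flexibility looks like in practice, the snippet below bundles several conditions and then edits them independently before a single sampling pass. The condition schema and any pipeline consuming it are hypothetical, shown only to convey the idea.

import torch

conditions = {
    "text": "a cozy cabin in a snowy forest",
    "sketch": torch.zeros(1, 1, 512, 512),        # edge/sketch map
    "depth": torch.zeros(1, 1, 512, 512),         # depth map
    "boxes": [((0.1, 0.2, 0.5, 0.8), "cabin")],   # normalized box + phrase
    "color_palette": torch.rand(1, 3, 8),         # eight RGB swatches
    "style_embedding": torch.randn(1, 768),       # e.g., CLIP image embedding
}

# Flexible manipulation: drop one modality and reweight another without
# touching the rest; a single model consumes whatever remains at once.
edited = dict(conditions)
edited.pop("depth")                                              # remove a spatial condition
edited["style_embedding"] = 0.5 * conditions["style_embedding"]  # soften the style

print(sorted(edited.keys()))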


Reference-guided and semantic-preserving generation


Object reconfiguration


Mode-specific guidance
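
One generic way to realize guidance per modality is to combine an unconditional prediction with condition-specific ones, each scaled by its own weight, so the influence of a single modality can be dialed up or down at sampling time. The sketch below shows this common formulation with a toy denoiser; the function names are assumptions, and the paper's exact guidance scheme may differ.

import torch

def guided_noise(eps_fn, x, conds, weights):
    # Combine an unconditional prediction with per-modality conditional ones.
    eps_uncond = eps_fn(x, None)
    eps = eps_uncond.clone()
    for name, cond in conds.items():
        eps = eps + weights.get(name, 1.0) * (eps_fn(x, {name: cond}) - eps_uncond)
    return eps

# Toy stand-in for a denoiser, just to make the sketch executable.
def toy_eps(x, cond):
    shift = 0.0 if cond is None else float(len(cond))
    return 0.1 * x + shift

x = torch.randn(2, 4, 64, 64)
eps = guided_noise(toy_eps, x, {"sketch": "s", "boxes": "b"}, {"sketch": 2.0, "boxes": 0.5})
print(eps.shape)  # torch.Size([2, 4, 64, 64])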


Interpolating non-spatial conditions
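
Since non-spatial conditions such as style embeddings and color palettes are plain vectors, two references can be blended by interpolating between their embeddings while all other conditions stay fixed. Below is a minimal sketch assuming the style embedding is a single tensor; spherical interpolation (slerp) is one common choice, and linear interpolation also works.

import torch

def slerp(a, b, t, eps=1e-8):
    # Spherical interpolation between two embedding vectors.
    a_n = a / (a.norm(dim=-1, keepdim=True) + eps)
    b_n = b / (b.norm(dim=-1, keepdim=True) + eps)
    omega = torch.acos((a_n * b_n).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

style_a = torch.randn(1, 768)  # style embedding of reference image A
style_b = torch.randn(1, 768)  # style embedding of reference image B

# Sweeping t from 0 to 1 morphs the generated style from A to B.
interpolated = [slerp(style_a, style_b, t) for t in torch.linspace(0, 1, 5)]
print(len(interpolated), interpolated[0].shape)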


Manipulating spatial conditions

Functional differences from previous works

The table below summarizes comparisons with previous multimodal T2I diffusion models. Existing studies have limitations: they support only one type of modality, require training models with a substantial number of parameters, or support multimodal inference only at the expensive cost of training independent networks. In contrast, DiffBlender supports various modality types, is trained on all modalities jointly, and selectively updates only partial components, which makes it easy to scale up the set of supported modalities.

Evaluation with baselines

We set new benchmarks for multimodal generation by conducting quantitative and qualitative comparisons with existing approaches. With multimodal conditions, DiffBlender achieves not only high fidelity but also reliable multi-conditional generation, attaining high scores on the YOLO, SSIM, and depth metrics.
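
As an illustration of one such fidelity check, the snippet below computes SSIM between an input condition map and the map re-estimated from a generated image. The random arrays stand in for real depth maps, and the paper's exact evaluation protocol (extraction models, datasets) may differ; only the metric computation itself is shown.

import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
input_depth = rng.random((512, 512)).astype(np.float32)  # conditioning depth map
# Stand-in for a depth map re-estimated from the generated image.
reestimated = np.clip(input_depth + 0.05 * rng.standard_normal((512, 512)), 0.0, 1.0).astype(np.float32)

score = structural_similarity(input_depth, reestimated, data_range=1.0)
print(f"SSIM: {score:.3f}")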

BibTeX


@article{kim2023diffblender,
    title={DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models},
    author={Kim, Sungnyun and Lee, Junsoo and Hong, Kibeom and Kim, Daesik and Ahn, Namhyuk},
    journal={arXiv preprint arXiv:2305.15194},
    year={2023}
}