Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advances in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce SILMM, a model-agnostic iterative self-improvement framework that enables LMMs to provide helpful and scalable self-feedback and to optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations, but it is less suitable for LMMs with continuous visual features, where generation probabilities are difficult to obtain. To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO (KC-DPO) for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
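To make the preference-optimization step concrete, below is a minimal PyTorch sketch of the standard DPO loss over discrete visual tokens, together with one plausible kernel surrogate for continuous features. The function names, the RBF kernel choice, and the bandwidth parameter are illustrative assumptions, not the paper's exact KC-DPO formulation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Standard DPO: push the policy to prefer the image with the higher
    # self-feedback score, regularized against a frozen reference model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, preferred image
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, dispreferred image
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def rbf_log_kernel(z, z_ref, bandwidth=1.0):
    # Hypothetical surrogate for the log-likelihood of continuous visual
    # features: an (unnormalized) RBF kernel between the policy's features z
    # and the reference model's features z_ref. The paper's KC-DPO may use a
    # different kernel; this only illustrates the substitution.
    return -(z - z_ref).pow(2).sum(dim=-1) / (2.0 * bandwidth ** 2)

For discrete LMMs, the per-sequence log-probabilities come directly from the visual-token softmax; for continuous LMMs, a kernel term such as the one above can stand in where exact generation probabilities are unavailable.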
Schematic illustration of SILMM, comprising five steps: 1) LMMs generate compositional prompts by sampling, conditioned on the provided instructions. 2) Diverse representations and images are generated using either discrete nucleus sampling or the proposed continuous DivDrop. 3) LMMs decompose each compositional prompt into semantic units and generate a question for each unit. 4) VQA is performed to answer these questions, and the answers and their likelihoods are aggregated into alignment scores that serve as self-feedback. 5) For alignment tuning, DPO is applied to discrete LMMs, while the proposed KC-DPO is used for continuous LMMs.
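As a rough illustration of steps 3-4, the sketch below aggregates per-unit VQA confidences into an alignment score. Here decompose_prompt, generate_question, and vqa_yes_probability are hypothetical wrappers around LMM calls, and simple averaging is only one possible aggregation.

from statistics import mean

def alignment_score(lmm, prompt, image):
    # Step 3: split the compositional prompt into semantic units and write
    # one verification question per unit.
    units = lmm.decompose_prompt(prompt)  # e.g. ["a red apple", "on a wooden table"]
    questions = [lmm.generate_question(u) for u in units]
    # Step 4: answer each question via VQA and aggregate the "yes" likelihoods
    # into a single alignment score used as self-feedback.
    return mean(lmm.vqa_yes_probability(image, q) for q in questions)

Images with higher scores then act as the preferred samples, and lower-scoring ones as the dispreferred samples, for the alignment tuning in step 5.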
Performance comparison and improvement of the proposed method for compositional text-to-image generation on T2I-CompBench++, DPG-Bench, and TIFA. Alignment scores are calculated using expert understanding models (e.g., VQA or object detection models) recommended by these benchmarks.
@article{qu2024silmm,
  title={SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation},
  author={Qu, Leigang and Li, Haochuan and Wang, Wenjie and Liu, Xiang and Li, Juncheng and Nie, Liqiang and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2412.05818},
  year={2024}
}