TL;DR: We decompose long prompts into manageable components, and compose their independent noise predictions to improve prompt-length generalization of pre-trained T2I models.
While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture key details when the inputs are descriptive paragraphs. This limitation stems from the concise captions that dominate their training distributions. Existing methods attempt to bridge this gap either by fine-tuning T2I models on long prompts, which generalizes poorly to even longer lengths, or by projecting the oversized inputs into the normal-prompt space, which compromises fidelity.
We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes independent noise predictions for each component, and their outputs are merged into a single denoising step using energy-based conjunction.
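The merging step can be illustrated with a minimal sketch of energy-based conjunction, in the style of composable diffusion: each component's conditional noise prediction is treated as a deviation from the unconditional prediction, and the weighted deviations are summed into a single combined prediction. The function name, the uniform default weights, and the NumPy setting are illustrative assumptions, not PRISM's actual implementation.

```python
import numpy as np

def compose_noise_predictions(eps_uncond, eps_components, weights=None):
    """Merge per-component noise predictions into one denoising step.

    Energy-based conjunction sketch (hypothetical helper, not the paper's
    code): the combined prediction is the unconditional prediction plus a
    weighted sum of each component's conditional deviation from it.
    """
    if weights is None:
        # Illustrative assumption: weight every component equally.
        weights = [1.0] * len(eps_components)
    eps = eps_uncond.copy()
    for w, eps_c in zip(weights, eps_components):
        eps += w * (eps_c - eps_uncond)
    return eps

# Toy usage: two component predictions merged into one step.
eps_uncond = np.zeros(4)
components = [np.ones(4), 2.0 * np.ones(4)]
combined = compose_noise_predictions(eps_uncond, components)
```

With a single component and unit weight, the composition reduces to that component's conditional prediction, matching ordinary conditional denoising.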
We evaluate PRISM across a wide range of model architectures, showing performance comparable to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens on a challenging public benchmark.
Large-scale text-image datasets like LAION are dominated by concise, caption-like prompts, so T2I models trained on them struggle to process prompts longer than those seen during training.
Existing methods attempt to bridge this gap via two main strategies, each with its own trade-off:
Comparison of different Long-Text-to-Image generation strategies
Instead of forcing a model to process an out-of-distribution sequence length, our PRISM generalizes to long prompts compositionally.
Our compositional Long-Text-to-Image generation pipeline overview
While fine-tuned models' performance degrades sharply on prompts exceeding their training data, PRISM maintains robust performance across all tested prompt lengths. Notably, PRISM outperforms the baselines by an average of 7.4% on prompts over 500 tokens.
Generation performance across various prompt lengths (higher is better)
This superior generalization is visible when we gradually increase the prompt length. PRISM continues to incorporate additional scene details, while the baseline plateaus at around 200 tokens. Note that the two models are trained on the same dataset.
PRISM is a general-purpose generalization method and is also compatible with modern T2I architectures that use powerful LLMs as text encoders. PRISM allows state-of-the-art models to accurately interpret intricate paragraphs and render more details in complex scenes.
Comparison of image generation samples across state-of-the-art T2I models
We thank the anonymous reviewers for their valuable feedback.
@inproceedings{huang2026prism,
title={Long-Text-to-Image Generation via Compositional Prompt Decomposition},
author={Huang, Jen-Yuan and Lin, Tong and Du, Yilun},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}