PRISM: Long-Text-to-Image Generation via Compositional Prompt Decomposition

Jen-Yuan Huang1, Tong Lin1, Yilun Du2,
1Peking University  2Harvard University

TL;DR: We decompose long prompts into manageable components, and compose their independent noise predictions to improve prompt-length generalization of pre-trained T2I models.

Example prompt: "The image presents a 3D rendering of a horse, captured in a profile view. The horse is depicted in a state of motion, with its mane and tail flowing behind it. The horse's body is composed of a network of lines and curves, suggesting a complex mechanical structure. This intricate design is further emphasized by the presence of gears and other mechanical components, which are integrated into the horse's body. The background of the image is a dark blue, providing a stark contrast to the horse and its mechanical components. The overall composition of the image suggests a blend of organic and mechanical elements, creating a unique and intriguing visual."

The example prompt decomposed into Components 0–3, and their Composition

Abstract

While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture key details when the input is a descriptive paragraph. This limitation stems from the concise captions that dominate their training distributions. Existing methods attempt to bridge this gap either by fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths, or by projecting oversized inputs into the normal-prompt space, which compromises fidelity.

We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long inputs. PRISM uses a lightweight module to extract constituent representations from the long prompt. The T2I model makes an independent noise prediction for each component, and these predictions are merged into a single denoising step via energy-based conjunction.

We evaluate PRISM across a wide range of model architectures, showing performance comparable to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior length generalization, outperforming baseline models by 7.4% on prompts over 500 tokens on a challenging public benchmark.

The Challenge of Long-Prompt Generalization

Large-scale text-image datasets like LAION are dominated by concise, caption-like prompts, so T2I models trained on them struggle to process prompts longer than those seen during training.

Existing methods attempt to bridge this gap via two main strategies, each with its own trade-off:

  • Fine-Tuning: Directly tuning the model on long-captioned data is effective within the tuned lengths, but extrapolating to even longer prompts remains challenging. Moreover, tuned models risk catastrophic forgetting of their pre-trained knowledge.
  • Projecting: Mapping oversized inputs into the effective context window of pre-trained models. These methods introduce an information bottleneck, sacrificing the intricate details that make long prompts compelling in the first place.
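To make the trade-off concrete, here is a toy sketch contrasting the two strategies. The 77-token window and whitespace tokenization are illustrative assumptions, and the sentence-level split is a naive stand-in, not PRISM's actual decomposition module: projection-style truncation discards everything past the window, while a decomposition keeps every sentence by grouping them into window-sized components.

```python
def truncate(prompt, max_tokens=77):
    """Projection-style baseline: keep only what fits in the window."""
    return " ".join(prompt.split()[:max_tokens])

def decompose(prompt, max_tokens=77):
    """Decomposition-style alternative: greedily group sentences into
    chunks that each fit in the window, so no detail is dropped."""
    chunks, cur, cur_len = [], [], 0
    for sent in prompt.split(". "):
        n = len(sent.split())
        if cur and cur_len + n > max_tokens:
            chunks.append(". ".join(cur))
            cur, cur_len = [], 0
        cur.append(sent)
        cur_len += n
    if cur:
        chunks.append(". ".join(cur))
    return chunks
```

Truncation loses every token past the budget; decomposition preserves all of them at the cost of producing multiple conditions that must later be reconciled.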

Comparison of different Long-Text-to-Image generation strategies


Our Compositional Approach

Instead of forcing a model to process an out-of-distribution sequence length, our PRISM generalizes to long prompts compositionally.

  1. Decomposing: PRISM uses a lightweight semantic decomposition module to extract constituent representations directly from the long-prompt encoding.
  2. Parallel Processing: At each diffusion step, the pre-trained T2I model takes the extracted components as generation conditions, processing them in parallel into independent noise predictions.
  3. Energy-Based Conjunction: These predictions are then merged into a single composite score via energy-based conjunction. The composed score steers the generation process toward the distribution that satisfies all the semantic components jointly.
  4. Unsupervised Training: The decomposition module is optimized with both the text encoder and the image generative model kept frozen, learning to distribute information within the pre-trained model's capacity.
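Steps 2–3 above can be sketched as follows. This is a minimal illustration of energy-based conjunction in the style of composable diffusion, where each component's deviation from the unconditional score is summed; the `denoise` callable and per-component weights are hypothetical placeholders, not PRISM's actual implementation.

```python
import numpy as np

def composed_noise_prediction(denoise, x_t, t, component_embs, uncond_emb,
                              weights=None):
    """Merge independent per-component noise predictions into one
    denoising step via energy-based conjunction (a product of experts).

    denoise(x, t, cond) -> predicted noise, same shape as x
    component_embs      : one conditioning embedding per decomposed component
    uncond_emb          : unconditional (empty-prompt) embedding
    weights             : per-component guidance scales (default 1.0 each)
    """
    if weights is None:
        weights = [1.0] * len(component_embs)

    eps_uncond = denoise(x_t, t, uncond_emb)
    eps = eps_uncond.copy()
    # Each component contributes an independent conditional score; summing
    # their scaled deviations from the unconditional score corresponds to
    # sampling from the conjunction (product) of the component distributions.
    for emb, w in zip(component_embs, weights):
        eps = eps + w * (denoise(x_t, t, emb) - eps_uncond)
    return eps
```

With a single component and weight, this reduces to standard classifier-free guidance; with several components, each denoiser call is independent and can run in parallel across a batch.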

Our compositional Long-Text-to-Image generation pipeline overview


Improved Generalization across Prompt Lengths

While fine-tuned models' performance degrades sharply on prompts exceeding their training data, PRISM maintains robust performance across all tested prompt lengths. Notably, PRISM outperforms the baselines by an average of 7.4% on prompts over 500 tokens.

Generation performance across various prompt lengths (higher is better)

This superior generalization becomes visible as we gradually increase the prompt length: PRISM continues to incorporate additional scene details, while the baseline plateaus at around 200 tokens. Note that the two models are trained on the same dataset.

Generation samples at 200 vs. 500 tokens

Long-Text-to-Image Generation Samples

PRISM is a general-purpose method and is also compatible with modern T2I architectures that use powerful LLMs as their text encoders. PRISM allows state-of-the-art models to accurately interpret intricate paragraphs and render finer details in complex scenes.

Comparison of image generation samples across state-of-the-art T2I models


Acknowledgment

We thank the anonymous reviewers for their valuable feedback.

BibTeX

@inproceedings{huang2026prism,
  title={PRISM: Long-Text-to-Image Generation via Compositional Prompt Decomposition},
  author={Huang, Jen-Yuan and Lin, Tong and Du, Yilun},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}