DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

1University of Southern California, 2Harvard University
3Microsoft Research Asia, 4Microsoft Research Redmond
*=equal contribution as second author, =equal contribution

DreamDistribution learns a prompt distribution from a set of reference images; the learned distribution can then be used to generate new 2D/3D instances, supports text-guided editing, and more.

Abstract

The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating new instances with sufficient variation. We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts, enabling the generation of novel images by sampling prompts from the learned distribution. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D. Finally, we demonstrate the effectiveness of our approach through quantitative analysis, including automatic evaluation and human assessment.

Method

We maintain a set of K learnable soft prompts and model a distribution over them in the CLIP text encoder feature space. Only the prompts are learnable; the CLIP encoder and the T2I diffusion model remain fixed. We use the reparameterization trick to sample from the prompt distribution and update the learnable prompts through backpropagation. The training objective encourages the generated images to align with the reference images. An additional orthogonal loss is incorporated to promote differentiation among the learnable prompts. At inference, we similarly sample from the prompt distribution in the text feature space to guide the pretrained T2I generation.
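
The PyTorch sketch below illustrates the core idea under stated assumptions: K learnable soft prompts, a Gaussian fit over their features from a frozen text encoder, reparameterized sampling, and an orthogonality penalty. The helper `encode_text`, the tensor shapes, and all sizes are illustrative placeholders, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: K soft prompts, each M learnable tokens of dimension D.
K, M, D = 32, 8, 1024
soft_prompts = torch.nn.Parameter(0.02 * torch.randn(K, M, D))

def sample_prompt_feature(encode_text, n_samples=1):
    """Fit a per-dimension Gaussian over the text features of the K prompts
    and sample with the reparameterization trick, so gradients from the
    (frozen) diffusion loss flow back into `soft_prompts`.

    `encode_text` stands in for the frozen CLIP text encoder: it maps the
    (K, M, D) prompt token embeddings to (K, L, C) text features.
    """
    feats = encode_text(soft_prompts)            # (K, L, C)
    mu, sigma = feats.mean(dim=0), feats.std(dim=0)
    eps = torch.randn(n_samples, *mu.shape, device=feats.device)
    return mu + sigma * eps                      # (n_samples, L, C)

def orthogonal_loss():
    """Penalize pairwise similarity between prompts so they stay distinct."""
    flat = F.normalize(soft_prompts.flatten(1), dim=-1)    # (K, M*D)
    sim = flat @ flat.t()                                  # (K, K) cosine similarities
    off_diag = sim - torch.eye(K, device=sim.device)
    return off_diag.abs().sum() / (K * (K - 1))
```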

Comparison with Other Methods

Given a set of training images (typically 5–20; only 4 are shown here), we compare our generation results with existing methods. We use Stable Diffusion version 2.1 for all methods. As shown in the bottom row, our method generates more diverse and coherent images.
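
As an illustration of how a sampled prompt feature can condition a frozen Stable Diffusion 2.1 backbone, the sketch below passes a conditioning tensor through the diffusers `prompt_embeds` argument. The random tensor stands in for an actual sample from a learned prompt distribution, and the shape (1, 77, 1024) is an assumption about SD 2.1's text-conditioning format.

```python
import torch
from diffusers import StableDiffusionPipeline

# Frozen Stable Diffusion 2.1 backbone; the learned prompt distribution only
# supplies the text conditioning.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Stand-in for a sample drawn from a learned prompt distribution; SD 2.1
# expects text conditioning of shape (batch, 77, 1024).
prompt_embeds = torch.randn(1, 77, 1024, dtype=torch.float16, device="cuda")

image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=50).images[0]
image.save("sampled_instance.png")
```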

Diverse In-distribution 2D Generation

More generation results from reference images. The left columns show reference images, and the right columns show generated images.

Text-guided In-distribution 2D Generation

Diverse In-distribution Instance Generation with Text-guided Editing

Results on the text editability of our method. The left column shows samples of reference images; the right columns show results generated with the corresponding prompts.
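
One plausible way to realize such editing is sketched below: the extra description is embedded with the frozen CLIP token embedding and appended to a soft prompt drawn from the learned distribution before the text transformer encodes the combined sequence. `encode_tokens` is a hypothetical wrapper around the frozen text transformer, not a stock API, and the exact mechanism in our implementation may differ.

```python
import torch

def edit_prompt(soft_prompt_tokens, extra_text, tokenizer, token_embedding, encode_tokens):
    """Hedged sketch of text-guided editing: embed the extra description with
    the frozen CLIP token embedding, append it to a sampled soft prompt, and
    encode the combined sequence as conditioning for the T2I model.

    `encode_tokens` is an assumed helper that runs the frozen CLIP text
    transformer directly on token embeddings.
    """
    ids = tokenizer(extra_text, return_tensors="pt").input_ids          # (1, T)
    text_tokens = token_embedding(ids)[0]                               # (T, D)
    combined = torch.cat([soft_prompt_tokens, text_tokens], dim=0)      # (M + T, D)
    return encode_tokens(combined.unsqueeze(0))                         # (1, L, C) conditioning
```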

Application on Text-to-3D Generation

3D generation results obtained by learning a prompt distribution over the reference images and then running inference with MVDream (without additional text).

Application on Text-to-3D Generation with Text-guided Editing

3D generation results obtained by learning a prompt distribution over the reference images and then running inference with text-guided editing using MVDream.

Composition of Prompt Distributions

Composition of prompt distributions via linear interpolation between distributions learned from two sets of images. The mixing ratio changes linearly from left to right; the middle columns show mixtures of the two sets.
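
A minimal sketch of one way such mixing could be implemented, assuming each set of reference images yields a Gaussian prompt distribution with mean mu and standard deviation sigma; the exact mixing rule in our implementation may differ.

```python
import torch

def mix_distributions(mu_a, sigma_a, mu_b, sigma_b, alpha):
    """Interpolate two learned prompt distributions and draw one sample.
    alpha = 0 reproduces distribution A, alpha = 1 reproduces B, and
    intermediate values yield mixtures of the two concepts."""
    mu = (1 - alpha) * mu_a + alpha * mu_b
    sigma = (1 - alpha) * sigma_a + alpha * sigma_b
    eps = torch.randn_like(mu)
    return mu + sigma * eps   # conditioning for the frozen T2I model
```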

Scaling Variance of Prompt Distribution

Effect of scaling the variance of a learned prompt distribution by multiplying it with a scalar γ. Image diversity increases as the scaling factor γ increases.
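
Under the same Gaussian assumption, multiplying the variance by γ amounts to scaling the standard deviation by sqrt(γ) when sampling, as in the sketch below.

```python
import torch

def sample_scaled(mu, sigma, gamma):
    """Sample with the variance scaled by gamma (i.e. the standard deviation
    scaled by sqrt(gamma)); gamma = 0 collapses to the mean prompt, while
    larger gamma yields more diverse generations."""
    eps = torch.randn_like(mu)
    return mu + (gamma ** 0.5) * sigma * eps
```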

Synthetic ImageNet

Comparison between training images generated using our method and images generated using only class names as text prompts. In each group of two rows, the top row shows samples of our generated training images, and the bottom row shows training images generated using class names as text prompts.