AddMe: Zero-shot Group-photo Synthesis by Inserting People into Scenes

John Dong1,2, Maomao Li1, Yunfei Liu1, Qin Guo2, Ailing Zeng1, Tianyu Yang1, Yu Li*1
1International Digital Economy Academy (IDEA), 2Peking University
*Corresponding Author

AddMe allows for the insertion of customized portraits into any scene. Left: masked image. Right: generated image.

Teaser Image

We present AddMe, an effective framework for inserting a new portrait into an existing image with just one reference face.

Abstract

While large text-to-image diffusion models have made significant progress in high-quality image generation, challenges persist when users insert their portraits into existing photos, especially group photos. Concretely, existing customization methods struggle to insert facial identities at desired locations in existing images, and it is difficult for existing local image editing methods to deal with facial details. To address these limitations, we propose AddMe, a powerful diffusion-based portrait generator that can insert a given portrait into a desired location in an existing scene image in a zero-shot manner. Specifically, we propose a novel identity adapter to learn a facial representation decoupled from existing characters in the scene. Meanwhile, to ensure that the generated portrait can interact properly with others in the existing scene, we design an enhanced portrait attention module to capture contextual information during the generation process. Our method is compatible with both text and various spatial conditions, enabling precise control over the generated portraits. Extensive experiments demonstrate significant improvements in both performance and efficiency.

Approach

Our proposed group-photo synthesizer AddMe comprises two key components: (1) a lightweight disentangled identity adapter, which modulates the face of the reference ID image and maps it to an identity-consistent character portrait, and (2) an enhanced portrait attention (EPA) module, which ensures that the generated character portrait has a pose and clothing that harmonize with the other characters in the context while seamlessly integrating the foreground and background. We then connect them through residual connections to form a plug-and-play human essence module (HEM) for T2I diffusion inpainting models.
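The sketch below illustrates one way the pieces described above could fit together in PyTorch. The class names (IdentityAdapter, EnhancedPortraitAttention, HumanEssenceModule), feature dimensions, and the use of plain cross-attention layers are illustrative assumptions rather than the paper's exact implementation; only the overall wiring (identity tokens for the reference face, scene-aware attention, residual connections into the inpainting UNet features) follows the description above.

```python
# Minimal sketch, assuming a PyTorch-style implementation. All names and sizes
# here are illustrative, not the released code.
import torch
import torch.nn as nn


class IdentityAdapter(nn.Module):
    """Projects reference-face features into identity tokens, kept disentangled
    from the characters already present in the scene."""
    def __init__(self, face_dim=512, token_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, token_dim * num_tokens),
            nn.LayerNorm(token_dim * num_tokens),
        )
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, face_embed):                       # (B, face_dim)
        tokens = self.proj(face_embed)
        return tokens.view(-1, self.num_tokens, self.token_dim)  # (B, N, D)


class EnhancedPortraitAttention(nn.Module):
    """Cross-attends masked-region features to the full scene context so the
    inserted portrait picks up pose/clothing cues from surrounding people."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, portrait_feats, scene_feats):      # (B, Lp, D), (B, Ls, D)
        ctx, _ = self.attn(self.norm(portrait_feats), scene_feats, scene_feats)
        return portrait_feats + ctx                      # residual connection


class HumanEssenceModule(nn.Module):
    """Plug-and-play block combining the identity adapter and EPA; its output
    feeds back into the inpainting UNet's intermediate features."""
    def __init__(self, dim=768):
        super().__init__()
        self.id_adapter = IdentityAdapter(token_dim=dim)
        self.epa = EnhancedPortraitAttention(dim=dim)
        self.id_attn = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, unet_feats, scene_feats, face_embed):
        id_tokens = self.id_adapter(face_embed)
        # Inject identity via cross-attention on the reference-face tokens.
        id_ctx, _ = self.id_attn(unet_feats, id_tokens, id_tokens)
        # Harmonize the identity-conditioned features with the scene context.
        return self.epa(unet_feats + id_ctx, scene_feats)


# Example: 64 spatial tokens from the UNet, 256 scene-context tokens.
hem = HumanEssenceModule(dim=768)
out = hem(torch.randn(1, 64, 768), torch.randn(1, 256, 768), torch.randn(1, 512))
```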


Control via Mask Shape, Text, and Spatial Conditions


Various applications of AddMe. Our method supports different types of location masks, text-prompt control, and spatial control.

Arbitrary Insertion


Our method supports the insertion of portraits at arbitrary locations in the target scene image.

Text Editability


Uncurated samples generated with text prompts. Our method maintains facial identity fidelity while demonstrating excellent text-editing capabilities, allowing free editing of clothing and poses or the introduction of other objects.

AddMe-Data & AddMe-Benchmark


Driven by the proposed customized editing task, we create AddMe-1.6M, a large-scale, high-quality multimodal human-face dataset with instance-level annotations that can be used for various customized portrait generation tasks. The dataset comprises 2.3 million face-position-text pairs spread across 1.6 million scene images, of which 1 million are single-person and 600,000 are multi-person scenes, providing ample data for training our model. Our benchmark comprises 500 target scene images and 20 reference ID images, which can be combined into 10,000 images for testing purposes.
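Below is a minimal sketch of how one face-position-text annotation and the benchmark pairing might be represented. The field names and file layout are assumptions for illustration; only the counts (500 scenes x 20 reference IDs = 10,000 test cases) come from the description above.

```python
# Hypothetical annotation record and benchmark enumeration; field names and
# paths are assumptions, not the released dataset format.
from dataclasses import dataclass
from itertools import product
from typing import Tuple


@dataclass
class FacePositionText:
    scene_image: str                  # path to the scene image
    face_crop: str                    # path to the cropped reference face
    bbox: Tuple[int, int, int, int]   # (x, y, w, h) location of the face in the scene
    caption: str                      # instance-level text description


# Benchmark: every reference ID is inserted into every target scene.
scenes = [f"scene_{i:03d}.jpg" for i in range(500)]
ref_ids = [f"id_{j:02d}.jpg" for j in range(20)]
test_cases = list(product(scenes, ref_ids))
assert len(test_cases) == 10_000
```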

Comparison with Existing Methods


Comparison with text-driven and exemplar-based methods on single-portrait insertion. For text-driven methods, we use the name of the reference character in the prompt (e.g., Joe Biden). We also show a comparison of settings at the bottom.

Robustness


Visual ablation studies of various mask sizes in our method.


Uncurated samples of portrait variations created using different random seeds. While maintaining facial fidelity, our method exhibits outstanding generative diversity, paving the way for potential applications such as virtual try-on.

BibTeX