AddMe allows for the insertion of customized portraits into any scene. Drag the white line to compare the images before and after insertion. Left: masked image. Right: generated image.
While large text-to-image diffusion models have made significant progress in high-quality image generation, challenges remain when users insert their portraits into existing photos, especially group photos. Concretely, existing customization methods struggle to insert facial identities at desired locations in existing images, while existing local image editing methods have difficulty handling facial details. To address these limitations, we propose AddMe, a powerful diffusion-based portrait generator that can insert a given portrait into a desired location in an existing scene image in a zero-shot manner. Specifically, we propose a novel identity adapter that learns a facial representation decoupled from the existing characters in the scene. Meanwhile, to ensure that the generated portrait interacts properly with others in the existing scene, we design an enhanced portrait attention module to capture contextual information during the generation process. Our method is compatible with both text and various spatial conditions, enabling precise control over the generated portraits. Extensive experiments demonstrate significant improvements in both performance and efficiency.
Our proposed group-photo synthesizer AddMe comprises two key components: (1) a lightweight disentangled identity adapter, which modulates the face of the reference ID image and maps it to an identity-consistent character portrait, and (2) an enhanced portrait attention (EPA) module, which ensures that the generated character portraits have proper poses and clothing that harmonize with the other characters in the scene, while seamlessly integrating the foreground and background. We then connect the two through residual connections to form a plug-and-play human essence module (HEM) for T2I diffusion inpainting models.
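The sketch below illustrates one way such a plug-and-play, residual wrapper could be wired around a frozen inpainting UNet block. It is a minimal illustration, not the released implementation: the class names (`IdentityAdapter`, `EnhancedPortraitAttention`, `PlugAndPlayHEM`), dimensions, and token counts are assumptions chosen for readability.

```python
import torch
import torch.nn as nn


class IdentityAdapter(nn.Module):
    """Hypothetical lightweight adapter: projects a face embedding into a
    few conditioning tokens for the diffusion UNet (names are illustrative)."""
    def __init__(self, face_dim=512, token_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, token_dim * num_tokens),
            nn.LayerNorm(token_dim * num_tokens),
        )
        self.num_tokens = num_tokens
        self.token_dim = token_dim

    def forward(self, face_emb):                       # (B, face_dim)
        tokens = self.proj(face_emb)
        return tokens.view(-1, self.num_tokens, self.token_dim)


class EnhancedPortraitAttention(nn.Module):
    """Hypothetical EPA-style block: UNet features attend to the identity
    tokens concatenated with scene-context tokens."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden_states, id_tokens, context_tokens):
        kv = torch.cat([id_tokens, context_tokens], dim=1)
        out, _ = self.attn(self.norm(hidden_states), kv, kv)
        return out


class PlugAndPlayHEM(nn.Module):
    """Residual, plug-and-play wrapper: the frozen UNet block's features are
    left untouched and the module's output is added on top."""
    def __init__(self, dim=768):
        super().__init__()
        self.id_adapter = IdentityAdapter(token_dim=dim)
        self.epa = EnhancedPortraitAttention(dim=dim)
        self.scale = nn.Parameter(torch.zeros(1))      # identity mapping at init

    def forward(self, hidden_states, face_emb, context_tokens):
        id_tokens = self.id_adapter(face_emb)
        residual = self.epa(hidden_states, id_tokens, context_tokens)
        return hidden_states + self.scale * residual   # residual connection
```

Initializing the residual scale to zero is a common adapter-style choice: at the start of training the wrapped block behaves exactly like the pretrained inpainting model, and the identity and context signals are blended in gradually.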
Various applications of AddMe. Our method supports different types of location masks, text prompt control, and spatial control.
Our method supports the insertion of portraits at arbitrary locations in the target scene image.
Uncurated samples generated with text prompts. Our method maintains facial identity fidelity while demonstrating excellent text-editing capabilities, allowing free editing of clothing and poses or the introduction of other objects.
Driven by the proposed customized editing task, we create a large-scale, high-quality multimodal human face dataset with instance-level annotations, named AddMe-1.6M, which can be used for various customized portrait generation tasks. The AddMe-1.6M dataset comprises a total of 2.3 million face-position-text pairs distributed across 1.6 million scene images. These scene images consist of 1 million single-person and 600,000 multi-person scenes, providing ample data for training our model. Our benchmark comprises 500 target scene images and 20 reference ID images, which can be combined to generate 10,000 images for testing.
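As a rough illustration of the annotation layout and the benchmark combination described above, the following sketch uses hypothetical field names and a hypothetical helper; only the counts (500 scenes, 20 reference IDs, 10,000 test cases) come from the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class FacePositionTextPair:
    """Illustrative record for one face-position-text annotation
    (field names are assumptions, not the released schema)."""
    scene_image: str    # path to the scene image
    face_bbox: tuple    # (x0, y0, x1, y1) face location in the scene
    caption: str        # instance-level text description


def build_benchmark(scene_images, reference_ids):
    """Enumerate all scene/ID combinations for testing:
    500 target scenes x 20 reference IDs -> 10,000 test cases."""
    assert len(scene_images) == 500 and len(reference_ids) == 20
    return list(product(scene_images, reference_ids))

# Usage: pairs = build_benchmark(scenes, ids); len(pairs) == 10000
```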
Comparison with text-driven and exemplar-based methods on single portrait insertion. For text-driven methods, we use the name of the reference character in the prompt (such as Joe Biden). A comparison of experimental settings is shown at the bottom.
Visual ablation studies of various mask sizes in our method.
Uncurated samples of portrait variations created using different random seeds. While maintaining facial fidelity, our method exhibits outstanding generative diversity, paving the way for potential applications such as virtual try-on.