EP-4736030-A1 - CONTROLLABLE DIFFUSION MODEL BASED IMAGE GALLERY RECOMMENDATION SERVICE
Abstract
Aspects of the disclosure include methods and systems for leveraging a controllable diffusion model for dynamic image search in an image gallery recommendation service. An exemplary method can include displaying an image gallery having a plurality of gallery images and a dynamic image frame. The dynamic image frame can include a generated image and an interactive widget. The method can include receiving a user input in the interactive widget and generating, responsive to receiving the user input, an updated generated image by inputting, into a controllable diffusion model, the user input. The method can include replacing the generated image in the dynamic image frame with the updated generated image.
Inventors
- XUAN, Hong
- HUANG, LI
- LI, Huangxing
- CHEN, XI
Assignees
- Microsoft Technology Licensing, LLC
Dates
- Publication Date
- 20260506
- Application Date
- 20240530
Claims (20)
- 1. A method comprising: displaying an image gallery 214 comprising a plurality of gallery images 206, the image gallery 214 further comprising a dynamic image frame 216 comprising a generated image 108 and an interactive widget 218; receiving a user input 110 in the interactive widget 218; generating, responsive to receiving the user input 110, an updated generated image 108 by inputting, into a controllable diffusion model 102, the user input 110; and replacing the generated image 108 in the dynamic image frame 216 with the updated generated image 108.
- 2. The method of claim 1, further comprising receiving an image query in a field of the image gallery.
- 3. The method of claim 2, wherein the generated image is generated by inputting, into the controllable diffusion model, the image query.
- 4. The method of claim 2, wherein the plurality of gallery images and the generated image are selected according to a degree of matching to one or more features in the image query.
- 5. The method of claim 2, further comprising determining one or more constraints in the image query.
- 6. The method of claim 5, wherein the generated image is generated by inputting, into the controllable diffusion model, the one or more constraints.
- 7. The method of claim 5, wherein the one or more constraints in the image query comprise at least one of a pose skeleton and an object boundary.
- 8. The method of claim 7, wherein determining the one or more constraints comprises extracting the object boundary when a feature in the image query comprises one of a structure and a geological feature.
- 9. The method of claim 7, wherein determining the one or more constraints comprises extracting the pose skeleton when a feature in the image query comprises one of a person and an animal.
- 10. The method of claim 1, wherein the plurality of gallery images are sourced from an image database.
- 11. The method of claim 1, wherein the interactive widget comprises a text field, and receiving the user input in the interactive widget comprises receiving a text string input into the text field.
- 12. The method of claim 1, wherein the interactive widget comprises one or more of a dropdown menu, a checkbox, a slider, a color picker, a canvas interface for drawing or sketching, and a rating button.
- 13. The method of claim 1, wherein the interactive widget comprises a canvas for magic wand inputs, and receiving the user input in the interactive widget comprises receiving a magic wand input graphically selecting one of a specific feature and a specific region in the generated image.
- 14. The method of claim 13, wherein the interactive widget further comprises a text field, and receiving the user input in the interactive widget further comprises receiving a text string input having contextual information for the magic wand input.
- 15. A system 202 having a memory 1104, computer readable instructions, and one or more processors 1002 for executing the computer readable instructions, the computer readable instructions controlling the one or more processors 1002 to perform operations comprising: receiving, from a client device 204 communicatively coupled to the system 202, an image query 104; providing, to the client device 204, a plurality of gallery images 206 and a generated image 108 according to a degree of matching to one or more features in the image query 104; receiving, from the client device 204, a user input 110; generating, responsive to the user input 110, an updated generated image 108 by inputting, into a controllable diffusion model 102, the user input 110; and providing, to the client device 204, the updated generated image 108.
- 16. The system of claim 15, wherein the generated image is generated by inputting, into the controllable diffusion model, the image query.
- 17. The system of claim 15, further comprising determining one or more constraints in the image query.
- 18. The system of claim 17, wherein the generated image is generated by inputting, into the controllable diffusion model, the one or more constraints.
- 19. A system 204 having a memory 1104, computer readable instructions, and one or more processors 1002 for executing the computer readable instructions, the computer readable instructions controlling the one or more processors 1002 to perform operations comprising: receiving, from an image gallery recommendation service 202 communicatively coupled to the system 204, a plurality of gallery images 206 and a generated image 108; displaying an image gallery 214 comprising the plurality of gallery images 206 and a dynamic image frame 216 comprising the generated image 108 and an interactive widget 218; receiving a user input 110 in the interactive widget 218; transmitting the user input 110 to the image gallery recommendation service 202; receiving, from the image gallery recommendation service 202, an updated generated image 108; and replacing the generated image 108 in the dynamic image frame 216 with the updated generated image 108.
- 20. The system of claim 19, wherein the interactive widget comprises a text field, and receiving the user input in the interactive widget comprises receiving a text string input into the text field.
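Purely for illustration, the method of claim 1 can be sketched as a simple event loop: a dynamic image frame holds a generated image and widget state, and each user input triggers regeneration via the controllable diffusion model. All names below are hypothetical, and the diffusion model is stubbed; this is a minimal sketch of the claimed flow, not an implementation of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class DynamicImageFrame:
    """Dynamic image frame (216): holds the generated image (108) and widget state."""
    generated_image: str       # placeholder for image data
    widget_text: str = ""      # interactive widget (218), modeled here as a text field

def controllable_diffusion_model(user_input: str) -> str:
    """Stub for the controllable diffusion model (102); returns a placeholder image."""
    return f"image<{user_input}>"

def handle_user_input(frame: DynamicImageFrame, user_input: str) -> DynamicImageFrame:
    """Claim-1 flow: receive input, generate an updated image, replace the old one."""
    frame.widget_text = user_input
    updated = controllable_diffusion_model(user_input)  # generate updated image
    frame.generated_image = updated                     # replace in dynamic frame
    return frame

frame = DynamicImageFrame(generated_image="image<initial>")
frame = handle_user_input(frame, "red sunset")
print(frame.generated_image)  # image<red sunset>
```

In a deployed system the stub would be replaced by a conditioned diffusion pipeline, and the widget could equally be a slider, color picker, or canvas input per claims 12–14.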
Description
CONTROLLABLE DIFFUSION MODEL BASED IMAGE GALLERY RECOMMENDATION SERVICE

INTRODUCTION

[0001] The subject disclosure relates to image search and recommendation systems, and particularly to leveraging a controllable diffusion model for dynamic image search in an image gallery recommendation service.

[0002] Image gallery recommendation systems (also referred to as visual or image-based discovery systems) play an increasingly crucial role in modern applications across a number of different domains, including e-commerce, social media, and entertainment. The primary goal of an image gallery recommendation system is to predict and recommend one or more relevant images that a user is likely to find interesting or appealing. To achieve this, image gallery recommendation systems leverage a variety of techniques, such as collaborative filtering, content-based filtering, and deep learning, to provide personalized recommendations to users based on their characteristics, preferences, and behavior.

[0003] Collaborative filtering involves analyzing the prior behavior and preferences of similar users to make recommendations. By examining the historical data of users who have similar tastes and preferences to a given user, the image gallery recommendation system can more accurately identify images that are likely to be of interest to the user. Collaborative filtering can be item-based, user-based, or both, where the former focuses on similarities between items (images) and the latter focuses on similarities between users.

[0004] Content-based filtering refers to the analysis of the content and features of the images themselves to make recommendations. Image gallery recommendation systems can extract relevant information from the images, such as color, texture, shape, and other visual attributes, and can use this extracted information to find similar images (using feature similarity, distance measures, etc.). By recommending images that are visually similar to the ones a user has already shown interest in, content-based filtering aims to capture the user's preferences based on image characteristics.

[0005] Deep learning techniques, such as convolutional neural networks (CNNs), Variational Autoencoders (VAEs), and transformer networks, have revolutionized image recommendation systems. CNNs can learn intricate patterns and features (hierarchical representations) from images by processing them through multiple layers of interconnected neurons. By training on large datasets, these networks can capture complex relationships (local and global image features) and make accurate predictions about user preferences based on image content.

[0006] VAEs are generative models that can learn a compact representation (latent space) of input data. In the context of image recommendation, VAEs can learn a low-dimensional representation of images that captures the underlying structure and variations in the dataset. By leveraging this latent space, VAEs can generate new, diverse images that align with user preferences, enhancing the recommendation capabilities of an image gallery recommendation service.

[0007] Transformer networks were originally designed for natural language processing tasks but have been found to excel in a range of other applications, such as computer vision, including image recommendation. Transformers model long-range dependencies and capture contextual information in data. In image gallery recommendation systems, transformer networks can be utilized to learn complex contextual relationships between images and to generate more accurate recommendations based on this contextual information.

[0008] Image gallery recommendation systems can also rely on user behavior data (user interactions) to enhance user satisfaction, engagement, and the overall user experience. In terms of user interaction, image gallery recommendation systems can offer several ways for users to engage with the system. For example, in implementations where a user(s) interacts with the image gallery recommendation system through a user interface, such as a mobile app or website, the user can be presented with an initial curated set of images. The user can then interact with the system by viewing images (e.g., scrolling through a collection of recommended images), liking/disliking images, saving images, sharing images (e.g., via a coupled social media platform), and/or otherwise interacting positively or negatively with one or more images in the gallery. These user interactions can be used as feedback to the system to better understand the user's tastes and to refine future recommendations.

[0009] It is important to note that while user interactions can play a significant role in training and refining image gallery recommendation systems, these interactions are somewhat limited; notably, users do not have direct control over the underlying algorithms and model parameters of an image gallery recommendation system. While a system can learn from aggregated user data
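The content-based filtering described in paragraph [0004] (extracting visual attributes and ranking by feature similarity, cf. the "degree of matching" in claim 4) can be illustrated with a minimal sketch. The toy feature vectors and the choice of cosine similarity are assumptions for illustration only, not part of the disclosure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_features, gallery):
    """Rank gallery images by degree of matching to the query's features.
    `gallery` maps image ids to precomputed feature vectors."""
    return sorted(
        gallery,
        key=lambda img: cosine_similarity(query_features, gallery[img]),
        reverse=True,
    )

# Toy features: (colorfulness, edge density, brightness) -- illustrative only
gallery = {
    "beach.png":  [0.9, 0.2, 0.8],
    "forest.png": [0.6, 0.7, 0.4],
    "city.png":   [0.3, 0.9, 0.5],
}
query = [0.85, 0.25, 0.75]  # features extracted from the image query
print(rank_by_similarity(query, gallery))  # ['beach.png', 'forest.png', 'city.png']
```

In practice the feature vectors would come from a learned image encoder (e.g., a CNN per paragraph [0005]) rather than hand-picked attributes, and the top-ranked results would populate the gallery images 206.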