
CN-122023548-A - Customized image generation method, system, medium, terminal and program product based on large model fusion

CN 122023548 A

Abstract

The application provides a customized image generation method, system, medium, terminal and program product based on large model fusion. The method comprises: fusing acquired sample image data and text data to obtain a cross-modal fusion vector; inputting the cross-modal fusion vector into a Stable Diffusion model and a CycleGAN model for encoding to obtain a quality vector and a style vector respectively; fusing the quality vector and the style vector based on an attention mechanism and inputting the fused vectors into a generative adversarial network model for training to obtain a customized image generation model; and deploying the customized image generation model, which converts the text data into corresponding image data according to the image style of the sample images and outputs the customized image. The method meets users' personalized and deeply customized requirements for image creation and provides a more intelligent, flexible and creative image generation solution.

Inventors

  • LU WENYUAN
  • ZENG CHUANMING

Assignees

  • 上海企创信息科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2024-11-01

Claims (10)

  1. A customized image generation method based on large model fusion, characterized by comprising the following steps: fusing the acquired sample image data and text data to obtain a cross-modal fusion vector; inputting the cross-modal fusion vector into a Stable Diffusion model and a CycleGAN model respectively for encoding, so as to obtain a quality vector and a style vector respectively; fusing the quality vector and the style vector based on an attention mechanism, and inputting the fused quality vector and style vector into a generative adversarial network model for training, so as to obtain a customized image generation model; and the customized image generation model converts the text data into corresponding image data according to the image style of the sample image and then outputs the customized image.
  2. The customized image generation method based on large model fusion according to claim 1, wherein fusing the acquired sample image data and text data to obtain the cross-modal fusion vector comprises: acquiring sample image data and text data respectively; extracting features of the sample image data by using a graph neural network model to obtain image features; extracting features of the text data by using a Transformer model to obtain text semantic features; and fusing the image features and the text semantic features based on the Transformer model to obtain a cross-modal fusion vector.
  3. The method for generating a customized image based on large model fusion according to claim 1, further comprising fine-tuning the generated customized image based on user interaction behavior characteristics acquired in real time.
  4. The method for generating a customized image based on large model fusion according to claim 3, wherein fine-tuning the generated customized image according to the user interaction behavior characteristics acquired in real time comprises: preprocessing the user interaction behavior characteristics acquired in real time to convert them into a data format suitable for the customized image generation model; and inputting the format-converted user interaction behavior data into the customized image generation model, and adjusting the customized image generated by the customized image generation model.
  5. The method for generating a customized image based on large model fusion according to claim 4, wherein the preprocessing comprises data deduplication, missing value processing, outlier processing and standardization.
  6. The method for generating a customized image based on large model fusion according to claim 1, further comprising collecting user feedback data on the generated customized image and inputting the data into the generative adversarial network model to optimize the customized image generation model.
  7. A customized image generation system based on large model fusion, comprising: a cross-modal fusion module for fusing the acquired sample image data and text data to obtain a cross-modal fusion vector; a multi-model collaborative module for respectively inputting the cross-modal fusion vector into a Stable Diffusion model and a CycleGAN model for encoding, so as to obtain a quality vector and a style vector respectively; and a customized image generation module for deploying the customized image generation model, the customized image generation model converting the text data into corresponding image data according to the image style of the sample image and then outputting the customized image.
  8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
  9. A computer program product comprising computer program code means for causing a computer to carry out the method according to any one of claims 1 to 6 when said computer program code means are run on the computer.
  10. An electronic terminal comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method of any one of claims 1 to 6.
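
For concreteness, the attention-based fusion of the quality vector and the style vector and the subsequent adversarial training recited in claim 1 can be pictured with the minimal PyTorch sketch below. The vector dimensions, the simple fully connected generator and discriminator, and the single-step training loop are assumptions made for illustration only; the claims do not specify any of these details.

```python
# Minimal, illustrative sketch of the attention fusion + adversarial training
# step of claim 1. All dimensions and the toy generator/discriminator are
# assumptions; the application does not specify concrete architectures.
import torch
import torch.nn as nn

DIM = 256  # assumed size of the quality and style vectors

class AttentionFusion(nn.Module):
    """Fuse a quality vector and a style vector with multi-head attention."""
    def __init__(self, dim=DIM, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, quality, style):
        # Treat the two vectors as a length-2 token sequence and let
        # self-attention weight their contributions before pooling.
        tokens = torch.stack([quality, style], dim=1)   # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)    # (B, 2, dim)
        return self.proj(fused.mean(dim=1))             # (B, dim)

# Toy generator/discriminator standing in for the adversarial network model.
generator = nn.Sequential(nn.Linear(DIM, 512), nn.ReLU(),
                          nn.Linear(512, 3 * 64 * 64), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(3 * 64 * 64, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))

fusion = AttentionFusion()
opt_g = torch.optim.Adam(list(fusion.parameters()) + list(generator.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(quality_vec, style_vec, real_images):
    """One adversarial update on a batch of fused conditioning vectors."""
    fused = fusion(quality_vec, style_vec)
    fake = generator(fused)

    # Discriminator update: real images vs. generated images.
    d_loss = (bce(discriminator(real_images), torch.ones(real_images.size(0), 1))
              + bce(discriminator(fake.detach()), torch.zeros(fake.size(0), 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator (and fusion) update: try to fool the discriminator.
    g_loss = bce(discriminator(fake), torch.ones(fake.size(0), 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Example call with random tensors (batch of 8, flattened 64x64 RGB images).
q, s = torch.randn(8, DIM), torch.randn(8, DIM)
real = torch.rand(8, 3 * 64 * 64) * 2 - 1
print(train_step(q, s, real))
```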

Description

Customized image generation method, system, medium, terminal and program product based on large model fusion

Technical Field

The application relates to the technical field of artificial intelligence, and more particularly to a method, system, medium, terminal and program product for generating customized images based on large model fusion.

Background

In the current field of image generation technology, significant progress has been made, yet most techniques remain limited to single-modality input, e.g., accepting only images or only text. This single-modality approach struggles to capture the overall needs of the user, so the generated image often fails to meet the user's requirements. Meanwhile, existing image generation technology has limitations in handling complex scenes, keeping details consistent and generating creative content. Accordingly, there is a need for a method, system, medium, terminal and program product for generating customized images based on large model fusion that solve the above problems in the prior art.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present application is to provide a method, system, medium, terminal and program product for generating customized images based on large model fusion, which are used for solving the technical problem that the prior art has difficulty capturing the overall demands of users.

To achieve the above and other related objects, a first aspect of the present application provides a method for generating a customized image based on large model fusion, including: fusing the acquired sample image data and text data to obtain a cross-modal fusion vector; inputting the cross-modal fusion vector into a Stable Diffusion model and a CycleGAN model respectively for encoding, so as to obtain a quality vector and a style vector respectively; fusing the quality vector and the style vector based on an attention mechanism, and inputting the fused quality vector and style vector into a generative adversarial network model for training, so as to obtain a customized image generation model; and the customized image generation model converts the text data into corresponding image data according to the image style of the sample image and then outputs the customized image.

In some embodiments of the first aspect of the present application, fusing the acquired sample image data and text data to obtain a cross-modal fusion vector includes acquiring sample image data and text data respectively, performing feature extraction on the sample image data using a graph neural network model to obtain image features, performing feature extraction on the text data using a Transformer model to obtain text semantic features, and fusing the image features and the text semantic features based on a cross-modal attention mechanism of the Transformer model to obtain a cross-modal fusion vector.
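
The cross-modal fusion embodiment above (a graph neural network for image features, a Transformer for text semantics, cross-modal attention for fusion) could be prototyped roughly as in the following sketch. The patch-grid graph, all dimensions, the vocabulary size and the mean-pooled fusion vector are assumptions made for this example only and are not taken from the application.

```python
# Rough sketch of the cross-modal fusion embodiment: graph-convolution over
# image patches, a Transformer encoder over text tokens, and cross-modal
# attention producing a single fusion vector. Sizes are illustrative.
import torch
import torch.nn as nn

DIM, PATCH, VOCAB = 128, 16, 10000  # assumed embedding size, patch size, vocab

def grid_adjacency(h, w):
    """Row-normalised adjacency (with self-loops) of an h x w patch grid."""
    n = h * w
    adj = torch.eye(n)
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((1, 0), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    j = rr * w + cc
                    adj[i, j] = adj[j, i] = 1.0
    return adj / adj.sum(dim=1, keepdim=True)

class GraphImageEncoder(nn.Module):
    """Two graph-convolution layers over patch embeddings of the sample image."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.patch_embed = nn.Linear(3 * PATCH * PATCH, dim)
        self.gc1, self.gc2 = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, images):                        # images: (B, 3, H, W)
        b, _, h, w = images.shape
        patches = images.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 3 * PATCH * PATCH)
        x = self.patch_embed(patches)                 # (B, N, dim) patch nodes
        adj = grid_adjacency(h // PATCH, w // PATCH).to(x.device)
        x = torch.relu(adj @ self.gc1(x))             # neighbourhood aggregation
        x = torch.relu(adj @ self.gc2(x))
        return x                                      # (B, N, dim) image tokens

class CrossModalFusion(nn.Module):
    """Text Transformer encoder plus cross-attention onto the image tokens."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.text_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.cross = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, image_tokens, token_ids):       # token_ids: (B, L)
        text = self.text_enc(self.tok(token_ids))     # (B, L, dim) text features
        fused, _ = self.cross(text, image_tokens, image_tokens)
        return fused.mean(dim=1)                      # (B, dim) fusion vector

# Usage with random data, assuming 64x64 RGB sample images and short prompts.
imgs = torch.rand(2, 3, 64, 64)
ids = torch.randint(0, VOCAB, (2, 12))
vec = CrossModalFusion()(GraphImageEncoder()(imgs), ids)
print(vec.shape)  # torch.Size([2, 128])
```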
In some embodiments of the first aspect of the present application, the method further comprises fine-tuning the generated customized image according to user interaction behavior features acquired in real time.

In some embodiments of the first aspect of the present application, fine-tuning the generated customized image according to the user interaction behavior features acquired in real time includes preprocessing the user interaction behavior features acquired in real time to convert them into a data format suitable for the customized image generation model, inputting the format-converted user interaction behavior data into the customized image generation model, and adjusting the customized image generated by the customized image generation model.

In some embodiments of the first aspect of the present application, the preprocessing includes data deduplication, missing value processing, outlier processing and standardization.

In some embodiments of the first aspect of the present application, the method further comprises collecting user feedback data on the generated customized image and inputting the data into the generative adversarial network model to optimize the customized image generation model.

To achieve the above and other related objects, a second aspect of the present application provides a customized image generation system based on large model fusion, including: a cross-modal fusion module for fusing the acquired sample image data and text data to obtain a cross-modal fusion vector; a multi-model collaborative module for respectively inputting the cross-modal fusion vector into a Stable Diffusion model and a CycleGAN model for encoding, so as to obtain a quality vector and a style vector respectively; and a customized image generation module for deploying the customized image generation model, the customized image generation model converting the text data into corresponding image data according to the image style of the sample image and then outputting the customized image.
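
The preprocessing named above (data deduplication, missing value processing, outlier processing and standardization) admits a straightforward sketch with pandas. The column names, the median fill, the 3-sigma clipping threshold and the example data are assumptions, since the application does not fix a concrete schema for the user interaction behavior features.

```python
# Rough sketch of the preprocessing pipeline for user interaction features:
# de-duplication, missing-value handling, outlier handling, standardisation.
# Column names and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd

def preprocess_interactions(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    df = df.drop_duplicates().copy()                  # data de-duplication

    # Missing-value handling: fill numeric gaps with the column median.
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())

    # Outlier handling: clip each feature to 3 standard deviations of its mean.
    mean, std = df[feature_cols].mean(), df[feature_cols].std()
    df[feature_cols] = df[feature_cols].clip(mean - 3 * std, mean + 3 * std, axis=1)

    # Standardisation: zero mean, unit variance per feature.
    df[feature_cols] = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
    return df

# Example with made-up interaction features (dwell time, clicks, zoom level).
raw = pd.DataFrame({
    "dwell_time": [1.2, 1.2, np.nan, 50.0, 2.3],
    "clicks":     [3.0, 3.0, 1.0,    2.0,  4.0],
    "zoom":       [1.0, 1.0, 1.5,    1.1,  0.9],
})
print(preprocess_interactions(raw, ["dwell_time", "clicks", "zoom"]))
```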