CN-121982458-A - Multimodal fusion fashion image editing method based on improved MGD
Abstract
The invention discloses a multimodal fusion fashion image editing method based on an improved MGD, and relates to the technical field of deep learning and multimodal learning. To suit the characteristics of multimodal fusion fashion image editing, the MGD model is improved in two respects: a Vision Transformer model is added to extract image features, and a Local Flow Global Parsing method is introduced to improve the accuracy of fashion image editing. The method comprises the following key steps: first, obtaining fashion images and corresponding text descriptions related to fashion image editing; next, preprocessing the fashion image and text data, including standardization and feature extraction; then matching the fashion images with the text descriptions through contrastive learning so as to optimize the representation capacity of the model; then collecting human body pose point information and generating corresponding feature maps; then combining the features of the fashion images, the text descriptions and the human body pose points through a multimodal fusion method; and finally evaluating and validating the trained model to realize multimodal fusion fashion image editing. In summary, the invention can provide reliable support for fashion image editing applications.
Inventors
- ZOU YANG
- LIAO SHIYU
Assignees
- Hohai University (河海大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-11-05
Claims (6)
- 1. A multimodal fusion fashion image editing method based on an improved MGD (Multimodal Garment Designer) enables the original MGD model to edit fashion images more accurately by improving its network structure, parameter settings and algorithms. The method is characterized by comprising the following key steps: image and text data acquisition; data preprocessing and data enhancement; text-image data pair construction; feature extraction model construction; text-image feature splicing; human body pose point feature extraction architecture construction; fashion image editing model training; and model evaluation and tuning.
- 2. The multimodal fusion fashion image editing method based on the improved MGD of claim 1, wherein image and text data collection is performed by gathering publicly available fashion image datasets and the corresponding text description data from the internet.
- 3. The multimodal fusion fashion image editing method based on the improved MGD of claim 1, wherein data preprocessing and data enhancement are performed on the collected fashion images and text data. The key steps are as follows (an illustrative preprocessing sketch follows the claims): 3.1 Filtering the large volume of collected fashion images and text data, retaining the fashion images and text that best meet the requirements of multimodal fusion fashion image editing. 3.2 Uniformly resizing the retained fashion images to a fixed size and scaling pixel values to the [0, 1] range through normalization. 3.3 Performing word segmentation on the text data, splitting sentences into word or sub-word units and converting them into corresponding embedding vectors. 3.4 Pairing the preprocessed text descriptions with the fashion image data to construct a text-fashion image dataset.
- 4. The multimodal fusion fashion image editing method based on the improved MGD according to claim 1, wherein the feature extraction encoders of the fashion image and text channels are improved on the basis of the MGD model architecture, so that features are extracted from the fashion image and text input data and the edited fashion image is generated. The key steps are as follows (a contrastive-alignment sketch follows the claims): 4.1 For the fashion image channel, a Vision Transformer (ViT) model is used as the image encoder architecture instead of a traditional convolutional neural network, and the fashion image features are mapped through a fully connected layer to embedding vectors of fixed dimension, so that contrastive learning can be performed in the same embedding space as the output of the text channel. 4.2 For the text channel, standardized text metadata is used as input, a CLIP model is used as the text encoder architecture, a pooling operation is applied to the Transformer output to generate a fixed-length text vector representation, and the pooled text vector is mapped through a fully connected layer into the same vector space as the fashion image embeddings, ensuring embedding alignment across modalities. 4.3 The fashion image and text data are processed by the image encoder and the text encoder respectively to obtain embedded representations of the fashion image and the text, and the similarity between modalities is then learned and optimized by computing a similarity matrix and a symmetric cross-entropy loss. 4.4 A latent diffusion model is adopted as the image-generation framework in place of a pixel-space diffusion model: diffusion and denoising are performed in a low-dimensional latent space, an image is generated from random noise through a gradual denoising process, and in each denoising step the CLIP similarity between the currently generated image and the target text description is computed and used as an additional loss function to guide generation, with the CLIP model mapping the text description into the image embedding space as the conditional input of the generative model. 4.5 A multimodal joint feature representation is obtained by concatenating the feature representations output by the text channel and the fashion image channel.
- 5. The multimodal fusion fashion image editing method based on the improved MGD of claim 1, wherein human body pose point information is collected from the fashion image and a human body pose point feature map is generated, finally realizing pose control of the generated image. The key steps are as follows (a pose-map sketch follows the claims): 5.1 Extracting human pose point information: human key point information is extracted from the input fashion image using the human pose detection algorithm OpenPose, which outputs the two-dimensional coordinates of the key points. 5.2 Generating a human pose feature map: the extracted key point information is converted into a feature map, i.e., the connections between key points are represented by lines to produce an image containing the connection relations. 5.3 The preprocessed feature map is passed to a ControlNet model as a conditional input, realizing pose control of the generated image, which is combined with the text-fashion image features. 5.4 A Local Flow Global Parsing (LFGP) method is adopted: clothing parts are deformed separately through local flows, and the local deformation results are combined using global garment parsing, so as to obtain more accurate fashion images.
- 6. The multimodal fusion fashion image editing method based on the improved MGD of claim 1, wherein training of the improved MGD model is developed mainly around contrastive learning: by maximizing the semantic alignment between the fashion image and text modalities, the model pulls associated fashion images and texts closer together in the shared embedding space while pushing uncorrelated ones apart. The key steps are as follows (a data-split sketch follows the claims): 6.1 The text-fashion image dataset is divided into a training set, a validation set and a test set according to a certain proportion, wherein the training set is used to train the model, the validation set is used to evaluate the training effect and prevent overfitting, and the test set is used to check the generalization ability of the model. 6.2 The model parameters are initialized, the training and validation sets are input, and the model parameters are trained and optimized.
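As a minimal illustration of the preprocessing steps 3.2-3.4 of claim 3 (fixed-size resizing, pixel normalization to [0, 1], and sub-word tokenization), the following Python sketch uses torchvision transforms and the Hugging Face CLIP tokenizer; the 224x224 target size, the checkpoint name and the file path are assumptions for illustration, not values fixed by the claims.

```python
# Hypothetical preprocessing sketch for claim 3 (steps 3.2-3.4).
# Paths, sizes and the tokenizer checkpoint are illustrative assumptions.
from PIL import Image
from torchvision import transforms
from transformers import CLIPTokenizer

# 3.2: resize every retained fashion image to a fixed size; ToTensor
# scales pixel values into the [0, 1] range.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),   # assumed fixed size
    transforms.ToTensor(),
])

# 3.3: split sentences into sub-word units; the embedding itself happens
# later inside the text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def build_pair(image_path: str, caption: str):
    """3.4: pair one preprocessed image with its tokenized description."""
    pixel_tensor = image_transform(Image.open(image_path).convert("RGB"))
    token_ids = tokenizer(caption, padding="max_length", truncation=True,
                          return_tensors="pt")
    return pixel_tensor, token_ids

pair = build_pair("fashion_001.jpg", "a red sleeveless summer dress")  # illustrative
```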
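A minimal sketch of the similarity matrix and symmetric cross-entropy loss of claim 4.3, in the style of CLIP. The embeddings would come from the ViT image encoder (4.1) and the CLIP-based text encoder (4.2); here they are random placeholders, and the batch size, embedding dimension and temperature are assumed values.

```python
# Hypothetical contrastive-alignment sketch for claim 4.3.
import torch
import torch.nn.functional as F

batch, dim = 8, 512                      # assumed sizes
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in ViT output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in CLIP text output
temperature = 0.07                       # assumed, as commonly used with CLIP

# Similarity matrix between every image and every text in the batch;
# matching pairs lie on the diagonal.
logits = image_emb @ text_emb.t() / temperature
targets = torch.arange(batch)

# Symmetric cross-entropy: image-to-text and text-to-image directions.
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```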
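The following sketch illustrates step 5.2 of claim 5: converting 2-D key points, as produced by an OpenPose-style detector in step 5.1, into a feature map in which lines represent the connections between key points. The key point coordinates, the skeleton topology subset and the canvas size are all illustrative assumptions.

```python
# Hypothetical pose-feature-map sketch for claim 5.2.
import numpy as np
import cv2

# Illustrative key points: (x, y) pixel coordinates; a real detector
# returns one such set per person in the image.
keypoints = {
    "neck": (128, 60), "r_shoulder": (100, 70), "l_shoulder": (156, 70),
    "r_hip": (110, 160), "l_hip": (146, 160),
}
# Assumed skeleton topology (a subset of the OpenPose limb list).
limbs = [("neck", "r_shoulder"), ("neck", "l_shoulder"),
         ("r_shoulder", "r_hip"), ("l_shoulder", "l_hip")]

pose_map = np.zeros((256, 256, 3), dtype=np.uint8)   # assumed canvas size
for a, b in limbs:
    cv2.line(pose_map, keypoints[a], keypoints[b], (255, 255, 255), 3)
for x, y in keypoints.values():
    cv2.circle(pose_map, (x, y), 4, (0, 255, 0), -1)

cv2.imwrite("pose_map.png", pose_map)  # fed to ControlNet as the condition (5.3)
```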
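For step 6.1 of claim 6, a minimal data-split sketch; the claim only says "according to a certain proportion", so the 8:1:1 ratio, the seed and the placeholder pairs below are assumptions.

```python
# Hypothetical data-split sketch for claim 6.1; ratio 8:1:1 is assumed.
import random

pairs = [(f"img_{i:04d}.jpg", f"caption {i}") for i in range(1000)]  # placeholders
random.seed(0)
random.shuffle(pairs)

n = len(pairs)
train_set = pairs[: int(0.8 * n)]             # fits the model
val_set = pairs[int(0.8 * n): int(0.9 * n)]   # monitors overfitting
test_set = pairs[int(0.9 * n):]               # checks generalization
print(len(train_set), len(val_set), len(test_set))
```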
Description
Multimodal fusion fashion image editing method based on improved MGD

Technical Field

The invention relates to the technical field of deep learning and multimodal learning, in particular to a multimodal fusion fashion image editing method based on an improved MGD.

Background

Today, as the economy and society develop rapidly, the aesthetic standards of the public are steadily rising and the demand for fashion is growing quickly. Fashion is an organic expression of culture and art; it concentrates the life pursuits of modern people and the trends of thought of modern society, and has developed into an important industry on a global scale. Fashion encompasses clothing, accessories, shoes and bags, make-up and the like, among which clothing is the core. As a necessity of life, clothing has, with the development of the times, become a fashion item for showing individuality and improving personal image, and fashion clothing (abbreviated as "fashion") has become one of the main objects of people's fashion consumption. In recent years, the rapid progress of information technology and the increasing popularity of the internet have accelerated the digitization and intelligentization of the fashion industry. Researchers in computer vision and multimedia have turned their attention to this area and carried out a series of cutting-edge explorations. Thanks to deep learning, neural networks and related technologies, researchers have achieved a series of results around fashion, especially fashion clothing, in tasks such as clothing classification, clothing retrieval, clothing detection, clothing image segmentation and virtual try-on, promoting the development and progress of the digital fashion field. At present, the notable trends in fashion consumption are individualization, diversification and differentiation: more and more consumers combine their own preferences and expressions of individuality to attempt personalized fashion customization, and "private customization" has developed from high-end consumption into a fashion choice of the masses. Thanks to advances in internet and human-computer interaction technology, consumers can participate in personalized fashion design easily and conveniently. The difficulty of fashion customization lies in how a consumer can clearly and accurately express the needs and ideas of a garment design. In a typical user-driven fashion customization service, a clothing customization merchant first provides the consumer with a photograph of a garment of a certain style as a prototype template for reference; the consumer then designs the garment style around the prototype by drawing a design sketch, editing the photograph and the like, and finally shows the design result to the merchant, who manufactures the physical garment. However, fashion design is a highly specialized task, and it is difficult for most consumers without training in clothing design to express their needs on the basis of a prototype photograph of a garment, or to design and edit garments that satisfy their own preferences. In addition, during the design process, consumers cannot view the physical effect of the designed and modified garments in real time.
Therefore, how to provide users with a more convenient and natural fashion image editing (fashion editing) method in the personalized fashion customization process, and to meet user-centered clothing design requirements, has become a new research hotspot in the digital fashion field.

The CLIP (Contrastive Language-Image Pre-training) model is a multimodal contrastive learning model that aims to achieve semantic alignment of images and text through pre-training. The model consists of an image encoder and a text encoder, which extract the features of images and text respectively and generate high-dimensional embedding vectors. During training, CLIP uses a contrastive learning strategy to maximize the similarity of positive image-text pairs and minimize the similarity of negative pairs, so that semantically matched images and texts are embedded closer together in the shared vector space while unmatched ones are embedded farther apart. The model typically employs the InfoNCE loss to optimize the alignment between images and text, giving CLIP strong multimodal alignment capability in image-text retrieval, image classification and other tasks.

The diffusion model is a powerful generative model that generates images matching a text description from random noise through step-by-step denoising; it supports the generation of multimodal data, can produce images of high quality and rich detail, and achieves high semantic consistency.

Local Flow Global Parsing (LFGP) is an innovative deformation module that differs from traditional global deformation mechanisms: when facing complex inputs, such as complex human body poses and garments, it deforms individual clothing parts through local flows and combines the local deformation results using global garment parsing, thereby obtaining more accurate results.
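The CLIP-guided denoising described in claim 4.4 and in the diffusion paragraph above can be illustrated only schematically here: the sketch below uses small stand-in linear modules in place of the real latent-diffusion denoiser and CLIP embedders, so only the guidance pattern (compute CLIP similarity at each step, use its gradient as an additional steering signal) is shown, not the invention's actual implementation. The step count and guidance scale are assumed. The InfoNCE-style alignment loss mentioned above is illustrated separately in the contrastive-alignment sketch following the claims.

```python
# Schematic, heavily simplified sketch of CLIP-similarity-guided denoising.
# All modules are random stand-ins, not pretrained models.
import torch

torch.manual_seed(0)
denoiser = torch.nn.Linear(64, 64)         # stand-in for the latent-diffusion U-Net
clip_image_head = torch.nn.Linear(64, 32)  # stand-in CLIP image embedder
text_emb = torch.nn.functional.normalize(torch.randn(32), dim=-1)  # CLIP text embedding

z = torch.randn(64)                        # start from random noise in latent space
steps, guidance_scale = 10, 0.1            # assumed schedule

for _ in range(steps):
    z = z.detach().requires_grad_(True)
    z_denoised = denoiser(z)               # one gradual denoising step
    # CLIP similarity between the current latent's image embedding and the text.
    img_emb = torch.nn.functional.normalize(clip_image_head(z_denoised), dim=-1)
    similarity = img_emb @ text_emb
    # Treat the similarity as an additional loss term steering generation
    # toward the target text description.
    grad = torch.autograd.grad(similarity, z)[0]
    z = z_denoised.detach() + guidance_scale * grad
```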