CN-121982144-A - Clothing accessory local replacement method, medium and electronic equipment

Abstract

The invention provides a local replacement method for clothing accessories, a medium and electronic equipment, belonging to the technical field of computers and networks. The method uses the SDXL-inpainting and SAM models to construct weakly supervised triplet data, thereby strengthening spatial positioning and boundary learning; a PositionIC synthesis strategy is introduced to expand the paired data into consistency triplet data, improving scale robustness and position-signal perception. During training, MOE-C-LoRA realizes differentiated learning of multi-modal features through expert branches and a router mechanism, and combines conditional multi-modal DiT attention with an attention condition position mask to achieve feature decoupling and accurate fusion. Training adopts a two-stage joint loss: the first stage emphasizes denoising, background preservation and expert load balancing, ensuring position accuracy and boundary smoothness; the second stage emphasizes consistency and perceptual constraints, improving the appearance and visual realism of the accessory. The method can be applied to scenarios such as virtual fitting, e-commerce display, digital human generation and garment design, and has high application value and potential for wide adoption.

Inventors

MA YINGHUI
CHEN XIN

Assignees

青岛三态比特科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260104

Claims (10)

1. A local replacement method for clothing accessories, characterized by comprising the following steps:
S1, acquiring paired clothing model images with accessories and clothing model images without paired accessories from a database, and generating text descriptions of the screened images with a large language model to serve as training data;
S2, a user manually uploads to the system a clothing model image in which a clothing accessory is to be locally replaced, the accessory image to be used for the replacement and a corresponding text description, which are preprocessed and then input into and stored in the system to facilitate subsequent operations;
S3, locating the corresponding accessory positions in the clothing model images without paired accessories by using a pre-trained SAM model and generating accessory condition mask images; at the same time, based on the mask images, performing slight diffusion redrawing of the accessory region with a pre-trained SDXL-inpainting model to remove identifiable accessory traces, the result being recorded as a clothing model image with blurred accessory boundaries; forming weakly supervised triplet data from this image together with the full-mask accessory condition mask image and the original clothing model image with the accessory; in addition, using the PositionIC synthesis strategy, randomly placing single accessory samples at different boundaries or angles in the clothing model images with accessories to synthesize training data pairs, thereby generating "synthetic paired" data carrying position signals to expand the small amount of paired data; applying the SAM model to the model images to generate accessory position mask images; and recording the clothing model images with accessories combined with their accessory mask images as consistency triplets, so as to increase position diversity and scale robustness;
S4, performing secondary training based on the Flux diffusion Transformer architecture with a two-stage mixture-of-experts conditional LoRA fine-tuning strategy, wherein the first stage uses the weakly supervised triplet data so that the model learns the spatial position information and boundary fusion information of accessories, and the second stage uses the consistency triplet data so that the model learns to generate accessory results with consistent appearance in a designated region; that is, the model first learns to localize the target from unpaired data and then learns target consistency from a small amount of paired data;
S5, during training, the mixture-of-experts conditional LoRA fine-tuning strategy abandons the conventional approach of attaching low-rank matrices to one specific layer for training, and introduces a conditional multi-modal DiT attention method and an attention condition position mask mechanism into the low-rank matrix training of all experts, wherein the conditional multi-modal DiT attention method lets each conditional branch keep a dedicated receptive field, so as to avoid mutual interference among multiple conditions and to scale effectively between multi-condition and single-condition input, and the attention condition position mask mechanism lets the model separate appearance texture information from geometric position information more clearly in feature space, so as to realize accurate local replacement, reduce erroneous coverage and visual artifacts, and more stably reuse the same accessory at different positions or share appearance information among different accessories;
S6, performing parameter training and back-propagation for the two-stage mixture-of-experts conditional LoRA fine-tuning strategy with a joint loss function, wherein the losses comprise a diffusion denoising loss, a consistency loss, a background preservation loss, an expert load balancing loss and a pixel perception loss, which are combined differently for each of the two training stages to optimize the whole diffusion process;
S7, loading the model weights and network structures of the multi-condition mixture-of-experts LoRA and the DiT trained in S6, fusing the expert LoRA with the DiT base model, acquiring from the system the clothing model image, the accessory image and the corresponding text description input by the user for local replacement, painting a mask on the clothing model image to define the spatial region to be replaced by the accessory image, and inputting these into the loaded DiT model;
S8, during inference, feeding the clothing model image, the accessory image and the painted mask image input by the user into a VAE module to extract main image features, accessory image features and position coding features respectively; extracting features of the text description input by the user through a T5 text encoder module; applying unified position coding to the extracted main image features, accessory image features and position coding features; adding the coded accessory image features and position coding features to form the final condition features; inputting all features together into the DiT architecture and gradually estimating the denoised latent-space features of the clothing model image; and outputting the corresponding clothing model image with the replaced accessory by VAE decoding.
2. The method for locally replacing clothing accessories according to claim 1, wherein the specific operation steps of acquiring the clothing image data in step S1 are:
S11, selecting from a clothing database the images whose category labels are "accessory" and "model", matching the images by image ID, assigning successfully matched image combinations to the paired clothing model images with accessories, and assigning model images with accessories that cannot be matched to the clothing model images without paired accessories;
S12, storing the paired clothing model images with accessories and the clothing model images without paired accessories separately, to facilitate the subsequent data preprocessing step.
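
The ID-based matching of claim 2 can be illustrated with a minimal sketch. The record layout (an `(image_id, category)` tuple) and the function name `split_by_pairing` are hypothetical, introduced only for illustration; the claim does not specify how the database exposes its labels.

```python
from collections import defaultdict


def split_by_pairing(records):
    """Group images by ID and separate paired from unpaired model images.

    `records` is an iterable of (image_id, category) tuples, where category
    is "accessory" or "model" (an assumed layout, not taken from the patent).
    Returns (paired_ids, unpaired_model_ids), each sorted.
    """
    categories_by_id = defaultdict(set)
    for image_id, category in records:
        categories_by_id[image_id].add(category)

    # An ID is "paired" when both an accessory image and a model image exist.
    paired = sorted(
        image_id
        for image_id, cats in categories_by_id.items()
        if {"accessory", "model"} <= cats
    )
    # A model image with no matching accessory image is "unpaired".
    unpaired = sorted(
        image_id
        for image_id, cats in categories_by_id.items()
        if "model" in cats and "accessory" not in cats
    )
    return paired, unpaired


if __name__ == "__main__":
    demo = [("a", "model"), ("a", "accessory"), ("b", "model"), ("c", "accessory")]
    print(split_by_pairing(demo))
```

Accessory images without a matching model image (ID `c` in the demo) fall into neither group here, since the claim only describes what happens to model images.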
3. The method for local replacement of a clothing accessory according to claim 1, wherein the method for preprocessing the data set in step S3 is as follows:
S31, selecting a clothing model image without paired accessories, locating the spatial position of the accessory on the model image with a pre-trained SAM model, and generating an accessory condition mask image;
S32, using the accessory condition mask image to guide the SDXL-inpainting model in performing slight local diffusion redrawing of the accessory region of the clothing model image so as to remove identifiable accessory traces, and recording the output as a clothing model image with blurred accessory boundaries;
S33, combining the original clothing model image with the accessory (before SDXL-inpainting processing), the generated clothing model image with blurred accessory boundaries and the full-mask accessory condition mask image to form a weakly supervised triplet;
S34, applying data augmentation to the clothing model images with accessories based on the PositionIC synthesis strategy, randomly placing the accessory image at different boundaries and angles to form synthetic training data pairs with the original clothing model image;
S35, segmenting and locating the accessories in the clothing model images of the paired image groups with the SAM model to obtain accessory position mask images, combining the expanded clothing model images with accessories and their accessory mask images, and recording them as consistency triplets, thereby expanding the paired image data;
S36, annotating the clothing model images in the consistency triplets and the weakly supervised triplets with the multi-modal large language model Qwen 2.5, the large model being required to describe key elements such as the model's wearing style, the accessory wearing position, the accessory elements and the clothing features in as much detail as possible, so as to obtain the corresponding text description prompts;
S37, using the weakly supervised triplet data for the first training stage of the two-stage mixture-of-experts conditional LoRA fine-tuning strategy to improve the model's blind target localization capability and form a spatial positioning and synthesis prior, and using the consistency triplet data for the second training stage to improve the texture consistency of the generated accessory so that the replaced accessory is highly consistent with the target reference; finally, integrating the text prompts and image data to obtain the first-stage input data and the second-stage input data respectively, each divided into a training set, a validation set and a test set at a ratio of 8:1:1.
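
The 8:1:1 division in step S37 can be sketched as follows. This is only an illustration: the patent does not say whether the split is random or how rounding is handled, so the shuffling, the fixed seed and the function name `split_811` are assumptions.

```python
import random


def split_811(samples, seed=0):
    """Shuffle samples and split them into train/validation/test at 8:1:1.

    Rounding policy (an assumption): train gets floor(0.8 * n), validation
    gets floor(0.1 * n), and the test set receives the remainder.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_811(range(20000))
    print(len(train), len(val), len(test))
```

Applying the same function separately to the weakly supervised triplets and the consistency triplets keeps the two stages' data from leaking into each other's validation and test sets.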
4. The method for partially replacing clothing accessories according to claim 1, wherein the method for training LoRA with the two-stage mixture-of-experts conditional LoRA fine-tuning strategy in step S5 is as follows: training is divided into two stages; in the first stage, a dedicated spatial positioning expert and the other LoRA experts mainly learn how to redraw the accessory locations on unpaired images and fuse them naturally, this stage being a weakly supervised training process whose main purpose is to improve the spatial positioning capability of the model; in the second stage, the other LoRA experts are trained so that the texture and shape inside the accessory mask region of the clothing model image stay consistent with the input accessory image; meanwhile, the conditional multi-modal DiT attention method (CMMDiTA) and the attention condition position mask mechanism (ACPM) are introduced into the training of both stages to make attention condition-aware and region-limited, so that the model obtains multi-condition scalability and accurate local control.
5. The method for partial replacement of clothing accessories according to claim 1, wherein the conditional multi-modal DiT attention method in step S5 is as follows: the input consists of feature vectors from the model image, the text prompt, the accessory images and the mask images, wherein the conditional encoding vector of each accessory is a weighted sum of the feature vector of the accessory image and the position feature of the mask image; first, the input feature sequence is partitioned: the query matrices from the main clothing model image and the text are defined as tokens that are allowed global access, which guarantees overall consistency, and correspondingly the query matrices from the accessory conditions are defined as having access only to their own corresponding local sub-sequence features, which avoids mutual interference among different accessory conditions; then, multi-modal attention is applied: for the global branch, the text, main image and all condition features are fused, and for the conditional branch, attention is computed independently for each accessory condition, ensuring that feature alignment takes place only within the local range; finally, the results of the global branch and the conditional branch are concatenated and passed to the next module through one linear projection layer.
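
The access rule described in claim 5 can be pictured as a boolean attention mask. The sketch below is an illustration under assumptions, not the patented implementation: it assumes the tokens are laid out as one global block (text plus main image) followed by one contiguous block per accessory condition, and the function name `build_condition_attention_mask` is hypothetical.

```python
def build_condition_attention_mask(n_global, cond_lengths):
    """Return allowed[q][k]: True when query token q may attend to key token k.

    Assumed token layout: [global tokens | condition 1 | condition 2 | ...].
    Global queries (text + main image) see every token; each condition's
    queries see only the tokens of that same condition.
    """
    total = n_global + sum(cond_lengths)
    allowed = [[False] * total for _ in range(total)]

    # Global branch: unrestricted access to all tokens.
    for q in range(n_global):
        for k in range(total):
            allowed[q][k] = True

    # Conditional branch: block-diagonal access, one block per accessory.
    start = n_global
    for length in cond_lengths:
        end = start + length
        for q in range(start, end):
            for k in range(start, end):
                allowed[q][k] = True
        start = end
    return allowed


if __name__ == "__main__":
    mask = build_condition_attention_mask(n_global=2, cond_lengths=[2, 1])
    for row in mask:
        print("".join("1" if cell else "." for cell in row))
```

In a real attention layer such a mask would typically be turned into additive `-inf` biases before the softmax; that step is omitted here.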
6. The method for local replacement of clothing accessories according to claim 1, wherein the method for performing two-stage model optimization with the joint loss function in step S6 is as follows: the model training process is divided into two stages; in the first stage, the positioning expert LoRA and the other LoRA experts are mainly trained so that, without paired images, the model learns how to regenerate accessory elements at key accessory positions with naturally fused boundaries, the model being optimized mainly through the joint use of the diffusion denoising loss, the background preservation loss and the expert load balancing loss; in the second stage, the other mixture-of-experts LoRA experts are trained to ensure high consistency with the target accessory, the model being optimized mainly through the joint use of the diffusion denoising loss, the consistency loss, the background preservation loss, the pixel perception loss and the expert load balancing loss, which ensures stable learning of the conditional image information.
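
One way to write the two stage objectives of claim 6 is shown below. The claim names only the loss terms; combining them as a weighted sum, the symbol names and the weights $\lambda_i$, $\mu_i$ are assumptions made here for illustration.

```latex
\begin{aligned}
\mathcal{L}_{\text{stage 1}} &= \lambda_{1}\,\mathcal{L}_{\text{denoise}}
  + \lambda_{2}\,\mathcal{L}_{\text{bg}}
  + \lambda_{3}\,\mathcal{L}_{\text{balance}},\\
\mathcal{L}_{\text{stage 2}} &= \mu_{1}\,\mathcal{L}_{\text{denoise}}
  + \mu_{2}\,\mathcal{L}_{\text{cons}}
  + \mu_{3}\,\mathcal{L}_{\text{bg}}
  + \mu_{4}\,\mathcal{L}_{\text{pix}}
  + \mu_{5}\,\mathcal{L}_{\text{balance}}.
\end{aligned}
```

Here $\mathcal{L}_{\text{denoise}}$ stands for the diffusion denoising loss, $\mathcal{L}_{\text{bg}}$ for the background preservation loss, $\mathcal{L}_{\text{balance}}$ for the expert load balancing loss, $\mathcal{L}_{\text{cons}}$ for the consistency loss and $\mathcal{L}_{\text{pix}}$ for the pixel perception loss.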
7. The method for local replacement of clothing accessories according to claim 1, wherein the method for fusing the model weights and network structures of the trained multi-condition mixture-of-experts LoRA and the DiT in step S7 is as follows: first, the pre-trained DiT backbone weights, including the original weights of all Transformer layers, are loaded together with the trained MOE-C-LoRA weights, a corresponding low-rank decomposition matrix being stored for each conditional LoRA expert; then a dynamic condition fusion strategy is adopted, in which dynamic gating coefficients are computed from the input condition features during inference and the router dynamically selects and weights the LoRA experts; the fused weights are then written back to the corresponding layers of the DiT model, and the fused overall weights are saved as a new model checkpoint to be called during inference; finally, the model network structure and the model weights fused with the LoRA expert parameters are loaded, and under the action of the CMMDiTA and ACPM modules the weights are used to compute attention and update features, so that the model generates the result through step-by-step diffusion.
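
The gated fusion of claim 7 can be sketched for a single weight matrix. This is a minimal illustration under assumptions: the claim speaks only of "dynamic gating coefficients", so the softmax gate, the update form W + sum_i g_i * (B_i A_i) and the function names are choices made here, and plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_experts(base_weight, experts, router_scores):
    """Return base_weight + sum_i gate_i * (B_i @ A_i).

    `experts` is a list of (B, A) low-rank factor pairs, one per LoRA expert;
    `router_scores` holds one raw score per expert (softmax gating is an
    assumption, not stated in the patent).
    """
    gates = softmax(router_scores)
    fused = [row[:] for row in base_weight]  # copy so the base stays intact
    for gate, (b, a) in zip(gates, experts):
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                fused[i][j] += gate * value
    return fused


if __name__ == "__main__":
    base = [[0.0, 0.0], [0.0, 0.0]]
    experts = [([[1.0], [0.0]], [[1.0, 0.0]]), ([[0.0], [1.0]], [[0.0, 1.0]])]
    print(fuse_experts(base, experts, router_scores=[0.0, 0.0]))
```

Because the gates depend on the input condition features, a fused checkpoint of this kind is specific to the routing decision it was computed for.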
8. The method for local replacement of clothing accessories according to claim 1, wherein in step S7 the method for performing inference with the fused DiT from the clothing model image, the accessory image, the custom mask image and the text description input by the user is as follows:
S71, first, the user uploads a clothing model image and an accessory image, and then paints a custom mask region on the clothing model image in the system to determine the region where the accessory is to be placed, yielding the mask image; meanwhile, preprocessing is applied to the input clothing model image, accessory image and mask image to ensure that the image data are suitable for subsequent model operations, and the related text description input by the user is passed into the system for subsequent use;
S72, inputting the preprocessed images and the text description together, in parallel, into the Flux model whose weights have been modified by MOE-C-LoRA, with the masking mechanism of CMMDiTA applied in the attention layers: first mapping the clothing model image to latent-space noise through the VAE encoder, then extracting image feature vectors of the accessory image through a CLIP visual encoder, then encoding the mask image into position vectors through a position encoder, next converting the text description into semantic feature vectors through the T5 text encoder, and finally computing a weighted sum of the image feature vectors and the position encoding vectors, so that the range of noise influence and the accessory appearance features can be effectively controlled in the attention computation;
S73, inputting the features obtained in S72 into the DiT architecture, the system gradually estimating, starting from pure noise, the latent-space representation of the denoised features of the corresponding clothing model image, the features being continuously refined by applying the different LoRA experts; in addition, the conditional multi-modal DiT attention method (CMMDiTA) ensures the stability of the feature spatial positions, and the attention condition position mask mechanism (ACPM) ensures the consistency of the accessory features;
S74, inputting the latent-space representation of the clothing model image obtained in S73 into the pre-trained VAE decoder module for decoding and reconstruction to obtain the final output clothing model image, in which the target accessory has been replaced in the designated region with high-quality edge fusion, clear texture structure and consistent appearance, thereby accomplishing the accessory replacement task specified by the user;
S75, the system outputs the clothing model image with the target accessory generated from the user's input to the user interface for real-time preview, and supports exporting and saving it in formats such as JPG, PNG or PDF, which improves the secondary editing capability and application flexibility of the image.
9. A computer-readable medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the clothing accessory local replacement method according to any one of claims 1 to 8.
10. An electronic device, comprising: one or more processors; and a storage system for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the clothing accessory local replacement method according to any one of claims 1 to 8.

Description

Clothing accessory local replacement method, medium and electronic equipment

Technical Field

The invention belongs to the technical field of computers and networks, and particularly relates to a local clothing accessory replacement method, a medium and electronic equipment.

Background

With the rapid development of e-commerce and virtual fitting technologies, consumers are no longer satisfied with the traditional fitting experience of "whole-garment replacement"; they want to replace local clothing accessories (such as glasses, earrings, waistbands and necklaces) accurately and naturally on the same model, to obtain highly personalized, scenario-based try-on previews. However, the prior art still faces three major bottlenecks on this task. First, multi-condition coupling is difficult: conventional methods (such as ControlNet and IP-Adapter) usually support only a single control signal (such as text or a mask) and struggle to process multi-modal input consisting of a text prompt, an accessory image and a precise position mask at the same time, which leads to inconsistency between the locally replaced region and the global style. Second, elements and positions conflict: existing UNet-based diffusion models lack fine-grained spatial control, so accessory misalignment, element drift or interference with adjacent regions easily occur, and these problems become more pronounced when several accessories are replaced at once. Finally, data dependence is strong: the high-quality model-accessory-mask-result quadruple data needed for model training is scarce, conventional supervised methods need a large number of paired training samples, and the diverse outfit-combination requirements of real scenes are difficult to cover.
In recent years, the rise of the diffusion Transformer (DiT) has provided a systematic way to address these bottlenecks, moving the local garment replacement task from the laboratory to the threshold of the clothing industry and offering a key traffic and commercial first-mover advantage to platforms such as live-streaming e-commerce and fast-fashion design.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a local replacement method for clothing accessories, a medium and electronic equipment. In order to achieve the above purpose, the present invention provides the following technical solutions: in one aspect, the present invention provides a method for locally replacing a clothing accessory, comprising the steps of:
S1, acquiring paired clothing model images with accessories and clothing model images without paired accessories from a database, wherein there are 5000 pairs of paired data in total, each pair comprising one accessory image and the corresponding clothing model image with the accessory, and 20000 clothing model images without paired accessories; generating text descriptions of the screened images (covering the model's figure proportions, accessory elements, clothing features and the like) with a large language model; and taking the 30000 images and 25000 text description prompts as training data;
S2, a user manually uploads to the system the clothing model image, the accessory images to be used for replacement and the corresponding text descriptions (the number of accessory images input by the user being not more than 3), which are preprocessed and then input into the system to facilitate subsequent operations;
S3, locating the corresponding accessory positions in the 20000 clothing model images without paired accessories by using a pre-trained SAM model and generating accessory condition mask images; at the same time, based on the masks, performing slight diffusion redrawing of the accessory region with a pre-trained SDXL-inpainting model to remove identifiable accessory traces, the result being recorded as a clothing model image with blurred accessory boundaries; forming weakly supervised triplet data from the full-mask accessory condition mask image (full-Mask) together with the clothing model image with blurred accessory boundaries; in addition, using the PositionIC synthesis strategy, randomly placing the single accessory samples of the 5000 pairs at different boundaries or angles in the clothing model images with accessories to synthesize training data pairs, thereby generating "synthetic paired" data carrying position signals to expand the small amount of paired data; applying the SAM model to the clothing model images to generate accessory position mask images, the expanded clothing model images with accessories totalling 10000 pairs; and recording the clothing model images with accessories combined with their accessory mask images as consistency triplets, so as to increase position diversity and scale robustness;
S4, performing secondary training by using a Flux-based diffusion Transformer architecture (Diffusion in Transformer, DiT) through a two-stage mixed expert condition LoRA fine tuning strategy (MOE-C-LoRA), adopting weak supervision triplet data in a first sta