CN-121982762-A - Image recognition model training method, apparatus, computer readable storage medium and computer program product
Abstract
The application relates to an image recognition model training method and apparatus, a computer readable storage medium, and a computer program product. The method comprises: verifying the recognition performance of an original face parsing model against a validation sample set to obtain its recognition performance under a plurality of preset candidate visual semantic types; querying a preset attribute tag library according to the visual semantic type to be optimized to obtain the selectable semantic attribute tags under that type; replacing an attribute tag of each sample to be enhanced with one of those selectable tags; and, using the replaced attribute tag as an image semantic attribute constraint and the sample's original spatial layout as an image spatial layout constraint, generating enhanced training data for the sample through a preset diffusion generation model. By adopting the method, the accuracy of face parsing can be improved.
Inventors
- LIU TING
- CHE FEI
- TANG YOUBIN
- QU XIAOCHAO
- LIU LUOQI
Assignees
- 厦门美图之家科技有限公司
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-27
Claims (10)
- 1. A method for training an image recognition model, the method comprising: performing recognition performance verification on an original face parsing model according to a validation sample set to obtain the recognition performance of the original face parsing model under a plurality of preset candidate visual semantic types; selecting a sample to be enhanced from the validation sample set according to the recognition performance and a preset performance condition, and determining a visual semantic type to be optimized of the original face parsing model from among the candidate visual semantic types; querying a preset attribute tag library according to the visual semantic type to be optimized to obtain selectable semantic attribute tags under the visual semantic type to be optimized, wherein the attribute tag library comprises selectable semantic attribute tags respectively corresponding to the plurality of candidate visual semantic types; performing attribute tag replacement on the sample to be enhanced according to the selectable semantic attribute tags under the visual semantic type to be optimized to obtain a replaced attribute tag; generating, through a preset diffusion generation model, enhanced training data corresponding to the sample to be enhanced, taking the replaced attribute tag as an image semantic attribute constraint and the original spatial layout of the sample to be enhanced as an image spatial layout constraint; and optimizing the original face parsing model according to the enhanced training data to obtain an optimized face parsing model.
- 2. The method of claim 1, wherein the enhanced training data comprises an enhanced face image, and the original face parsing model comprises a multi-class semantic segmentation network for performing the face parsing task; wherein performing attribute tag replacement on the sample to be enhanced according to the selectable semantic attribute tags under the visual semantic type to be optimized to obtain a replaced attribute tag comprises: replacing the original attribute tag of the sample to be enhanced under the visual semantic type to be optimized with at least one selectable semantic attribute tag under the visual semantic type to be optimized to obtain the replaced attribute tag; and wherein generating the enhanced training data corresponding to the sample to be enhanced through the preset diffusion generation model, taking the replaced attribute tag as an image semantic attribute constraint and the original spatial layout of the sample to be enhanced as an image spatial layout constraint, comprises: generating a natural language text instruction describing the enhanced image content according to the replaced attribute tag and the other attribute tags of the sample to be enhanced; generating the enhanced face image through the diffusion generation model, taking the natural language text instruction as a semantic guidance condition and the ground-truth segmentation mask corresponding to the sample to be enhanced as a spatial layout condition; and pairing the enhanced face image with the ground-truth segmentation mask corresponding to the sample to be enhanced to obtain the enhanced training data.
- 3. The method of claim 2, wherein generating the enhanced face image through the diffusion generation model, taking the natural language text instruction as a semantic guidance condition and the ground-truth segmentation mask corresponding to the sample to be enhanced as a spatial layout condition, comprises: inputting the natural language text instruction into a text encoder of the diffusion generation model to obtain a text embedding vector representing the overall semantic content of the image; inputting the ground-truth segmentation mask into a control network parallel to the main generation network of the diffusion generation model, so as to encode the ground-truth segmentation mask through the control network and obtain a feature map representing the spatial layout and contour of each semantic type in the ground-truth segmentation mask; inputting the text embedding vector into the main generation network, and injecting the feature map into the corresponding levels of the main generation network through zero-initialized convolution layers; and performing iterative denoising with the main generation network to obtain the enhanced face image, which is pixel-aligned with the ground-truth segmentation mask and matches the description in the natural language text instruction.
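The zero-initialized convolution layers in claim 3 ensure that, at the start of fine-tuning, the control branch contributes nothing, so the main generation network initially behaves exactly as before conditioning was added. A minimal numpy sketch of this injection step, assuming the injection is a 1×1 convolution over channel-first feature maps (function names and shapes are illustrative, not prescribed by the claim):

```python
import numpy as np

def zero_conv_params(channels):
    """1x1 convolution weights and bias initialized to zero, as used to
    inject control-network features into the main generation network."""
    return np.zeros((channels, channels)), np.zeros(channels)

def inject(backbone_feat, control_feat, weight, bias):
    """Add the zero-conv-projected control feature map to a backbone feature
    map of shape (C, H, W). With zero-initialized parameters the output
    equals the backbone features, so training starts from the unmodified
    generator and the control signal is learned in gradually."""
    c, h, w = control_feat.shape
    mixed = (weight @ control_feat.reshape(c, -1)).reshape(c, h, w)
    mixed += bias[:, None, None]
    return backbone_feat + mixed
```

With zero weights the injection is an identity on the backbone features; gradients through `weight` and `bias` then let the segmentation-mask features influence generation without destabilizing the pretrained model.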
- 4. The method of claim 3, wherein the recognition performance comprises a mean intersection-over-union (mIoU), and the preset performance condition comprises an mIoU threshold; wherein selecting a sample to be enhanced from the validation sample set according to the recognition performance and the preset performance condition, and determining the visual semantic type to be optimized of the original face parsing model from among the candidate visual semantic types, comprises: for each candidate visual semantic type, calculating the mIoU of the original face parsing model for the candidate visual semantic type based on the predicted segmentation masks of the original face parsing model on the validation sample set and the ground-truth segmentation masks of the validation sample set; and ranking all the candidate visual semantic types by mIoU from low to high, and determining at least one candidate visual semantic type that ranks lowest and falls below the mIoU threshold as the visual semantic type to be optimized in the current optimization iteration.
- 5. The method of claim 4, wherein, before replacing the original attribute tag of the sample to be enhanced under the visual semantic type to be optimized with the at least one selectable semantic attribute tag under the visual semantic type to be optimized to obtain the replaced attribute tag, the method further comprises: for each validation sample image in the validation sample set that contains the visual semantic type to be optimized, calculating the single-image intersection-over-union (IoU) of the current validation sample image on the visual semantic type to be optimized; sorting all validation sample images in the validation sample set that contain the visual semantic type to be optimized by single-image IoU from low to high; and determining the samples to be enhanced as a preset proportion or a fixed number of the lowest-ranked validation sample images.
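The per-image selection in claim 5 amounts to a filter-sort-truncate over the validation set. A small sketch, assuming samples are dict records and `iou_of` is a hypothetical callable computing the single-image IoU on the given type (the record layout and default fraction are assumptions for illustration):

```python
def select_samples_to_enhance(samples, type_to_optimize, iou_of,
                              fraction=0.2, fixed_count=None):
    """Keep only validation images containing the type to optimize, sort
    them by single-image IoU on that type (ascending), and return the
    worst `fraction` (or a `fixed_count`) as samples to be enhanced."""
    candidates = [s for s in samples if type_to_optimize in s["tags"]]
    ranked = sorted(candidates, key=lambda s: iou_of(s, type_to_optimize))
    n = fixed_count if fixed_count is not None else max(1, int(len(ranked) * fraction))
    return ranked[:n]
```

Selecting the lowest-IoU images focuses the diffusion-based augmentation on exactly the hard samples where the parser currently fails.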
- 6. The method of claim 1, wherein performing recognition performance verification on the original face parsing model according to the validation sample set to obtain the recognition performance of the original face parsing model under the plurality of preset candidate visual semantic types comprises: inputting the validation sample images in the validation sample set into the original face parsing model to obtain a predicted segmentation mask of the original face parsing model for each validation sample image; and, for each candidate visual semantic type, calculating the mean intersection-over-union between the predicted segmentation masks and the ground-truth segmentation masks of all validation sample images in the validation sample set under the candidate visual semantic type, to obtain the recognition performance index of the original face parsing model under the candidate visual semantic type.
- 7. The method of claim 1, wherein, before querying the preset attribute tag library according to the visual semantic type to be optimized to obtain the selectable semantic attribute tags under the visual semantic type to be optimized, the method further comprises: analyzing face images in the original training data of the original face parsing model using a visual language model, and identifying the semantic attribute tags of the image regions corresponding to the candidate visual semantic types in each image, wherein the visual language model comprises a dense attribute prediction model; and storing the identified attribute tags in association with the candidate visual semantic types to which they belong, to obtain the attribute tag library.
- 8. An image recognition model training apparatus, the apparatus comprising: a verification module, configured to perform recognition performance verification on an original face parsing model according to a validation sample set to obtain the recognition performance of the original face parsing model under a plurality of preset candidate visual semantic types; a determining module, configured to select a sample to be enhanced from the validation sample set according to the recognition performance and a preset performance condition, and to determine a visual semantic type to be optimized of the original face parsing model from among the candidate visual semantic types; a query module, configured to query a preset attribute tag library according to the visual semantic type to be optimized to obtain selectable semantic attribute tags under the visual semantic type to be optimized, wherein the attribute tag library comprises the selectable semantic attribute tags respectively corresponding to the plurality of candidate visual semantic types; a replacement module, configured to replace an attribute tag of the sample to be enhanced according to the selectable semantic attribute tags under the visual semantic type to be optimized to obtain a replaced attribute tag; a generation module, configured to generate, through a preset diffusion generation model, enhanced training data corresponding to the sample to be enhanced, taking the replaced attribute tag as an image semantic attribute constraint and the original spatial layout of the sample to be enhanced as an image spatial layout constraint; and an optimization module, configured to optimize the original face parsing model according to the enhanced training data to obtain an optimized face parsing model.
- 9. A computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
- 10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
Description
Image recognition model training method, apparatus, computer readable storage medium and computer program product
Technical Field
The present application relates to the field of image processing technology, and in particular to an image recognition model training method and apparatus, a computer readable storage medium, and a computer program product.
Background
With the rapid development of computer vision and artificial intelligence technology, face parsing (FACE PARSING) has become a core foundational technology in fields such as digital entertainment, virtual reality, and medical cosmetology. Face parsing performs pixel-level, fine-grained semantic segmentation of a face image, precisely dividing it into tens of visual semantic components such as skin, hair, facial features, and even accessories. Its output is a high-quality segmentation mask, which is the cornerstone of advanced applications such as beautification, virtual makeup, and expression driving. Given the privacy concerns and labeling difficulty of face images, high-quality annotated face data are hard to acquire. In addition, real face datasets exhibit long-tail distributions and contain large numbers of hard samples, so the quantity and quality of training samples for existing face parsing models are far from ideal, which limits their accuracy.
Disclosure of Invention
Based on this, the application provides an image recognition model training method and apparatus, a computer readable storage medium, and a computer program product, which can improve the sample quality available to a face parsing model.
In one aspect, the present application provides an image recognition model training method, the method comprising: performing recognition performance verification on an original face parsing model according to a validation sample set to obtain the recognition performance of the original face parsing model under a plurality of preset candidate visual semantic types; selecting a sample to be enhanced from the validation sample set according to the recognition performance and a preset performance condition, and determining a visual semantic type to be optimized of the original face parsing model from among the candidate visual semantic types; querying a preset attribute tag library according to the visual semantic type to be optimized to obtain selectable semantic attribute tags under the visual semantic type to be optimized, wherein the attribute tag library comprises selectable semantic attribute tags respectively corresponding to the plurality of candidate visual semantic types; performing attribute tag replacement on the sample to be enhanced according to the selectable semantic attribute tags under the visual semantic type to be optimized to obtain a replaced attribute tag; generating, through a preset diffusion generation model, enhanced training data corresponding to the sample to be enhanced, taking the replaced attribute tag as an image semantic attribute constraint and the original spatial layout of the sample to be enhanced as an image spatial layout constraint; and optimizing the original face parsing model according to the enhanced training data to obtain an optimized face parsing model.
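The tag-replacement step described above can be sketched as follows; the dict-based sample record and library layout are assumptions for illustration (the application does not prescribe a data structure), and the key point is that only the semantic attribute changes while the segmentation mask, i.e. the spatial layout constraint, is carried over unchanged:

```python
import random

def replace_attribute_tag(sample, type_to_optimize, tag_library, rng=random):
    """Swap the sample's attribute tag under the semantic type being
    optimized for a different selectable tag from the library; the
    segmentation mask (spatial layout) is left untouched."""
    options = [t for t in tag_library[type_to_optimize]
               if t != sample["tags"].get(type_to_optimize)]
    if not options:
        return sample  # no alternative tag to swap to
    new_tags = dict(sample["tags"])
    new_tags[type_to_optimize] = rng.choice(options)
    return {"tags": new_tags, "mask": sample["mask"]}
```

Because the mask is reused verbatim, every generated image inherits pixel-accurate ground-truth labels for free, which is what makes the enhanced data directly usable for segmentation training.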
In some embodiments, the enhanced training data includes an enhanced face image, and the original face parsing model includes a multi-class semantic segmentation network for performing the face parsing task. Performing attribute tag replacement on the sample to be enhanced according to the selectable semantic attribute tags under the visual semantic type to be optimized to obtain a replaced attribute tag includes: replacing the original attribute tag of the sample to be enhanced under the visual semantic type to be optimized with at least one selectable semantic attribute tag under the visual semantic type to be optimized to obtain the replaced attribute tag. Generating the enhanced training data corresponding to the sample to be enhanced through the preset diffusion generation model, taking the replaced attribute tag as an image semantic attribute constraint and the original spatial layout of the sample to be enhanced as an image spatial layout constraint, includes: generating a natural language text instruction describing the enhanced image content according to the replaced attribute tag and the other attribute tags of the sample to be enhanced; generating the enhanced face image through the diffusion generation model, taking the natural language text instruction as a semantic guidance condition and the ground-truth segmentation mask corresponding to the sample to be enhanced as a spatial layout condition; and pairing the enhanced face image with the ground-truth segmentation mask corresponding to the sample to be enhanced to obtain the enhanced training data.
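The instruction-building and pairing steps in this embodiment can be sketched as below. The phrasing template is purely illustrative (the application does not fix an exact prompt format), and `pair_enhanced_sample` simply bundles the generated image with the reused ground-truth mask:

```python
def build_text_instruction(tags):
    """Compose a natural-language description of the enhanced image from
    the replaced tag plus the sample's remaining attribute tags.
    `tags` maps semantic type -> attribute, e.g. {"hair": "blond"}."""
    parts = [f"{attr} {semantic_type}" for semantic_type, attr in sorted(tags.items())]
    return "a face photo with " + ", ".join(parts)

def pair_enhanced_sample(enhanced_image, true_mask):
    """Pair the diffusion-generated image with the original sample's
    ground-truth segmentation mask to form one enhanced training record."""
    return {"image": enhanced_image, "mask": true_mask}
```

The instruction drives the semantic content of generation, while the mask pairing guarantees the new sample is already labeled at pixel level.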