CN-121982332-A - Multi-label image recognition method and system based on three-dimensional attention and dynamic grouping

Abstract

The invention discloses a multi-label image recognition method and system based on three-dimensional attention and dynamic grouping. An image to be recognized is fed into a detail feature extraction module and a contour feature extraction module in parallel, and a dynamic feature interaction module provides bidirectional complementation between the two branches during the feature extraction stage. The double-branch output features are then calibrated sequentially by an attention calibration module along three dimensions (channel, space and label) to obtain an accurate feature matrix. The calibrated features are sent to a multi-scale feature fusion module, which fuses features of different levels across scales through feature splitting, independent convolution enhancement, cross-scale lateral connection fusion, global pooling and adaptive weighted integration. The fused features are passed to a label grouping dynamic classifier, which optimizes the overall parameters through label semantic grouping, grouped feature mapping, dynamic weight distribution and multi-label joint prediction, and the final multi-label prediction results are output through an activation function.

Inventors

YAO LAN

Assignees

Zhejiang Sci-Tech University

Dates

Publication Date: 20260505
Application Date: 20260407

Claims (10)

1. A multi-label image recognition method based on three-dimensional attention and dynamic grouping, characterized by comprising the following steps: collecting an image to be recognized, and preprocessing the image to be recognized; the multi-label recognition model comprises a double-branch feature extraction module, an attention calibration module, a multi-scale feature fusion module and a label grouping dynamic classification module; the double-branch feature extraction module comprises a basic feature extraction module and a feature enhancement module connected in series, wherein a detail feature extraction module and a contour feature extraction module are arranged in parallel in the basic feature extraction module, the detail feature extraction module is used for extracting detail features from the image to be recognized, the contour feature extraction module is used for extracting contour features from the image to be recognized, the feature enhancement module unifies the dimensions of the detail features and the contour features through a convolution layer to generate a detail projection weight matrix and a contour projection weight matrix, the detail projection weight matrix and the contour projection weight matrix are respectively fused to obtain contour enhancement features and detail enhancement features, and the double-branch fusion features are generated through channel concatenation; the attention calibration module is used for optimizing the double-branch fusion features to obtain three-dimensional calibration features; the multi-scale feature fusion module is used for processing and fusing the three-dimensional calibration features at different scales to obtain weighted feature vectors; the label grouping dynamic classification module is used for performing label classification according to the weighted feature vectors to obtain the output result of the multi-label recognition model; and identifying the labels of the image to be recognized by using the multi-label recognition model.
2. The multi-label image recognition method based on the three-dimensional attention and dynamic grouping is characterized in that in a feature enhancement module, the dimensions of detail features and contour features are unified through a convolution layer to obtain detail intermediate features and contour intermediate features, the detail intermediate features and the contour intermediate features are projected to a contour space and a detail space respectively to generate a detail projection weight matrix and a contour projection weight matrix, the contour intermediate features and the detail intermediate features are fused with the detail projection weight matrix and the contour projection weight matrix respectively to obtain contour enhancement features and detail enhancement features, and the contour enhancement features and the detail enhancement features are fused to obtain double-branch fusion features output by a double-branch feature extraction module.
3. The multi-label image recognition method based on three-dimensional attention and dynamic grouping according to claim 1, characterized in that the attention calibration module processes the double-branch fusion features sequentially through a channel attention sub-module, a dimension reduction module, a spatial attention sub-module and a label attention sub-module, fuses the processing result with the double-branch fusion features, and then performs a layer normalization operation to obtain the three-dimensional calibration features output by the attention calibration module, wherein the dimension reduction module is used for unifying the dimensions of the output of the channel attention sub-module; in the label attention sub-module, an association matrix is built based on the label co-occurrence probabilities of the data set, the element values of the association matrix being the frequency proportions with which label pairs appear together; the association matrix is converted into label embedding vectors through an embedding layer; the cosine similarity between the output features of the spatial attention sub-module and the label embedding vectors is calculated to obtain a similarity matrix; global normalization and activation function processing are applied to the similarity matrix in sequence to obtain a label weight matrix; and the label weight matrix is multiplied channel by channel with the output features of the spatial attention sub-module to obtain the label calibration features output by the label attention sub-module.
4. A multi-label image recognition method based on three-dimensional attention and dynamic grouping, characterized in that, in the channel attention sub-module, global average pooling and global maximum pooling are respectively applied to the double-branch fusion features to obtain an average feature map and a maximum feature map; after the average feature map and the maximum feature map are fused, two multi-layer perceptron layers and two activation functions are applied in sequence to obtain a channel weight vector; and the channel weight vector is multiplied with the double-branch fusion features channel by channel to obtain the channel calibration features output by the channel attention sub-module.
5. The multi-label image recognition method based on three-dimensional attention and dynamic grouping according to claim 3, characterized in that, in the spatial attention sub-module, average pooling and maximum pooling are respectively performed on the channel calibration features to obtain an average spatial statistical matrix and a maximum spatial statistical matrix; after the two spatial statistical matrices are concatenated, a transposed convolution, a convolution layer, a maximum pooling layer and an activation function are applied in sequence to obtain a spatial weight matrix; and the spatial weight matrix is multiplied with the channel calibration features position by position to obtain the spatial calibration features output by the spatial attention sub-module.
6. A multi-label image recognition method based on three-dimensional attention and dynamic grouping, characterized in that, in the multi-scale feature fusion module, the three-dimensional calibration features are split along the channel dimension into shallow features, middle features and high-level features; the shallow features, the middle features and the high-level features are each enhanced by an independent convolution layer to obtain shallow enhancement features, middle enhancement features and high-level enhancement features; the features before and after enhancement are added element by element to obtain shallow fusion features, middle fusion features and high-level fusion features; the shallow fusion features and the middle fusion features are added element by element to obtain first fusion features; the first fusion features and the high-level fusion features are concatenated along the channel dimension to obtain second fusion features; and global pooling and adaptive weighting are applied to the second fusion features to obtain the weighted feature vectors.
7. The multi-label image recognition method based on three-dimensional attention and dynamic grouping, characterized in that the label grouping dynamic classification module comprises a label grouping management module, a dynamic weight generation module, a grouping adaptation classification module, a feature optimization module and a feature fusion decision module, wherein the label grouping management module is used for grouping all labels according to their semantics and outputting the grouping rules and mapping tables, the dynamic weight generation module is used for processing the weighted feature vectors through a multi-layer perceptron and intra-group attention to generate grouping weights, the grouping adaptation classification module is used for configuring a dedicated sub-classifier for each group of labels and outputting a three-dimensional intermediate feature matrix in parallel, the feature optimization module is used for processing the three-dimensional intermediate feature matrix to output a probability value for each label, and the feature fusion decision module is used for binarizing the probability value of each label to obtain the recognition result of the multi-label recognition model.
8. The method for identifying multi-label images based on three-dimensional attention and dynamic grouping of claim 6 wherein said sub-classifier comprises a first fully connected layer, a batch normalization layer, a first activation function, a second fully connected layer and a second activation function connected in sequence.
9. The multi-label image recognition method based on three-dimensional attention and dynamic grouping, characterized by comprising: training the multi-label recognition model using a data set constructed from images of different targets, and constructing a total loss function to guide the updating of model parameters during training, wherein the total loss function comprises a category constraint loss, a cross entropy loss and a double-branch feature complementary loss; the category constraint loss is used for constraining global category discrimination, the cross entropy loss is used for addressing the imbalanced sample distribution among the multiple labels, and the double-branch feature complementary loss is used for constraining the bidirectional adaptability of the detail features and the semantic features.
10. A multi-label image recognition system based on three-dimensional attention and dynamic grouping, characterized by being used for executing the multi-label image recognition method based on three-dimensional attention and dynamic grouping, and comprising a double-branch feature extraction module, an attention calibration module, a multi-scale feature fusion module and a label grouping dynamic classification module, wherein the double-branch feature extraction module is used for extracting detail features and contour features from an image to be recognized and fusing them after feature enhancement, the attention calibration module is used for optimizing the features extracted by the double-branch feature extraction module, the multi-scale feature fusion module is used for dividing the optimized features into multiple scales and fusing them after feature enhancement, and the label grouping dynamic classification module is used for obtaining the multiple labels corresponding to the image to be recognized according to the output result of the multi-scale feature fusion module.
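As a non-limiting illustration of the label attention sub-module of claim 3, the following minimal Python sketch builds a label co-occurrence matrix and derives label weights from cosine similarity. Several points are assumptions made only for illustration and are not stated in the claims: each matrix entry is taken to be the fraction of samples in which both labels appear, "global normalization" is taken to be standardization to zero mean and unit variance, the activation function is taken to be a sigmoid, and the function names `cooccurrence_matrix`, `cosine_similarity` and `label_weights` are hypothetical. The learned embedding layer is omitted; the label embedding vectors are simply passed in as an argument.

```python
import math
from itertools import combinations


def cooccurrence_matrix(label_sets, num_labels):
    """Return an N x N matrix whose (i, j) entry is the fraction of samples
    in which labels i and j appear together (assumed reading of claim 3)."""
    if not label_sets:
        raise ValueError("label_sets must contain at least one sample")
    counts = [[0] * num_labels for _ in range(num_labels)]
    for labels in label_sets:
        unique = sorted(set(labels))
        for i in unique:
            counts[i][i] += 1  # diagonal: how often label i occurs at all
        for i, j in combinations(unique, 2):
            counts[i][j] += 1
            counts[j][i] += 1  # keep the matrix symmetric
    total = len(label_sets)
    return [[c / total for c in row] for row in counts]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_weights(feature, label_embeddings):
    """Similarity -> standardization -> sigmoid, giving one weight per label."""
    sims = [cosine_similarity(feature, e) for e in label_embeddings]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    if std == 0.0:
        normalized = [0.0 for _ in sims]
    else:
        normalized = [(s - mean) / std for s in sims]
    return [1.0 / (1.0 + math.exp(-z)) for z in normalized]


# Four annotated samples over three labels (toy data, not from the patent).
matrix = cooccurrence_matrix([[0, 1], [0, 1, 2], [1], [0, 2]], num_labels=3)
# Rows of the matrix stand in for learned label embeddings in this sketch.
weights = label_weights(matrix[0], matrix)
```

In this toy data set, labels 0 and 1 appear together in two of the four samples, so `matrix[0][1]` is 0.5. In the claimed method the resulting label weights would then be multiplied channel by channel with the output features of the spatial attention sub-module.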
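As a non-limiting illustration of the channel attention sub-module of claim 4, the following minimal Python sketch computes a channel weight vector and applies it to a small feature map. The claims do not specify how the two pooled maps are fused or which activation functions are used; this sketch assumes element-wise addition for the fusion, a ReLU after the first perceptron layer and a sigmoid after the second. The function name `channel_attention` and the weight matrices `w1` and `w2` are hypothetical and would be learned parameters in practice.

```python
import math


def channel_attention(features, w1, w2):
    """features: list of C channels, each an H x W nested list of floats.
    w1: hidden x C weight matrix, w2: C x hidden weight matrix (hypothetical).
    Returns (channel weight vector, channel-calibrated features)."""
    # Global average pooling and global max pooling, one value per channel.
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    mx = [max(max(row) for row in ch) for ch in features]
    # Assumed fusion: element-wise addition of the two pooled vectors.
    fused = [a + m for a, m in zip(avg, mx)]
    # First perceptron layer followed by ReLU (assumed activation).
    hidden = [max(0.0, sum(w * x for w, x in zip(row, fused))) for row in w1]
    # Second perceptron layer followed by sigmoid (assumed activation).
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Channel-by-channel multiplication with the input features.
    calibrated = [
        [[weights[c] * v for v in row] for row in ch]
        for c, ch in enumerate(features)
    ]
    return weights, calibrated


# Two channels of size 2 x 2 and a hidden width of 1 (toy values).
feats = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 2.0]]]
channel_weights, calibrated = channel_attention(
    feats, w1=[[1.0, -1.0]], w2=[[0.5], [-0.5]]
)
```

With these toy weights the first channel receives a weight above 0.5 and the second a weight below 0.5, so the first channel is emphasized relative to the second in the calibrated output.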

Description

Multi-label image recognition method and system based on three-dimensional attention and dynamic grouping

Technical Field

The invention belongs to the technical field of computer image recognition, and particularly relates to a multi-label image recognition method and system based on three-dimensional attention and dynamic grouping, which can be widely applied to multi-label classification scenarios in fields such as clothing, commodities and medical imaging.

Background

Multi-label image recognition is one of the core tasks in the field of computer vision. It aims to predict multiple labels for the same image and is widely applied in scenarios such as commodity classification, medical image diagnosis and intelligent monitoring. Unlike traditional single-label classification, multi-label recognition must deal with problems such as complex correlations among labels and imbalanced sample distribution. Taking the e-commerce field as an example, a commodity image usually needs to be labeled with multiple attributes such as category, style and material. Because e-commerce platforms present clothing styles mainly through pictures, the volume of image data to be recognized has grown explosively. The massive number of product pictures gives consumers more choices but also poses a great challenge for commodity labeling. Traditional manual labeling wastes manpower and material resources and is inefficient; moreover, because merchants' labeling standards are not uniform and different annotators perceive images differently, labels easily become inconsistent with images. Against this background, research on automatic image recognition technology has become an inevitable trend and an important means of solving the labeling problem for e-commerce product pictures.
At present, automatic image recognition technology mainly adopts machine learning, in which an algorithm extracts rules from existing data in order to predict results for new samples. Machine learning can be further divided into traditional machine learning and deep learning. Traditional machine learning is more targeted and predicts well with small sample sizes, but extracting effective features from raw data is difficult and requires researchers to be highly familiar with the field, which raises the threshold for applying it; it is also strongly affected by factors such as the sample's environment, shape and shooting angle, so it generalizes poorly. Deep learning can overcome these shortcomings: it learns data features and classification automatically and generalizes well. However, existing deep learning networks suffer from problems such as single-branch models losing details or capturing them insufficiently, two-dimensional (spatial and channel) attention being unable to align with label semantics, and static fusion causing imbalance among cross-scale features. As a result, multi-label image recognition accuracy is low and these networks cannot be applied directly.

Disclosure of Invention

The invention aims to provide a multi-label image recognition method and system based on three-dimensional attention and dynamic grouping.
In a first aspect, the present invention provides a multi-label image recognition method based on three-dimensional attention and dynamic grouping, the method comprising: collecting an image to be recognized, and preprocessing the image to be recognized; the multi-label recognition model comprises a double-branch feature extraction module, an attention calibration module, a multi-scale feature fusion module and a label grouping dynamic classification module; the double-branch feature extraction module comprises a basic feature extraction module and a feature enhancement module connected in series, wherein the basic feature extraction module comprises a detail feature extraction module and a contour feature extraction module arranged in parallel, the detail feature extraction module is used for extracting detail features from the image to be recognized, the contour feature extraction module is used for extracting contour features from the image to be recognized, the feature enhancement module unifies the dimensions of the detail features and the contour features through a convolution layer to generate a detail projection weight matrix and a contour projection weight matrix, the detail projection weight matrix and the contour projection weight matrix are respectively fused to obtain contour enhancement features and detail enhancement features, and the generation of double-branch fusion features is realized through channel