CN-119963922-B - Image recognition method, device, equipment and storage medium

Abstract

The application relates to the technical field of image recognition and discloses an image recognition method, apparatus, device and storage medium. The method comprises: acquiring an image to be recognized; performing feature encoding on the image to be recognized to obtain a block encoding representation of each image block; performing, based on a multi-head self-attention mechanism, pixel relevance feature extraction on the image to be recognized through the block encoding representation, to obtain an image attention weight representation and an image feature representation of the image to be recognized; performing feature indexing on the image attention weight representation to obtain an attention weight index representation containing the information with the strongest attention intensity and the information with the weakest attention intensity; performing feature indexing on the image feature representation based on the attention weight index representation to obtain an image feature index representation; and performing image recognition based on the image feature index representation to obtain an image recognition result. By considering the features of both the regions with strong attention intensity and the regions with weak attention intensity, embodiments of the application can improve the accuracy of fine-grained image recognition.

Inventors

ZENG XINGWEN
YANG CHAOXIANG
WU YANGFENG
HOU KAIWEI

Assignees

北京浠谷科技有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2025-02-11
Priority Date: 2025-01-08

Claims (9)

1. An image recognition method, comprising: acquiring an image to be recognized, wherein the image to be recognized comprises a plurality of image blocks; performing feature encoding on the image to be recognized to obtain a block encoding representation of each image block; performing, based on a multi-head self-attention mechanism, pixel relevance feature extraction on the image to be recognized through the block encoding representation, to obtain an image attention weight representation and an image feature representation of the image to be recognized; performing feature indexing on the image attention weight representation to obtain an attention weight index representation, wherein the attention weight index representation comprises a set of block attention weight representations with the strongest attention intensity and a set of block attention weight representations with the weakest attention intensity in the image attention weight representation; performing feature indexing on the image feature representation based on the attention weight index representation to obtain an image feature index representation; and performing image recognition based on the image feature index representation to obtain an image recognition result; wherein performing feature indexing on the image feature representation based on the attention weight index representation to obtain the image feature index representation comprises: indexing the block feature representations in the image feature representation based on the attention weight index representation to obtain a first target feature representation and a second target feature representation, wherein the first target feature representation is composed of the block feature representations corresponding to the block attention weight representations with the strongest attention intensity, and the second target feature representation is composed of the block feature representations corresponding to the block attention weight representations with the weakest attention intensity; and splicing the first target feature representation and the second target feature representation to obtain the image feature index representation.
2. The image recognition method according to claim 1, wherein performing feature encoding on the image to be recognized to obtain the block encoding representation of each image block comprises: performing linear mapping on the image block to obtain a linear mapping vector; and performing position embedding on the linear mapping vector to obtain the block encoding representation.
3. The image recognition method according to claim 1, wherein performing, based on the multi-head self-attention mechanism, pixel relevance feature extraction on the image to be recognized through the block encoding representation to obtain the image attention weight representation and the image feature representation of the image to be recognized comprises: determining a block attention weight representation of the block encoding representation based on the multi-head self-attention mechanism; splicing and compressing the block attention weight representations of all the image blocks to obtain the image attention weight representation; and performing feature mapping on the image attention weight representation to obtain the image feature representation.
4. The image recognition method according to claim 1, wherein performing feature indexing on the image attention weight representation to obtain the attention weight index representation comprises: performing cumulative multiplication on the image attention weight representation to obtain an attention weight cumulative multiplication representation; and indexing, in the attention weight cumulative multiplication representation, the block attention weight representation with the strongest attention intensity and the block attention weight representation with the weakest attention intensity to obtain the attention weight index representation.
5. The image recognition method according to claim 1, wherein splicing the first target feature representation and the second target feature representation to obtain the image feature index representation comprises: performing a weighting operation on the first target feature representation using a first learnable parameter and then an activation operation to obtain a first intermediate parameter, and performing a weighting operation on the second target feature representation using a second learnable parameter and then an activation operation to obtain a second intermediate parameter; calculating a first weight parameter and a second weight parameter based on the first intermediate parameter and the second intermediate parameter, constructing a value interval for the first weight parameter and the second weight parameter, and determining the values of the first weight parameter and the second weight parameter; and performing a weighted fitting operation on the first target feature representation and the second target feature representation using the first weight parameter and the second weight parameter to obtain the image feature index representation.
6. The image recognition method according to claim 1, wherein performing image recognition based on the image feature index representation to obtain the image recognition result comprises: normalizing the image feature index representation to obtain an image feature sequence; adding a classification head to the image feature sequence to construct an image classification feature representation; and performing multi-label classification on the image classification feature representation to obtain the image recognition result.
7. An image recognition apparatus, comprising: a first module, configured to acquire an image to be recognized; a second module, configured to perform feature encoding on the image to be recognized to obtain a block encoding representation of each image block; a third module, configured to perform, based on a multi-head self-attention mechanism, pixel relevance feature extraction on the image to be recognized through the block encoding representation, to obtain an image attention weight representation and an image feature representation of the image to be recognized; a fourth module, configured to perform feature indexing on the image attention weight representation to obtain an attention weight index representation, wherein the attention weight index representation comprises a set of block attention weight representations with the strongest attention intensity and a set of block attention weight representations with the weakest attention intensity in the image attention weight representation; a fifth module, configured to perform feature indexing on the image feature representation based on the attention weight index representation to obtain an image feature index representation; and a sixth module, configured to perform image recognition based on the image feature index representation to obtain an image recognition result; wherein performing feature indexing on the image feature representation based on the attention weight index representation to obtain the image feature index representation comprises: indexing the block feature representations in the image feature representation based on the attention weight index representation to obtain a first target feature representation and a second target feature representation, wherein the first target feature representation is composed of the block feature representations corresponding to the block attention weight representations with the strongest attention intensity, and the second target feature representation is composed of the block feature representations corresponding to the block attention weight representations with the weakest attention intensity; and splicing the first target feature representation and the second target feature representation to obtain the image feature index representation.
8. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the image recognition method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the image recognition method of any one of claims 1 to 6.
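
As an illustration of the indexing and splicing recited in claim 1, the following is a minimal NumPy sketch and not the patented implementation. The function name `index_and_splice`, the toy feature matrix, the per-block scores, and the choice of two blocks per group are all assumptions made for this example; the claims do not fix how many blocks are selected.

```python
import numpy as np

def index_and_splice(features, strong_idx, weak_idx):
    """Gather block features at the strongest and weakest attention indices
    and splice (concatenate) them along the block axis."""
    first_target = features[strong_idx]    # blocks with strongest attention
    second_target = features[weak_idx]     # blocks with weakest attention
    return np.concatenate([first_target, second_target], axis=0)

# Hypothetical data: 6 blocks with 2-dimensional features and one
# attention score per block (assumed to be computed elsewhere).
features = np.arange(12, dtype=float).reshape(6, 2)
score = np.array([0.05, 0.30, 0.10, 0.25, 0.02, 0.28])

order = np.argsort(score)                  # ascending attention intensity
strong_idx = order[::-1][:2]               # two strongest blocks
weak_idx = order[:2]                       # two weakest blocks
indexed = index_and_splice(features, strong_idx, weak_idx)
```

Plain concatenation is used here because claim 1 only requires splicing; the learnable weighting of claim 5 would replace the final `np.concatenate` step and is not sketched.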

Description

Image recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular to an image recognition method, apparatus, device, and storage medium.
Background
Image classification can be divided into coarse-grained and fine-grained classification. Fine-grained recognition is an important branch of computer vision that aims to accurately distinguish categories whose images differ only in subtle ways. In the related art, deep learning methods such as region proposal networks (Region Proposal Networks, RPN) can be used to perform fine-grained recognition tasks. However, current image recognition methods focus too heavily on the regions of strong attention in an image and ignore the regions of weak attention, which may contain important classification feature information missed by the strong-attention regions. As a result, fine-grained recognition accuracy is reduced.
Disclosure of Invention
The present application aims to provide an image recognition method, apparatus, device and storage medium that improve the accuracy of fine-grained image recognition by considering the features of both the regions of strong attention and the regions of weak attention.
An embodiment of the application provides an image recognition method, comprising: acquiring an image to be recognized, wherein the image to be recognized comprises a plurality of image blocks; performing feature encoding on the image to be recognized to obtain a block encoding representation of each image block; performing, based on a multi-head self-attention mechanism, pixel relevance feature extraction on the image to be recognized through the block encoding representation, to obtain an image attention weight representation and an image feature representation of the image to be recognized; performing feature indexing on the image attention weight representation to obtain an attention weight index representation, wherein the attention weight index representation comprises a block attention weight representation with the strongest attention intensity and a block attention weight representation with the weakest attention intensity in the image attention weight representation; performing feature indexing on the image feature representation based on the attention weight index representation to obtain an image feature index representation; and performing image recognition based on the image feature index representation to obtain an image recognition result.
In some embodiments, performing feature encoding on the image to be recognized to obtain the block encoding representation of each image block includes: performing linear mapping on the image block to obtain a linear mapping vector; and performing position embedding on the linear mapping vector to obtain the block encoding representation.
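
The linear mapping and position embedding steps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the function name `encode_blocks`, the non-overlapping square blocks, the random projection matrix `w_proj`, and the additive position embedding `pos_embed` are all hypothetical choices that the text above does not specify.

```python
import numpy as np

def encode_blocks(image, block_size, w_proj, pos_embed):
    """Split an H x W x C image into non-overlapping blocks, linearly map each
    flattened block, and add a position embedding (assumed to be additive)."""
    h, w, c = image.shape
    gh, gw = h // block_size, w // block_size
    # Crop to a whole number of blocks, then flatten each block to one row.
    blocks = image[: gh * block_size, : gw * block_size]
    blocks = blocks.reshape(gh, block_size, gw, block_size, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    linear = blocks @ w_proj        # linear mapping vector for each block
    return linear + pos_embed       # block encoding representation

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                      # toy image
block_size, dim = 8, 16                              # assumed sizes
w_proj = rng.normal(size=(block_size * block_size * 3, dim))
pos_embed = rng.normal(size=(16, dim))               # one row per block (4 x 4 grid)
codes = encode_blocks(image, block_size, w_proj, pos_embed)
```

In a trained model `w_proj` and `pos_embed` would be learned parameters; random values are used here only to make the shapes concrete.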
In some embodiments, performing, based on the multi-head self-attention mechanism, pixel relevance feature extraction on the image to be recognized through the block encoding representation to obtain the image attention weight representation and the image feature representation of the image to be recognized includes: determining a block attention weight representation of the block encoding representation based on the multi-head self-attention mechanism; splicing and compressing the block attention weight representations of all the image blocks to obtain the image attention weight representation; and performing feature mapping on the image attention weight representation to obtain the image feature representation.
In some embodiments, performing feature indexing on the image attention weight representation to obtain the attention weight index representation includes: performing cumulative multiplication on the image attention weight representation to obtain an attention weight cumulative multiplication representation; and indexing, in the attention weight cumulative multiplication representation, the block attention weight representation with the strongest attention intensity and the block attention weight representation with the weakest attention intensity to obtain the attention weight index representation.
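
One possible reading of the attention weight computation and the cumulative multiplication index is sketched below in NumPy. Several points are assumptions that the text does not confirm: that "compressing" means averaging over heads, that the cumulative multiplication is a matrix product of per-layer attention maps, that a block's attention intensity is the mean attention it receives, and that the same block encodings feed every layer (a simplification). The names `head_attention` and `attention_index` and the value `k=4` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(codes, w_q, w_k):
    """Per-head attention weights; w_q and w_k have shape (heads, dim, d_head)."""
    q = np.einsum("nd,hde->hne", codes, w_q)
    k = np.einsum("nd,hde->hne", codes, w_k)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(w_q.shape[-1])
    return softmax(scores, axis=-1)              # shape (heads, n, n)

def attention_index(layer_weights, k):
    """Multiply the per-layer attention maps together, score each block, and
    return the indices of the k strongest and k weakest blocks."""
    rollout = layer_weights[0]
    for w in layer_weights[1:]:
        rollout = w @ rollout                    # cumulative multiplication
    score = rollout.mean(axis=0)                 # attention each block receives
    order = np.argsort(score)                    # ascending intensity
    return order[::-1][:k], order[:k], score

rng = np.random.default_rng(0)
n, dim, heads, d_head, layers = 16, 16, 4, 4, 3  # assumed sizes
codes = rng.normal(size=(n, dim))                # stand-in block encodings
layer_weights = []
for _ in range(layers):
    w_q = rng.normal(size=(heads, dim, d_head))
    w_k = rng.normal(size=(heads, dim, d_head))
    # Assumed "compression": average the heads into one (n, n) map per layer.
    layer_weights.append(head_attention(codes, w_q, w_k).mean(axis=0))
strong_idx, weak_idx, score = attention_index(layer_weights, k=4)
```

Because every per-layer map is row-stochastic, their product is too, so the block scores sum to one and can be compared directly when picking the strongest and weakest blocks.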
In some embodiments, performing feature indexing on the image feature representation based on the attention weight index representation to obtain the image feature index representation includes: indexing the block feature representations in the image feature representation based on the attention weight index representation to obtain a first target feature representation and a second target feature representation, wherein the first target feature representation is composed of the block feature representations corresponding to the block attention weight representations with the strongest attention intensity, and the second target feature representation is composed of the block feature representations corresponding to the block attention weight representations with the weakest attention intensity; and splicing the first target featur