CN-116843963-B - Image recognition method, device, equipment and storage medium
Abstract
The disclosure provides an image recognition method, apparatus, device, and storage medium. It relates to the field of artificial intelligence, in particular to computer vision, image processing, deep learning, and the like, and can be applied to smart-city scenarios. The method comprises: vectorizing a target image to obtain at least two first feature vectors; encoding the first feature vectors sequentially through each of at least two encoding modules, following the order of the encoding modules in a self-attention network model and a first fusion rule, to obtain second feature vectors, wherein the first fusion rule comprises fusing, at a position between at least one pair of adjacent encoding modules, the feature vectors produced by the preceding encoding module, the number of fused feature vectors being smaller than the number of feature vectors produced by the preceding encoding module; and determining the recognition result of the target image according to the second feature vectors. The present disclosure can significantly reduce the computing resources consumed by the image recognition process.
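As a rough illustration of the fusion step summarized above (reducing the number of tokens between encoder stages so that later stages are cheaper), the following NumPy sketch uses pairwise averaging as a stand-in fusion rule and an identity linear map as a stand-in encoder module; both are assumptions for illustration only, since the patent does not fix a concrete fusion operation.

```python
import numpy as np

def fuse_tokens(tokens: np.ndarray) -> np.ndarray:
    """Fuse adjacent token pairs by averaging, halving the token count.

    `tokens` has shape (n_tokens, dim); n_tokens is assumed even.
    Pairwise averaging is only an illustrative stand-in for the
    patent's unspecified fusion rule.
    """
    n, d = tokens.shape
    return tokens.reshape(n // 2, 2, d).mean(axis=1)

def encode(tokens: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Toy 'encoding module': a linear map standing in for a self-attention block."""
    return tokens @ weight

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 first feature vectors of dimension 8
weight = np.eye(8)

x = encode(tokens, weight)   # encoding module 1
x = fuse_tokens(x)           # target position: 16 tokens -> 8 tokens
x = encode(x, weight)        # encoding module 2 now processes half as many tokens
print(x.shape)               # (8, 8)
```

Because self-attention cost grows quadratically with token count, halving the tokens before later modules is what yields the resource savings the abstract claims.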
Inventors
- NI ZIHAN
- ZHANG CHENGQUAN
- YAO GUN
Assignees
- 北京百度网讯科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20230628
Claims (15)
- 1. An image recognition method, comprising: vectorizing a target image to obtain at least two first feature vectors; encoding the first feature vectors sequentially through each encoding module, following the order of the encoding modules in a self-attention network model and a first fusion rule, to obtain second feature vectors; wherein the first fusion rule comprises: at a target position between at least one pair of adjacent encoding modules, fusing the feature vectors produced by the preceding encoding module of the pair such that the number of fused feature vectors is smaller than the number of feature vectors produced by the preceding encoding module, and inputting the fused feature vectors to the following encoding module of the pair for further encoding; and determining a recognition result of the target image according to the second feature vectors; wherein encoding the first feature vectors sequentially through each encoding module according to the order of the encoding modules and the first fusion rule to obtain the second feature vectors specifically comprises: dividing the encoding modules evenly, in their order, into at least two encoding sets, each encoding set comprising at least one encoding module; determining at least one pair of adjacent target encoding sets among the encoding sets; taking the position between the last encoding module of the first target encoding set and the first encoding module of the second target encoding set of each pair as a target position for fusing feature vectors; and at each target position, fusing the feature vectors produced by the preceding target encoding set of the pair according to a second fusion rule for that target position to obtain the second feature vectors, wherein the second fusion rule of each target position is a rule, preset according to the requirements of the actual scenario, specifying how the feature vectors are fused at that target position.
- 2. The method of claim 1, wherein the target positions comprise at least two; for two adjacent target positions, the width or height of the first feature vectors at the second target position is one half of the width or height of the first feature vectors at the first target position.
- 3. The method of any of claims 1-2, wherein the target positions comprise at least two; for two adjacent target positions, the first target position fuses by width and the second target position fuses by height, or the first target position fuses by height and the second target position fuses by width.
- 4. The method according to any of claims 1-2, wherein the second feature vectors comprise at least two, and determining the recognition result of the target image from the second feature vectors comprises: fusing the second feature vectors such that the number of fused second feature vectors is 1; generating a feature map corresponding to the target image from the fused second feature vector; and recognizing the feature map to obtain the recognition result of the target image.
- 5. The method according to any one of claims 1-2, wherein generating the feature map corresponding to the target image from the fused second feature vector comprises: performing global average pooling on the second feature vector through the self-attention network model to obtain the feature map corresponding to the target image.
- 6. The method of any of claims 1-2, wherein the self-attention network model comprises any of an image classification model, an image detection model, and an image localization model.
- 7. An image recognition apparatus, comprising: a vectorization unit configured to vectorize a target image to obtain at least two first feature vectors; an encoding unit configured to encode the first feature vectors sequentially through each encoding module, following the order of the encoding modules in a self-attention network model and a first fusion rule, to obtain second feature vectors; wherein the first fusion rule comprises: at a target position between at least one pair of adjacent encoding modules, fusing the feature vectors produced by the preceding encoding module of the pair such that the number of fused feature vectors is smaller than the number of feature vectors produced by the preceding encoding module, and inputting the fused feature vectors to the following encoding module of the pair for further encoding; a determining unit configured to determine a recognition result of the target image according to the second feature vectors; and a dividing unit configured to divide the encoding modules evenly, in their order, into at least two encoding sets, each encoding set comprising at least one encoding module; the determining unit being further configured to determine at least one pair of adjacent target encoding sets among the encoding sets, and to take the position between the last encoding module of the first target encoding set and the first encoding module of the second target encoding set of each pair as a target position for fusing feature vectors; the encoding unit being specifically configured to fuse, at each target position, the feature vectors produced by the preceding target encoding set of the pair according to a second fusion rule for that target position to obtain the second feature vectors, wherein the second fusion rule of each target position is a rule, preset according to the requirements of the actual scenario, specifying how the feature vectors are fused at that target position.
- 8. The apparatus of claim 7, wherein the target positions comprise at least two; for two adjacent target positions, the width or height of the first feature vectors at the second target position is one half of the width or height of the first feature vectors at the first target position.
- 9. The apparatus of any of claims 7-8, wherein the target positions comprise at least two; for two adjacent target positions, the first target position fuses by width and the second target position fuses by height, or the first target position fuses by height and the second target position fuses by width.
- 10. The apparatus according to any of claims 7-8, wherein the second feature vectors comprise at least two, and the determining unit is specifically configured to: fuse the second feature vectors such that the number of fused second feature vectors is 1; generate a feature map corresponding to the target image from the fused second feature vector; and recognize the feature map to obtain the recognition result of the target image.
- 11. The apparatus according to any of claims 7-8, wherein the determining unit is specifically configured to perform global average pooling on the second feature vector through the self-attention network model to obtain the feature map corresponding to the target image.
- 12. The apparatus of any of claims 7-8, wherein the self-attention network model comprises any of an image classification model, an image detection model, and an image localization model.
- 13. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
- 14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
- 15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-6.
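Claims 2-3 (and 8-9) describe halving the width or height of the feature vectors at successive target positions, fusing by width at one position and by height at another. A hedged NumPy sketch of one plausible reading, with tokens laid out on an (height, width, dim) grid and neighbour averaging assumed as the fusion operation (the claims do not mandate averaging):

```python
import numpy as np

def fuse_by_width(grid: np.ndarray) -> np.ndarray:
    """Halve the token-grid width by averaging horizontally adjacent tokens.

    `grid` has shape (h, w, dim) with w even; averaging is an illustrative
    assumption, not the operation the claims require.
    """
    h, w, d = grid.shape
    return grid.reshape(h, w // 2, 2, d).mean(axis=2)

def fuse_by_height(grid: np.ndarray) -> np.ndarray:
    """Halve the token-grid height by averaging vertically adjacent tokens."""
    h, w, d = grid.shape
    return grid.reshape(h // 2, 2, w, d).mean(axis=1)

grid = np.ones((4, 4, 8))
print(fuse_by_width(grid).shape)                   # (4, 2, 8): width halved
print(fuse_by_height(fuse_by_width(grid)).shape)   # (2, 2, 8): width then height, as in claim 3
```

Applying the two fusions at different target positions matches claim 2's one-half width-or-height relation between adjacent target positions.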
Description
Image recognition method, device, equipment and storage medium

Technical Field

The disclosure relates to the field of artificial intelligence, in particular to computer vision, image processing, deep learning, and the like, can be applied to smart-city scenarios, and in particular concerns an image recognition method, apparatus, device, and storage medium.

Background

The deep self-attention network (Transformer) framework was first proposed for natural language processing, where Transformers use self-attention mechanisms to capture global context information. In image recognition, a Transformer-based encoder block may divide an input image into image patches, treat each patch as the analogue of a word (token) in a natural language processing task, generate a feature map from the tokens, and recognize the image based on the feature map. Currently, such image recognition methods consume a large amount of computing resources.

Disclosure of Invention

The disclosure provides an image recognition method, apparatus, device, and storage medium that can significantly reduce the computing resources consumed by the image recognition process.
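The ViT-style tokenization described in the Background (dividing an image into patches and treating each patch as a token) can be sketched in NumPy as follows; the function name and patch size are illustrative, and the learned linear projection to the model dimension used in practice is omitted.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns shape (num_patches, patch*patch*C); each row is one "token",
    mirroring how Transformer-based vision models vectorize an image.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    return (image
            .reshape(h // patch, patch, w // patch, patch, c)
            .transpose(0, 2, 1, 3, 4)   # group the two patch axes together
            .reshape(-1, patch * patch * c))

img = np.zeros((224, 224, 3))
print(patchify(img, 16).shape)  # (196, 768): a 14x14 grid of 16x16x3 patches
```

With 196 tokens per image, every self-attention layer works on all pairs of tokens, which is why reducing the token count between encoder stages, as this disclosure proposes, saves computation.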
According to a first aspect of the present disclosure, there is provided an image recognition method comprising: vectorizing a target image to obtain at least two first feature vectors; encoding the first feature vectors sequentially through each of at least two encoding modules, following the order of the encoding modules in a self-attention network model and a first fusion rule, to obtain second feature vectors, wherein the first fusion rule comprises fusing, at a target position between at least one pair of adjacent encoding modules, the feature vectors produced by the preceding encoding module of the pair, the number of fused feature vectors being smaller than the number of feature vectors produced by the preceding encoding module, and inputting the fused feature vectors to the following encoding module of the pair for further encoding; and determining a recognition result of the target image according to the second feature vectors.

According to a second aspect of the present disclosure, there is provided an image recognition apparatus comprising a vectorization unit, an encoding unit, and a determining unit. The vectorization unit is configured to vectorize a target image to obtain at least two first feature vectors. The encoding unit is configured to encode the first feature vectors sequentially through each of at least two encoding modules, following the order of the encoding modules in a self-attention network model and the first fusion rule described in the first aspect, to obtain second feature vectors. The determining unit is configured to determine a recognition result of the target image according to the second feature vectors.

According to a third aspect of the present disclosure, there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect.

According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
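The even division of encoding modules into encoding sets, with fusion at the boundary between adjacent sets, can be sketched in plain Python. The function name is hypothetical, and the assumption that the module count divides evenly is the illustration's, not the patent's:

```python
def target_positions(num_modules: int, num_sets: int) -> list:
    """Divide `num_modules` encoding modules evenly, in order, into
    `num_sets` encoding sets and return the module indices after which
    fusion occurs (the "target positions" between adjacent sets).
    """
    assert num_modules % num_sets == 0, "this sketch assumes an even split"
    size = num_modules // num_sets
    return [size * k for k in range(1, num_sets)]

# A 12-module model split into 3 encoding sets fuses after modules 4 and 8.
print(target_positions(12, 3))  # [4, 8]
```

Each returned index is the position between the last module of one target encoding set and the first module of the next, where a second fusion rule is applied.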
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are provided for a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings: fig. 1 is a schematic flow chart of an image recognition method according to an embodiment of the disclosure