CN-115329114-B - Image retrieval method based on attention enhancement and automatic coding fusion
Abstract
The invention discloses an image retrieval method based on attention enhancement and automatic coding fusion, which uses an improved ResNet network to extract a global feature map and a local feature map, obtains global feature descriptors and local feature descriptors from them, calculates image similarity, and obtains the target image through similarity comparison. The method improves the traditional residual block and uses autoencoder coding to effectively unify local features and global features in one network, extracts the most attended regions through an attention mechanism, avoids extra algorithmic overhead, and achieves higher retrieval speed and higher accuracy.
Inventors
- WANG ZHIXIAO
- WANG XIN
- ZHANG JIULONG
- Qu Xiaoe
Assignees
- Xi'an University of Technology (西安理工大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20220706
Claims (3)
- 1. The image retrieval method based on attention enhancement and automatic coding fusion is characterized in that an improved ResNet network is used to extract a global feature map and a local feature map, global feature descriptors and local feature descriptors are obtained from the global feature map and the local feature map, image similarity is calculated, and the target image is obtained through similarity comparison; the method is implemented according to the following steps:
Step 1, improving the traditional ResNet model, training the residual network to form a network backbone, and obtaining a global branch network and a local branch network through the backbone. Step 1 is specifically implemented as follows: step 1.1, improving the residual blocks of ResNet so that the layer order within a block is a batch normalization layer first, then a convolution layer, and finally a ReLU activation layer; step 1.2, training the residual ResNet network on the GoogleLandmark dataset with 100k iterations, a learning rate of 1e-3 and a weight decay of 0.0005; when training is completed, the global branch network and the local branch network are obtained.
Step 2, given an image, obtaining two feature maps through the two branch networks of step 1, namely the global feature map and the local feature map, and extracting the deep activation feature D from the global feature map and the shallow activation feature S from the local feature map.
Step 3, aggregating the deep activation features D extracted in step 2 into one global feature and carrying out global feature learning to obtain a global feature descriptor, and carrying out local feature learning through an attention mechanism on the shallow activation features S extracted in step 2 to obtain a local feature descriptor.
The global feature descriptor of step 3 is obtained as follows: step 3.1a, integrating the feature dimensions with a fully connected linear mapping layer F to complete the extraction of the global feature g, as in (1):
g = F · ( (1/(H_D W_D)) Σ_{h,w} d_{h,w}^p )^(1/p) + b   (1)
where F is the linear mapping matrix, b is the bias, and p is the norm of GeM pooling, with hyperparameter p = 3;
step 3.2a, learning the global feature with a normalized softmax cross-entropy loss, reducing intra-class differences by introducing an ArcFace margin, computed as in (2):
AF(u, c) = cos(arccos(u) + m) if c = 1, and u if c = 0   (2)
where u is the cosine similarity, m is the arc margin with m = 0.1, and c is a binary value indicating the ground-truth class; the cross-entropy loss normalized with softmax is then as in (3):
L_g = −log( exp(γ · AF(ŵ_k^T ĝ, 1)) / Σ_n exp(γ · AF(ŵ_n^T ĝ, y_n)) )   (3)
where γ is a learnable scale parameter initialized to γ = 45.25, ŵ_n is the L2-normalized classifier weight of class n, ĝ is the L2-normalized global feature, and y is the one-hot ground-truth label, equal to 1 for class k; completing the global feature learning yields the global descriptor g.
The local feature descriptor of step 3 is obtained as follows: step 3.1b, representing the local features with an autoencoder (AE) structure, namely adding a 1×1 convolution as an encoder T that reduces the number of channels of the original feature map to obtain a low-dimensional local feature representation; to coordinate training, a further 1×1 convolution is connected behind it as a decoder that reconstructs the original feature map from the low-dimensional features; the loss function of the autoencoder is as in (4):
L_r = (1/(H_S W_S C_S)) Σ_{h,w} ‖ S'_{h,w} − S_{h,w} ‖²   (4)
where S is the input and S' is the reconstruction generated after the deconvolution;
step 3.2b, performing weight assignment on the low-dimensional local features extracted in step 3.1b with an attention network: the selection of local features relies on a small attention module that selects the most distinctive regions; the attention heat map is obtained by a small convolutional network, and the output y of the attention mechanism is the weighted sum of the convolutional features f_n extracted by the network, as in (5):
y = Σ_n a(f_n; θ) · f_n   (5)
where the score function a(·; θ) is trained on the local features, and its parameters θ are trained by back-propagation, with the gradient as in (6):
∂y/∂θ = Σ_n f_n · ∂a(f_n; θ)/∂θ   (6)
to prevent the score function from learning negative weights, its outputs are constrained to be non-negative: the score function is designed as a two-layer CNN with a softplus activation on top, using convolution filters of size 1×1;
step 3.3b, integrating the local features with the attention weights and supervising the generation of the attention map, as in (7):
a' = Σ_{h,w} a(l_{h,w}; θ) · l_{h,w}   (7)
step 3.4b, completing a basic classification task on the local features during training, with a cross-entropy loss as in (8):
L_a = −log( exp(v_k^T a' + b_k) / Σ_n exp(v_n^T a' + b_n) )   (8)
where v_n and b_n are the weights and biases of the auxiliary classifier; the loss function of the whole network training is the sum of the global feature loss, the reconstruction loss and the local feature loss, L = L_g + λ·L_r + β·L_a, where the weight of the reconstruction loss is set to λ = 10 and the weight of the local loss to β = 1; completing the local feature learning yields the local descriptor l = T(S).
Step 4, calculating the similarity between the query image and the GoogleLandmark dataset images from the local feature descriptors and the global feature descriptors, and ranking and selecting the images in the GoogleLandmark dataset by similarity to obtain the target image.
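The ArcFace-margin loss of equations (2) and (3) can be sketched in plain Python. This is a minimal illustration, not the patented implementation; only the margin m = 0.1 and the scale initialization γ = 45.25 come from the claim, and the input cosines are assumed to be precomputed.

```python
import math

def arcface_margin(u: float, c: int, m: float = 0.1) -> float:
    """Equation (2): add the angular margin m to the cosine similarity u
    only for the ground-truth class (c == 1); leave other classes unchanged."""
    return math.cos(math.acos(u) + m) if c == 1 else u

def global_loss(cosines, true_idx: int, gamma: float = 45.25, m: float = 0.1) -> float:
    """Equation (3): softmax cross-entropy over scaled, margin-adjusted cosines."""
    logits = [gamma * arcface_margin(u, int(i == true_idx), m)
              for i, u in enumerate(cosines)]
    peak = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    return -math.log(exps[true_idx] / sum(exps))
```

Because the margin shrinks the true-class logit, the loss with m > 0 upper-bounds the plain softmax loss for the same cosines, which is what forces tighter intra-class clusters.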
- 2. The image retrieval method based on attention enhancement and automatic coding fusion according to claim 1, wherein said step 2 is specifically implemented as follows: given an image, the global feature map and the local feature map are obtained by the hierarchical representation of the convolution layers; the shallow activation feature S in the local feature map is taken from the conv4 output and is denoted S ∈ R^(H_S × W_S × C_S), and the deep activation feature D in the global feature map is taken from the conv5 output and is denoted D ∈ R^(H_D × W_D × C_D), where H, W and C denote the height, width and number of channels in each case; the number of channels of the deep activation feature D is 2048 and its feature dimension is 2048, while the number of channels of the shallow activation feature S is 1024 and its feature dimension is 128.
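Under the shapes stated in this claim, the deep branch can be pooled into a single 2048-dimensional vector with GeM, i.e. equation (1) without the final linear map F and bias b. A small NumPy sketch follows; the 28×28 and 14×14 spatial sizes are illustrative assumptions, and only the channel counts (1024 and 2048) come from the claim.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((28, 28, 1024))   # shallow conv4 activations (spatial size assumed)
D = rng.random((14, 14, 2048))   # deep conv5 activations (spatial size assumed)

def gem_pool(feat: np.ndarray, p: float = 3.0) -> np.ndarray:
    """Generalized-mean pooling over the spatial grid with p = 3 (the claimed value).
    p -> 1 gives average pooling; p -> infinity approaches max pooling."""
    return (np.maximum(feat, 0.0) ** p).mean(axis=(0, 1)) ** (1.0 / p)

g = gem_pool(D)                  # 2048-d pre-projection global feature
```

For any p ≥ 1 the pooled value lies between the per-channel mean and maximum, which is why p = 3 emphasizes strong activations without discarding the rest.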
- 3. The image retrieval method based on attention enhancement and automatic coding fusion according to claim 1, wherein said step 4 is specifically implemented as follows: step 4.1, representing the images by their global descriptors, calculating the distance between the query image and every other image with the Euclidean distance formula, and returning the n nearest images; step 4.2, further ranking and screening the n images of step 4.1 by means of the local descriptors; and step 4.3, carrying out similarity calculation between the query image and the n images ranked and screened in step 4.2 with the Euclidean distance formula, and returning the m images with the highest similarity, these m images being the finally retrieved target images.
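The coarse-to-fine ranking of steps 4.1 to 4.3 can be sketched as follows. The index of random descriptors is a toy stand-in, and the re-ranking stage reuses the Euclidean rule in place of the local-descriptor matching the claim describes; n and m are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(1)
db_global = rng.random((100, 16))        # toy global descriptors for a 100-image index
query = rng.random(16)

n, m = 10, 3                             # candidate-list and result-list sizes
# Step 4.1: Euclidean distance to every indexed image, keep the n nearest
dists = np.linalg.norm(db_global - query, axis=1)
top_n = np.argsort(dists)[:n]

# Steps 4.2-4.3: re-score only the n candidates and keep the m most similar
# (stand-in score is again Euclidean; the patent re-ranks with local descriptors)
rerank = np.linalg.norm(db_global[top_n] - query, axis=1)
top_m = top_n[np.argsort(rerank)[:m]]
```

The point of the two stages is cost: the expensive local matching runs only on the n survivors of the cheap global pass, not on the whole index.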
Description
Image retrieval method based on attention enhancement and automatic coding fusion
Technical Field
The invention belongs to the technical field of computer image processing, and particularly relates to an image retrieval method based on attention enhancement and automatic coding fusion.
Background
At present, heterogeneous data such as images, video, audio and text are growing at a remarkable rate every day. For these massive images containing abundant visual information, how to conveniently, quickly and accurately query and retrieve the images a user requires or is interested in from such massive image libraries is a problem that remains to be solved. Early images were mostly indexed by manual labeling, which suffers from low efficiency, subjectivity and similar defects. TBIR (text-based image retrieval) requires the image uploader to give the necessary labels to the images; the system stores an index of the images according to these labels, and the search engine then retrieves images close to the keywords provided by the user. However, since this method does not consider the content of the image and depends excessively on keywords provided or collected by users, the accuracy of the image search results cannot be ensured. CBIR (content-based image retrieval) was then proposed and has received a great deal of attention.
Such retrieval methods extract bottom-layer features of images and obtain the final result from the similarity among image features, but traditional methods only extract low-level features and suffer from the semantic-gap problem. Convolutional neural networks can extract high-level features with semantic content, yet the deep convolutional ResNet has defects in the arrangement order of its traditional residual blocks: the input feature map is not normalized, which prevents the batch normalization (BN) layer from fully playing its role; at the same time, only a single kind of feature is extracted, so the attended region of the image cannot be judged, and the accuracy of the retrieval result is therefore not very high.
Disclosure of Invention
The invention aims to provide an image retrieval method based on attention enhancement and automatic coding fusion, which solves the prior-art problem that the attended region of an image cannot be judged and the accuracy of the retrieval result is therefore not very high. The technical scheme adopted by the invention is an image retrieval method based on attention enhancement and automatic coding fusion, which uses an improved ResNet network to extract a global feature map and a local feature map, obtains global feature descriptors and local feature descriptors from them, calculates image similarity, and obtains the target image through similarity comparison.
The invention is also characterized in that the image retrieval method based on attention enhancement and automatic coding fusion is implemented according to the following steps:
Step 1, improving the traditional ResNet model, training the residual network to form a network backbone, and obtaining a global branch network and a local branch network through the backbone;
Step 2, given an image, obtaining two feature maps through the two branch networks of step 1, namely a global feature map and a local feature map, and extracting a deep activation feature D from the global feature map and a shallow activation feature S from the local feature map;
Step 3, aggregating the deep activation features D extracted in step 2 into one global feature and carrying out global feature learning to obtain a global feature descriptor, and carrying out local feature learning through an attention mechanism on the shallow activation features S extracted in step 2 to obtain a local feature descriptor;
Step 4, calculating the similarity between the query image and the GoogleLandmark dataset images from the local feature descriptors and the global feature descriptors, and ranking and selecting the images in the GoogleLandmark dataset by similarity to obtain the target image.
Step 1 is specifically implemented according to the following steps: step 1.1, improving the residual blocks of ResNet so that the layer order within a block is a batch normalization layer first, then a convolution layer, and finally a ReLU activation layer; step 1.2, training the residual ResNet network on the GoogleLandmark dataset with 100k iterations, a learning rate of 1e-3 and a weight decay of 0.0005; when training is completed, the global branch network and the local branch network are obtained.
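The BN → convolution → ReLU ordering of step 1.1 can be illustrated with a minimal pre-activation residual block built on a 1×1 convolution. This is a sketch under stated assumptions, not the full ResNet block: the normalization uses whole-map statistics rather than per-channel batch statistics, and the channel width of 8 is arbitrary.

```python
import numpy as np

def batch_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize first, per the claimed ordering (simplified: whole-map statistics
    # instead of per-channel batch statistics, and no learnable scale/shift)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """BN -> 1x1 conv (a per-pixel channel mixing) -> ReLU, plus the skip path."""
    out = np.maximum(batch_norm(x) @ w, 0.0)   # ReLU comes last in the branch
    return x + out                              # identity shortcut

x = np.random.default_rng(2).random((4, 4, 8))  # toy H x W x C feature map
w = np.zeros((8, 8))                            # zero weights -> block is a pure identity
```

With zero branch weights the block reduces to the identity, which is the property that lets very deep stacks of such blocks train stably.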
Step 2 is specifically implemented according to the following steps: given an image, the global feature map and the local feature map are obtained by the hierarchical representation of the convolution layers; the shallow activation feature S in the