CN-116863164-B - Visual position identification method, electronic equipment and medium

CN116863164B

Abstract

The invention discloses a visual position recognition method, electronic equipment, and a medium. The method comprises: acquiring an input image; extracting a feature vector of the input image using a convolutional neural network; training a principal component analysis conversion model based on unsupervised learning, and reconstructing the feature vector of the input image with the principal component analysis conversion model to generate an image description vector; obtaining the image description vectors of existing images from a database, and calculating the similarity between the image description vector of the input image and the image description vectors of the existing images; and, when the maximum similarity is greater than or equal to a similarity threshold, taking the existing image corresponding to the maximum similarity as the similar image of the input image, thereby obtaining a visual position recognition result.

Inventors

  • HU SONGYU
  • SUN WEINING
  • FU JIANZHONG

Assignees

  • Zhejiang University (浙江大学)

Dates

Publication Date
2026-05-12
Application Date
2023-07-03

Claims (9)

  1. A visual position recognition method, the method comprising: acquiring an input image; extracting a feature vector of the input image using a convolutional neural network; training a principal component analysis conversion model based on unsupervised learning, and reconstructing the feature vector of the input image with the principal component analysis conversion model to generate an image description vector; obtaining the image description vectors of existing images from a database, and calculating the similarity between the image description vector of the input image and the image description vector of each existing image, wherein when the maximum similarity is greater than or equal to a similarity threshold, the existing image corresponding to the maximum similarity is taken as the similar image of the input image, thereby obtaining a visual position recognition result; wherein training the principal component analysis conversion model based on unsupervised learning, and reconstructing the feature vector of the input image with the principal component analysis conversion model to generate the image description vector, comprises: acquiring a training data set; extracting the feature vector of each picture in the training data set to construct a training matrix X; performing row-wise de-meaning on the training matrix X; for the de-meaned training matrix, calculating its covariance matrix C together with the eigenvalues λ and eigenvectors v of the covariance matrix C; arranging the eigenvectors v in descending order of the eigenvalues λ, and forming a conversion matrix P according to a converted target dimension K; comprehensively evaluating, on the basis of the principal component vector space, the information compression characteristic as well as the accuracy and the operation efficiency of the algorithm before and after conversion, determining an optimal target dimension K_b, and generating the optimal conversion model according to the optimal target dimension K_b, namely the principal component analysis conversion model P_b; and reconstructing the feature vector of the input image with the principal component analysis conversion model to generate the image description vector, expressed as: y = P_b x, where x represents the image feature vector and y represents the image description vector after principal component conversion.
  2. The method of claim 1, wherein acquiring the input image further comprises contrast enhancement of the input image: converting the input image from the RGB color space to the YUV color space, equalizing the histogram of the Y channel, converting the input image back to the RGB color space and normalizing it, and normalizing the size of the input image based on bicubic interpolation.
  3. The visual position recognition method of claim 1, wherein the convolutional neural network comprises a first convolutional layer, a first ReLU layer, a first max-pooling layer, a second convolutional layer, a second ReLU layer, a second max-pooling layer, a third convolutional layer, and a Sigmoid layer, which are sequentially connected.
  4. The visual position recognition method according to claim 3, wherein the convolutional neural network is constructed as follows: the convolutional neural network adopts the AlexC model, based on the structure and pre-trained parameters of the general convolutional neural network model AlexNet; the first three convolutional layers of AlexNet are used as the main structure of the AlexC model, and the output of the third convolutional layer is processed by the Sigmoid layer in place of the max-pooling and ReLU activation operations; all fully connected layers of AlexNet are removed; the numbers of convolution kernels of the first and second convolutional layers are set to 64 and 192, respectively, and the number of convolution kernels of the third convolutional layer remains unchanged.
  5. The visual position recognition method of claim 4, wherein the expression of the Sigmoid layer is as follows: y_i = 1 / (1 + e^(-x_i)), where x_i is the i-th element value of the feature vector output by the third convolutional layer in the AlexC model, and y_i represents the i-th element value of the feature vector output by the Sigmoid layer; the expression of the ReLU layer is as follows: output = max(0, input), where max() represents a function taking the maximum value, input represents the input data of the ReLU layer, and output represents the output data of the ReLU layer; the ReLU layer outputs 0 for all input data less than or equal to 0, while input data greater than 0 remains unchanged.
  6. The method of claim 1, wherein the training data set consists of highly dynamic visual scene pictures which simultaneously contain changes in illumination, viewing angle, objects, and season, with a balanced proportion of the various change samples.
  7. The visual position recognition method according to claim 1, wherein obtaining the image description vectors of the existing images from the database, calculating the similarity between the image description vector of the input image and the image description vectors of the existing images, and, when the maximum similarity is greater than or equal to the similarity threshold, taking the existing image corresponding to the maximum similarity as the similar image of the input image to obtain the visual position recognition result comprises: calculating the cosine similarity cs between the image description vector I_current of the input image and the image description vectors I_past of all existing images; ordering the cosine similarities to obtain the maximum similarity; and comparing the maximum similarity with the similarity threshold: if the maximum similarity is greater than or equal to the preset threshold, the existing image corresponding to the maximum similarity is taken as the similar image of the input image, and the visual position recognition result is obtained, namely the existing image in the database at a position similar to that of the input image; if the maximum similarity is smaller than the preset threshold, there is no existing image in the database at a position similar to that of the current input image.
  8. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor, the memory is configured to store program data, and the processor is configured to execute the program data to implement the visual position recognition method of any one of claims 1 to 7.
  9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the visual position recognition method according to any one of claims 1 to 7.
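The principal-component conversion model of claim 1 can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the function names are invented for the example, and subtracting the training mean at inference time is an assumption (the claim writes y = P_b x with de-meaning applied during training).

```python
import numpy as np

def fit_pca_transform(X, K):
    """Fit a PCA conversion matrix P from a training matrix X.

    X: (N, D) array, one D-dimensional feature vector per row.
    K: target dimension after conversion.
    Returns (P, mean), where P has shape (K, D).
    """
    mean = X.mean(axis=0)            # per-feature mean (row-wise de-meaning)
    Xc = X - mean                    # de-meaned training matrix
    C = np.cov(Xc, rowvar=False)     # D x D covariance matrix
    lam, v = np.linalg.eigh(C)       # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(lam)[::-1]    # descending order of eigenvalues
    P = v[:, order[:K]].T            # top-K eigenvectors form the conversion matrix
    return P, mean

def describe(x, P, mean):
    """Reconstruct a feature vector into an image description vector: y = P (x - mean)."""
    return P @ (x - mean)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))       # 200 synthetic 16-dimensional feature vectors
P, mu = fit_pca_transform(X, K=4)
y = describe(X[0], P, mu)
print(P.shape, y.shape)
```

Choosing K by jointly evaluating information compression, accuracy, and runtime (the claim's K_b) would sit on top of this routine, e.g. by refitting for several K values and scoring each.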
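The preprocessing of claim 2 can be sketched with NumPy alone. The BT.601 RGB/YUV conversion matrix is an assumption (the patent does not fix a color standard), and the final bicubic size normalization is left to an off-the-shelf routine such as OpenCV's `cv2.resize` with `INTER_CUBIC`.

```python
import numpy as np

# BT.601 RGB -> YUV matrix (an assumption; the patent names only "YUV color space")
RGB2YUV = np.array([[ 0.299,    0.587,    0.114  ],
                    [-0.14713, -0.28886,  0.436  ],
                    [ 0.615,   -0.51499, -0.10001]])
YUV2RGB = np.linalg.inv(RGB2YUV)

def equalize_y_channel(img):
    """Contrast-enhance an RGB uint8 image by equalizing the Y-channel
    histogram, then convert back to RGB and normalize to [0, 1]."""
    yuv = img.astype(np.float64) @ RGB2YUV.T
    y = yuv[..., 0]
    # Histogram equalization of Y via its cumulative distribution function.
    hist, bins = np.histogram(y.ravel(), bins=256, range=(0.0, 255.0))
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1) * 255.0
    yuv[..., 0] = np.interp(y.ravel(), bins[:-1], cdf).reshape(y.shape)
    rgb = np.clip(yuv @ YUV2RGB.T, 0.0, 255.0)
    return rgb / 255.0               # normalized RGB in [0, 1]

img = np.random.default_rng(1).integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
out = equalize_y_channel(img)
print(out.shape)
```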
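The Sigmoid and ReLU expressions of claim 5 correspond to these standard element-wise functions:

```python
import numpy as np

def sigmoid(x):
    """Element-wise Sigmoid: y_i = 1 / (1 + exp(-x_i))."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Element-wise ReLU: output = max(0, input)."""
    return np.maximum(0.0, x)

v = np.array([-2.0, 0.0, 3.0])
print(sigmoid(v))   # values in (0, 1); sigmoid(0) = 0.5
print(relu(v))      # [0. 0. 3.]
```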
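The matching procedure of claim 7 can be sketched as follows; the threshold value 0.9 and the function name `match_place` are illustrative, since the patent leaves the similarity threshold unspecified.

```python
import numpy as np

def match_place(I_current, I_past, threshold=0.9):
    """Match an input description vector against database vectors by cosine similarity.

    Returns (index, similarity) of the best match, or (None, similarity)
    when the maximum similarity falls below the threshold, i.e. the database
    contains no image at a similar position.
    """
    I_past = np.asarray(I_past, dtype=np.float64)
    cs = I_past @ I_current / (
        np.linalg.norm(I_past, axis=1) * np.linalg.norm(I_current))
    best = int(np.argmax(cs))            # ordering yields the maximum similarity
    if cs[best] >= threshold:
        return best, float(cs[best])     # existing image at a similar position
    return None, float(cs[best])         # no similar position in the database

db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sim = match_place(np.array([0.6, 0.8]), db)
print(idx, round(sim, 3))
```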

Description

Visual position identification method, electronic equipment and medium

Technical Field

The present invention relates to the field of computer vision, and in particular to a visual position identification method, an electronic device, and a medium.

Background

Visual location recognition is the task of determining whether two images belong to the same place. Traditional visual position recognition algorithms, with the bag-of-words model as a prominent representative, mainly judge image similarity using hand-designed features extracted from the image. Such algorithms perform well when the scene characteristics resemble the hand-designed features, but as environmental complexity increases, the limited expressive capability of hand-designed features causes the whole algorithm to fail. Current mainstream deep learning visual position recognition algorithms follow two supervised learning approaches: the first performs transfer learning based on a general convolutional neural network model, and the second designs a network model dedicated to the visual position recognition field and trains it from scratch under supervision. In real scenes of high environmental complexity, the accuracy of both approaches is clearly superior to that of traditional algorithms, but the annotated data sets required by the supervised learning process greatly increase the cost of related research. To explore algorithms with low implementation cost, some researchers have introduced unsupervised learning into the field of visual position recognition, for example the denoising convolutional autoencoder structure, successfully removing the dependence on labeled samples. However, unsupervised learning approaches struggle to match supervised ones in detection accuracy, especially in highly dynamic real-world scenarios that include various changes of viewing angle, objects, illumination, season, and the like.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a visual position identification method, an electronic device, and a medium. According to a first aspect of an embodiment of the present invention, there is provided a visual position recognition method, the method comprising: acquiring an input image; extracting a feature vector of the input image using a convolutional neural network; training a principal component analysis conversion model based on unsupervised learning, and reconstructing the feature vector of the input image with the principal component analysis conversion model to generate an image description vector; obtaining the image description vectors of existing images from a database, and calculating the similarity between the image description vector of the input image and the image description vectors of the existing images; and, when the maximum similarity is greater than or equal to a similarity threshold, taking the existing image corresponding to the maximum similarity as the similar image of the input image, thereby obtaining a visual position recognition result. According to a second aspect of an embodiment of the invention, there is provided an electronic device including a memory and a processor, where the memory is coupled to the processor, the memory is configured to store program data, and the processor is configured to execute the program data to implement the above visual location recognition method. According to a third aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the visual position recognition method described above.

The beneficial effects of the invention are as follows: the invention provides a visual position identification method which integrates unsupervised learning and deep learning, maps the output vector of a convolutional neural network model into a principal component space with a clearer dominant vector direction, and improves the cosine similarity between vectors of similar images. In addition, the method requires neither labeling of training data nor retraining of models. The invention thereby improves detection accuracy while ensuring low implementation cost.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below; obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort. FIG. 1 is a flowchart of a visual position recognition method according to an embodiment of the present invention; FIG. 2 is a block diagram of the AlexC convolutional neural network model; FIG. 3 is a graph of the information compression characteristics of principal component vector spaces of different target dimensions; FIG. 4