CN-121999043-A - Visual position identification method and system based on full feature aggregation
Abstract
The application provides a visual position recognition method and system based on full feature aggregation. The method comprises: obtaining visual image data of a vehicle at the current moment; inputting the visual image data into a preset position recognition model so that the position recognition model sequentially carries out image convolution and full feature aggregation on the visual image data to obtain a corresponding vehicle position feature vector; retrieving a corresponding positioning feature vector from a target database according to the vehicle position feature vector, wherein the target database is loaded in advance according to the current area of the vehicle and is constructed, in a map-building mode, from a plurality of historical visual image data of the current area of the vehicle; and generating a visual position recognition result at the current moment according to the map-building pose information of the positioning feature vector and carrying out visual positioning correction on the parking process of the vehicle according to the visual position recognition result, thereby improving the accuracy of visual position recognition.
Inventors
- SONG KE
- YU CHUNLIANG
- YU JI
- LIU GUOQING
- YANG GUANG
- WANG QICHENG
- HUANG LIANG
Assignees
- 深圳佑驾创新科技股份有限公司 (Shenzhen Youjia Innovation Technology Co., Ltd.)
Dates
- Publication Date
- 20260508
- Application Date
- 20260120
Claims (10)
- 1. A visual position recognition method based on full feature aggregation, comprising: acquiring visual image data of a vehicle at the current moment; inputting the visual image data into a preset position recognition model, so that the position recognition model sequentially carries out image convolution and full feature aggregation on the visual image data to obtain a corresponding vehicle position feature vector, wherein the position recognition model is obtained based on a convolutional neural network model and a multi-layer perceptron model in a mixed construction mode; retrieving a corresponding positioning feature vector from a target database according to the vehicle position feature vector, wherein the target database is loaded in advance according to the current area of the vehicle, and the target database is constructed, in a map-building mode, from a plurality of historical visual image data of the current area of the vehicle; and generating a visual position recognition result at the current moment according to the map-building pose information of the positioning feature vector, and carrying out visual positioning correction on the parking process of the vehicle according to the visual position recognition result.
- 2. The visual position recognition method based on full feature aggregation according to claim 1, wherein the acquiring visual image data of a vehicle at the current moment comprises: acquiring an original visual image frame of the vehicle at the current moment; carrying out image cropping on the original visual image frame based on a preset target area to obtain a cropped image frame; scaling the cropped image frame to a preset size to obtain a scaled image frame; carrying out format conversion on the scaled image frame based on a preset color-coding sampling format to obtain a format-converted image frame; and converting the format-converted image frame into floating-point tensor data and carrying out normalization processing to obtain the visual image data.
- 3. The visual position recognition method based on full feature aggregation according to claim 1, wherein the inputting the visual image data into a preset position recognition model so that the position recognition model sequentially carries out image convolution and full feature aggregation on the visual image data to obtain a corresponding vehicle position feature vector comprises: inputting the visual image data into the position recognition model so as to generate a corresponding intermediate-layer feature map through a convolutional backbone network of the position recognition model; carrying out full feature aggregation on the intermediate-layer feature map through a plurality of residual-connection-based feature aggregators in the position recognition model to obtain a corresponding mixed feature vector; and sequentially carrying out channel projection, row projection and a flattening operation on the mixed feature vector to obtain the vehicle position feature vector.
- 4. The visual position recognition method based on full feature aggregation according to claim 1, wherein the position recognition model being obtained based on a convolutional neural network model and a multi-layer perceptron model in a mixed construction mode comprises: collecting a plurality of historical visual image data with pose information in a real scene of a preset area; generating a plurality of triplet training samples according to the historical visual image data and the spatial Euclidean distances between the historical visual image data, wherein each spatial Euclidean distance is obtained based on the pose information; obtaining an initial position recognition model based on the convolutional neural network model and the multi-layer perceptron model in a mixed construction mode, wherein the convolutional neural network model is an initial convolutional backbone network, and the multi-layer perceptron model is a plurality of initial residual-connection-based feature aggregators; and training the initial position recognition model according to a preset distance perception loss function and the triplet training samples to obtain the position recognition model.
- 5. The visual position recognition method based on full feature aggregation according to claim 4, wherein the generating a plurality of triplet training samples according to the historical visual image data and the spatial Euclidean distances between the historical visual image data comprises: respectively determining a positive sample set and a negative sample set corresponding to each historical visual image data according to the spatial Euclidean distances between the historical visual image data, wherein the positive sample set of any current historical visual image data consists of a plurality of first historical visual image data whose spatial Euclidean distance to the current historical visual image data is smaller than a preset distance threshold, and the negative sample set of the current historical visual image data consists of a plurality of second historical visual image data whose spatial Euclidean distance to the current historical visual image data is greater than or equal to the preset distance threshold; and generating a preset number of triplet training samples for each current historical visual image data according to the corresponding positive sample set, the corresponding negative sample set and a preset sampling rule, wherein each triplet training sample comprises the first historical visual image data, the current historical visual image data and the second historical visual image data.
- 6. The visual position recognition method based on full feature aggregation according to claim 5, wherein the training the initial position recognition model according to a preset distance perception loss function and the triplet training samples to obtain the position recognition model comprises: determining a first spatial Euclidean distance of each triplet training sample according to the spatial Euclidean distance between the current historical visual image data and the first historical visual image data in the triplet training sample; determining a second spatial Euclidean distance of each triplet training sample according to the spatial Euclidean distance between the current historical visual image data and the second historical visual image data in the triplet training sample; determining an interval level parameter corresponding to each triplet training sample according to the first spatial Euclidean distance and the second spatial Euclidean distance of the triplet training sample; respectively inputting each triplet training sample into the initial position recognition model so that the initial position recognition model respectively generates a corresponding triplet feature vector; determining a first cosine similarity of each triplet feature vector according to the cosine similarity between the current feature vector and the first feature vector in the triplet feature vector; determining a second cosine similarity of each triplet feature vector according to the cosine similarity between the current feature vector and the second feature vector in the triplet feature vector; and calculating corresponding loss function values through the distance perception loss function according to the first spatial Euclidean distances, the second spatial Euclidean distances, the first cosine similarities, the second cosine similarities and the interval level parameters, and further carrying out parameter optimization on the initial position recognition model according to the loss function values to obtain the position recognition model.
- 7. The visual position recognition method based on full feature aggregation according to claim 1, wherein the retrieving a corresponding positioning feature vector from the target database according to the vehicle position feature vector comprises: acquiring floor information of the vehicle at the current moment; screening a plurality of candidate feature vectors from the target database according to the floor information; and determining the positioning feature vector from the candidate feature vectors according to the cosine similarity between the vehicle position feature vector and each candidate feature vector in the target database.
- 8. The visual position recognition method according to claim 7, wherein the screening a plurality of candidate feature vectors from the target database according to the floor information comprises: querying and determining a plurality of first candidate feature vectors from the target database according to the floor information; if the number of the first candidate feature vectors is greater than a preset number threshold, taking the first candidate feature vectors as the plurality of candidate feature vectors; and if the number of the first candidate feature vectors is smaller than or equal to the preset number threshold, taking all feature vectors in the target database as the candidate feature vectors.
- 9. The visual position recognition method according to claim 7, wherein the determining the positioning feature vector from the candidate feature vectors according to the cosine similarity between the vehicle position feature vector and each candidate feature vector in the target database comprises: carrying out norm calculation on the vehicle position feature vector to obtain a corresponding query norm; carrying out norm range filtering and norm ratio filtering on the candidate feature vectors according to the query norm and the norm of each candidate feature vector so as to screen a plurality of screened feature vectors from the candidate feature vectors; and respectively calculating the cosine similarity between the vehicle position feature vector and each screened feature vector, and determining the screened feature vector corresponding to the maximum cosine similarity as the positioning feature vector.
- 10. A visual position recognition system based on full feature aggregation, characterized by comprising an acquisition module, a position recognition module, a retrieval module and a positioning module; the acquisition module is used for acquiring visual image data of a vehicle at the current moment; the position recognition module is used for inputting the visual image data into a preset position recognition model so that the position recognition model sequentially carries out image convolution and full feature aggregation on the visual image data to obtain a corresponding vehicle position feature vector, wherein the position recognition model is obtained based on a convolutional neural network model and a multi-layer perceptron model in a mixed construction mode; the retrieval module is used for retrieving a corresponding positioning feature vector from a target database according to the vehicle position feature vector, wherein the target database is loaded in advance according to the current area of the vehicle, and the target database is constructed, in a map-building mode, from a plurality of historical visual image data of the current area of the vehicle; and the positioning module is used for generating a visual position recognition result at the current moment according to the map-building pose information of the positioning feature vector and carrying out visual positioning correction on the parking process of the vehicle according to the visual position recognition result.
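The preprocessing pipeline of claim 2 (crop by a target region, scale to a preset size, convert the color coding, then normalize a floating-point tensor) can be sketched as below. This is a minimal NumPy illustration, not the patent's implementation: the crop box, output size, BGR-to-RGB reordering, the nearest-neighbour resize, and the ImageNet-style mean/std values are all illustrative assumptions.

```python
import numpy as np

def preprocess_frame(frame_bgr, crop_box=(0, 80, 640, 320), size=(224, 224)):
    """Crop -> scale -> color reorder -> float tensor + normalization (sketch)."""
    x, y, w, h = crop_box                      # hypothetical target region (x, y, w, h)
    cropped = frame_bgr[y:y + h, x:x + w]      # image cropping to the preset area
    # Nearest-neighbour index sampling as a stand-in for a real resize routine
    rows = np.linspace(0, h - 1, size[0]).astype(int)
    cols = np.linspace(0, w - 1, size[1]).astype(int)
    resized = cropped[rows][:, cols]
    rgb = resized[..., ::-1]                   # assumed BGR -> RGB channel reorder
    tensor = rgb.astype(np.float32) / 255.0    # floating-point tensor in [0, 1]
    mean = np.array([0.485, 0.456, 0.406], np.float32)  # assumed ImageNet stats
    std = np.array([0.229, 0.224, 0.225], np.float32)
    return (tensor - mean) / std               # normalized HWC tensor

# Example: a 480x640 camera frame becomes a normalized 224x224x3 tensor
frame = np.zeros((480, 640, 3), dtype=np.uint8)
out = preprocess_frame(frame)
```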
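Claim 5's triplet construction (positives closer than a spatial Euclidean distance threshold, negatives at or beyond it) can be sketched as follows. The threshold value, sampling count, and uniform random sampling rule are assumptions for illustration; the patent's "preset sampling rule" is not disclosed.

```python
import numpy as np

def build_triplets(poses, dist_thresh=10.0, n_per_anchor=2, rng=None):
    """Sample (anchor, positive, negative) index triplets by spatial distance."""
    rng = rng or np.random.default_rng(0)
    poses = np.asarray(poses, dtype=float)
    # Pairwise spatial Euclidean distances between all mapped poses
    d = np.linalg.norm(poses[:, None, :] - poses[None, :, :], axis=-1)
    triplets = []
    for a in range(len(poses)):
        pos = [i for i in range(len(poses)) if i != a and d[a, i] < dist_thresh]
        neg = [i for i in range(len(poses)) if d[a, i] >= dist_thresh]
        if not pos or not neg:
            continue  # anchor has no valid positive/negative set
        for _ in range(n_per_anchor):  # assumed uniform sampling rule
            triplets.append((a, int(rng.choice(pos)), int(rng.choice(neg))))
    return triplets

# Example: two nearby poses and one far away yield 2 anchors x 2 triplets
poses = [[0.0, 0.0], [1.0, 0.0], [50.0, 0.0]]
triplets = build_triplets(poses, dist_thresh=10.0, n_per_anchor=2)
```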
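Claim 6 combines the two spatial distances, the two cosine similarities, and an "interval level parameter" in a distance perception loss, but does not disclose the exact formula. One plausible reading is a hinge loss whose margin grows with the spatial gap between negative and positive, so spatially distant negatives must be pushed further apart in feature space. The level thresholds and base margin below are invented for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def distance_aware_triplet_loss(f_a, f_p, f_n, d_pos, d_neg,
                                base_margin=0.1, level_thresholds=(5.0, 20.0)):
    """Hinge triplet loss with a margin scaled by an interval level (a sketch,
    NOT the patent's disclosed formula)."""
    gap = d_neg - d_pos                            # spatial gap in meters
    level = sum(gap > t for t in level_thresholds)  # assumed interval level: 0, 1 or 2
    margin = base_margin * (1 + level)
    s_pos = cosine(f_a, f_p)                       # first cosine similarity
    s_neg = cosine(f_a, f_n)                       # second cosine similarity
    return max(0.0, margin + s_neg - s_pos)        # hinge on the similarity gap

# A well-separated triplet incurs zero loss; a confused one incurs positive loss
loss_ok = distance_aware_triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                                      d_pos=2.0, d_neg=30.0)
loss_bad = distance_aware_triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                                       d_pos=2.0, d_neg=30.0)
```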
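The retrieval step of claim 9 (norm-based pre-filtering of candidates, then argmax cosine similarity) can be sketched as below. The norm-ratio band and the fallback to the full candidate set are assumptions; the patent does not give its filtering bounds, and it additionally describes a separate norm range filter not modeled here.

```python
import numpy as np

def retrieve(query, db, norm_ratio=(0.8, 1.25)):
    """Norm-ratio filter on candidates, then pick the max-cosine-similarity one."""
    qn = np.linalg.norm(query)                     # query norm
    norms = np.linalg.norm(db, axis=1)             # candidate norms
    ratio = norms / qn
    keep = np.where((ratio >= norm_ratio[0]) & (ratio <= norm_ratio[1]))[0]
    if keep.size == 0:
        keep = np.arange(len(db))                  # assumed fallback: keep all
    sims = db[keep] @ query / (norms[keep] * qn)   # cosine similarities on survivors
    best = int(keep[int(np.argmax(sims))])
    return best, float(sims.max())

# Example: the third candidate is filtered out by its norm; the first wins on cosine
db = np.array([[0.9, 0.1], [0.0, 1.0], [10.0, 0.0]])
best, sim = retrieve(np.array([1.0, 0.0]), db)
```

Pre-filtering by norm is a cheap coarse cut that avoids computing cosine similarity against every database vector, which matters when the target database holds many mapped key frames per parking area.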
Description
Visual position recognition method and system based on full feature aggregation

Technical Field

The application relates to the technical field of visual position recognition, and in particular to a visual position recognition method and system based on full feature aggregation.

Background

Visual Place Recognition (VPR) is a key technology in autonomous parking systems, mainly used to achieve global repositioning of a vehicle. When the vehicle runs in a mapped area, VPR identifies the current accurate position of the vehicle by comparing the real-time image with pre-stored key-frame images in the map database, thereby providing a reliable initialization pose for a simultaneous localization and mapping (SLAM) system or correcting accumulated errors; it is a core link in ensuring the robustness of parking localization.

Existing VPR techniques can be broadly divided into two categories. One category comprises classical methods based on handcrafted features, such as the bag-of-words model (BoW) and the Vector of Locally Aggregated Descriptors (VLAD). Such methods build a global representation of an image by extracting and aggregating manually designed local features such as SIFT and ORB. These methods are simple to implement and highly interpretable, but the distinguishing capability of handcrafted features is limited when repeated textures, intense illumination changes, dynamic occlusion and viewpoint disturbance commonly occur in parking scenes such as underground garages; similar places are easily confused, and the requirement for high-precision, fine-grained repositioning is difficult to support. The other category comprises deep-learning-based methods such as NetVLAD and generalized mean pooling (GeM).
Such methods use a Convolutional Neural Network (CNN) to automatically extract image feature maps and generate global descriptors for similarity retrieval through a learnable aggregation layer, so their overall performance is superior to that of handcrafted methods. However, the inherently local receptive field of the convolution operation, together with the insufficient fine-grained distinguishing capability of traditional feature aggregation on similar scenes, limits the modeling of long-range spatial relationships and cross-viewpoint changes in images, and fine-grained mismatches can still occur in parking scenes with highly similar structures.

Disclosure of Invention

In view of the above technical problems, the application provides a visual position recognition method and system based on full feature aggregation, which improve the accuracy of visual position recognition.

In a first aspect, an embodiment of the present application provides a visual position recognition method based on full feature aggregation, including: acquiring visual image data of a vehicle at the current moment; inputting the visual image data into a preset position recognition model, so that the position recognition model sequentially carries out image convolution and full feature aggregation on the visual image data to obtain a corresponding vehicle position feature vector, wherein the position recognition model is obtained based on a convolutional neural network model and a multi-layer perceptron model in a mixed construction mode; retrieving a corresponding positioning feature vector from a target database according to the vehicle position feature vector, wherein the target database is loaded in advance according to the current area of the vehicle, and the target database is constructed, in a map-building mode, from a plurality of historical visual image data of the current area of the vehicle; and generating a visual position recognition result at the current moment according to the map-building pose information of the positioning feature vector, and carrying out visual positioning correction on the parking process of the vehicle according to the visual position recognition result.

The embodiment of the application provides a visual position recognition method based on full feature aggregation, which fundamentally improves the accuracy and robustness of visual position recognition in complex, similar-looking scenes by constructing a position recognition model that mixes a Convolutional Neural Network (CNN) with a multi-layer perceptron (MLP) and by introducing a full feature aggregation mechanism. In automatic driving, especially in parking scenes such as underground garages, traditional methods often fail due to factors such as repeated textures, abrupt illumination changes and viewpoint changes. In this embodiment, local and multi-level spatial features of an image are efficiently extracted through the convolutional neural network model, forming a rich intermediate-layer feature map. And