CN-121982082-A - End-to-end visual semantic pose estimation method, equipment and medium for automatic driving
Abstract
The invention discloses an end-to-end visual semantic pose estimation method, equipment and medium for automatic driving, and relates to the technical field of automatic driving. The method obtains a forward scene view through a visual sensor and, after feature extraction and semantic analysis, outputs the category label, position coordinates and detection frame size of each static or dynamic entity in the scene. In the pose prediction module, embedding and space mapping operations are performed on the category labels to generate the node data of a graph neural network, and an embedding operation is performed on the position and size data to form the edge data. After the output of the graph neural network is processed by pooling and a feedforward network, the pose offset of the current pose relative to the target pose is output. The invention enhances the generalization capability of pose estimation based on machine vision and deep learning neural networks, and improves the adaptive pose adjustment capability of automatic driving in complex dynamic scenes.
Inventors
- XIE ZHENHUA
- ZHANG SHANHUA
- ZHANG CHAO
- HAN YONG
- GUAN CHENGHUI
- ZHANG HONGJIE
Assignees
- Shandong Jiaotong University (山东交通学院)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-08
Claims (7)
- 1. An end-to-end visual semantic pose estimation method for automatic driving, characterized by comprising the following steps: an automatic driving carrier carries a visual sensor that captures a forward scene view of the current pose, wherein the forward scene view is a 3-channel RGB view; a visual feature extraction module performs feature extraction on the forward scene view and outputs a high-dimensional feature map, and comprises an image feature extraction neural network and a channel compression unit, wherein the image feature extraction neural network adopts a ResNet convolutional network or a Swin Transformer network; the high-dimensional feature map is input into a visual semantic tag extraction module for semantic analysis, which respectively outputs the category tag data, the position coordinates and the detection frame size data of the static or dynamic entities of the target area in the forward scene view, and the category tag data, the position coordinates and the detection frame size data are respectively input into a pose prediction module; the visual semantic tag extraction module adopts a visual target detection network and comprises a first position encoder, a second position encoder, a Transformer decoder, an initial value generator for static or dynamic entity detection in the forward scene view, a first feedforward neural network and a second feedforward neural network; the pose prediction module comprises a graph neural network module, a category label embedding unit, a target pose view target point category label embedding unit, a space mapping unit, a target pose view target point position and detection frame size embedding unit, a global average pooling unit and a third feedforward neural network, wherein the graph neural network module adopts an improved GATv network, and the improved GATv network comprises a plurality of attention mechanism network layers and ReLU layers; in the pose prediction module, the category label data of the static or dynamic entities of the target area in the forward scene view are sequentially subjected to a category label embedding operation, a target pose view target point category label embedding operation and a space mapping operation, and the operation result is input as node data into the graph neural network; after the output data of the graph neural network passes through the global average pooling unit and the third feedforward neural network, the displacement data of the perspective transformation source points of the combined multiple static or dynamic entities of the target area in the forward scene view of the current pose of the automatic driving carrier, relative to the perspective transformation target points in the target pose view, are output; and the neural networks in the visual feature extraction module, the visual semantic tag extraction module and the pose prediction module are subjected to independent training and joint training; after training is completed, the displacement data of the perspective transformation source points of the combined multiple static or dynamic entities of the target area in the forward scene view of the current pose of the automatic driving carrier, relative to the perspective transformation target points in the target pose view, are predicted, thereby realizing pose adjustment of the automatic driving carrier.
- 2. The end-to-end visual semantic pose estimation method for automatic driving according to claim 1, wherein the specific processing procedure of the visual semantic tag extraction module is as follows: the first position encoder generates position information by combining sine and cosine functions of different frequencies, and the position information is embedded into the high-dimensional feature map through element-wise addition to generate a first feature tensor; the first feature tensor is input into the Transformer encoder, and key feature information is extracted using a multi-head attention mechanism to generate a second feature tensor; a query tensor with embedded position information is generated using the initial value generator for static or dynamic entity detection in the forward scene view and the second position encoder; the second feature tensor and the query tensor are input into the Transformer decoder, and the category and position-size information of the static or dynamic entities in the high-dimensional feature map is computed using a multi-head attention mechanism to generate a third feature tensor; and the third feature tensor is respectively input into the first feedforward neural network and the second feedforward neural network, the first feedforward neural network computes the category tag data of the static or dynamic entities, and the second feedforward neural network computes the position coordinates and detection frame size data of the static or dynamic entity detection frames.
- 3. The end-to-end visual semantic pose estimation method for automatic driving according to claim 1, wherein the specific processing procedure of the pose prediction module is as follows: the output result of the first feedforward neural network is input into the category label embedding unit, which converts the N-dimensional one-hot encoded data into an N×4-dimensional dense tensor and inputs it into the target pose view target point category label embedding unit; in the target pose view target point category label embedding unit, 4 perspective transformation target points are defined with the pixel coordinate system of the forward scene view as the reference coordinate system, the perspective transformation target point label vectors are merged with the N×4-dimensional dense tensor, with the 4 perspective transformation target point label vectors placed sequentially at the front of the merged tensor to form an (N+4)×4-dimensional dense tensor, which is input into the space mapping unit; in the space mapping unit, a spatial projection operation and a mean normalization operation are performed on the (N+4)×4-dimensional dense tensor, and the operation result is used as the nodes and node feature vectors of the subsequent graph and input into the improved GATv network; the output result of the second feedforward neural network is input into the target pose view target point position and detection frame size embedding unit for spatial position offset calculation and size difference calculation, the 4 perspective transformation target points are respectively connected with the N static or dynamic entity position points in the output result of the second feedforward neural network, and these connections are used as the edges of the graph to be constructed and input into the improved GATv network; the feature tensor output by the improved GATv network is sequentially input into the global average pooling unit and the third feedforward neural network, and the displacement data of the perspective transformation source points of the combined multiple static or dynamic entities of the target area in the forward scene view of the current pose, relative to the perspective transformation target points in the target pose view, are output.
- 4. The end-to-end visual semantic pose estimation method for automatic driving according to claim 3, wherein the 4 perspective transformation target points are located at the top-left, top-right, bottom-left and bottom-right of the center point of the target pose view, respectively.
- 5. The end-to-end visual semantic pose estimation method for automatic driving according to claim 1, wherein the first feedforward neural network, the second feedforward neural network and the third feedforward neural network are all fully connected neural networks.
- 6. An electronic device, characterized by comprising a visual sensor, a processor, an AI accelerator, a memory, and an IO and communication module, wherein the visual sensor captures a forward scene view and inputs it into the memory; the AI accelerator implements the neural network operations; the memory stores the processor's running program, the AI accelerator's neural network program, the scene views, and cached runtime data; the IO and communication module performs internal and external communication functions; and the processor controls the running flow of the electronic device and the data interaction and functional logic among all internal components, so that the electronic device executes the end-to-end visual semantic pose estimation method for automatic driving according to any one of claims 1-5.
- 7. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the end-to-end visual semantic pose estimation method for automatic driving according to any one of claims 1-5.
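As an illustration of the first position encoder in claim 2, the combination of sine and cosine functions at different frequencies can be sketched as below. This is a minimal NumPy sketch: the feature-map size (50 positions), the model width (256), and the 10000 frequency base follow the common Transformer convention and are assumptions, not values fixed by the claims.

```python
import numpy as np

def sinusoidal_position_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Position information from sine/cosine functions of different
    frequencies (claim 2). Even channels use sine, odd channels cosine."""
    pos = np.arange(n_positions)[:, None].astype(float)   # (n_positions, 1)
    i = np.arange(d_model)[None, :]                       # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Embed the position information into a (flattened) high-dimensional
# feature map by element-wise addition, yielding the first feature tensor.
features = np.random.randn(50, 256)          # hypothetical H*W x C feature map
first_tensor = features + sinusoidal_position_encoding(50, 256)
```

The resulting first feature tensor would then feed the Transformer encoder's multi-head attention, as claim 2 describes.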
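The node and edge construction of claim 3 can be sketched as follows. This is a hypothetical NumPy sketch: the class count (10), the embedding matrix, the target-point coordinates, and the exact form of the size-difference feature are illustrative assumptions that the claims do not fix.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 5, 4                                          # N detected entities, 4-dim embedding
onehot = np.eye(10)[rng.integers(0, 10, size=N)]     # N x 10 one-hot class labels
W_embed = rng.standard_normal((10, D))               # hypothetical embedding matrix
entity_nodes = onehot @ W_embed                      # N x 4 dense tensor (claim 3)

# 4 perspective transformation target-point label vectors, placed at the
# front of the merged tensor to form an (N+4) x 4 dense tensor.
target_labels = rng.standard_normal((4, D))
nodes = np.vstack([target_labels, entity_nodes])

# Mean normalization (one plausible reading of the space mapping unit's
# normalization step): zero-mean, unit-variance per feature column.
nodes = (nodes - nodes.mean(axis=0)) / (nodes.std(axis=0) + 1e-8)

# Edges: each of the 4 target points connects to all N entity position
# points; edge features combine the spatial position offset and the box
# size (a stand-in for the claimed size-difference calculation).
entity_xywh = rng.random((N, 4))                     # (x, y, w, h) from the 2nd FFN
target_xy = np.array([[0.25, 0.25], [0.75, 0.25],    # assumed normalized positions
                      [0.25, 0.75], [0.75, 0.75]])   # around the view center (claim 4)
edges, edge_feats = [], []
for t in range(4):
    for e in range(N):
        edges.append((t, 4 + e))                     # target-node -> entity-node
        dxy = entity_xywh[e, :2] - target_xy[t]      # spatial position offset
        edge_feats.append(np.concatenate([dxy, entity_xywh[e, 2:]]))
```

The `nodes` array and the `edges`/`edge_feats` lists would then be the node and edge inputs of the improved GATv network.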
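The improved GATv network of claims 1 and 3 (attention layers with ReLU, global average pooling, then a feedforward head producing source-point displacements) can be approximated with a single simplified attention layer. This is a sketch only: the real network's layer count, head count, and scoring function are not specified in this text, the scoring below is a simplified GAT-style variant, and the 8-dimensional head output (an x/y offset for each of the 4 source points) is an assumption.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gat_layer(h, adj, W, a):
    """One simplified graph-attention layer: score each edge with a
    LeakyReLU of the concatenated transformed endpoint features, softmax
    over neighbors, aggregate, then ReLU (the claims' Relu layer)."""
    z = h @ W                                         # (n, d_out)
    n = h.shape[0]
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                pair = np.concatenate([z[i], z[j]])
                scores[i, j] = a @ np.where(pair > 0, pair, 0.2 * pair)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = np.where(np.isfinite(scores), alpha, 0.0)
    alpha /= alpha.sum(axis=1, keepdims=True) + 1e-12
    return relu(alpha @ z)

rng = np.random.default_rng(1)
n_nodes, d_in, d_out = 9, 4, 8                        # 4 target points + 5 entities
h = rng.standard_normal((n_nodes, d_in))
adj = np.ones((n_nodes, n_nodes), dtype=bool)         # fully connected for the sketch
out = gat_layer(h, adj, rng.standard_normal((d_in, d_out)),
                rng.standard_normal(2 * d_out))

pooled = out.mean(axis=0)                             # global average pooling unit
W_head = rng.standard_normal((d_out, 8))              # hypothetical third FFN (one layer)
displacement = pooled @ W_head                        # assumed (dx, dy) x 4 source points
```

In the full method, several such layers would be stacked, trained first independently and then jointly with the feature extraction and semantic tag extraction modules, as claim 1 requires.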
Description
End-to-end visual semantic pose estimation method, equipment and medium for automatic driving
Technical Field
The invention relates to the technical field of automatic driving, in particular to an end-to-end visual semantic pose estimation method, equipment and medium for automatic driving.
Background
The automatic driving pose estimation technology based on visual image registration aligns and matches the environment view captured by a vehicle-mounted sensor with a pre-constructed environment view (or semantic map), and aims to estimate the position and pose of an automatic driving vehicle relative to its surrounding environment. Automatic driving image registration technology can be divided into geometric feature-based registration and semantic feature-based registration. Geometric feature-based registration generally extracts geometric features from images using algorithms such as SIFT and SURF; because such geometric features are sensitive to dynamic targets, illumination changes and the like, it is not suitable for complex dynamic scenes. Semantic feature-based registration extracts semantic information (such as lane lines, buildings and the like) from scene views using semantic segmentation networks such as DeepLab and SETR, and matches the semantic tags against a pre-built semantic map to realize position or pose estimation.
At present, registration technologies based on semantic features generally adopt a "pipeline" layout of a perception module and a decision module: the semantic features are perceived by a deep learning neural network, but the feature matching rules in the decision module must be designed manually. This leads to problems such as insufficient overall optimization of the perception-decision "pipeline", weak generalization capability, manually designed rules being unsuitable for complex dynamic scenes, and difficulty of rapid deployment and application when scenes are switched.
Disclosure of Invention
The invention aims to provide an end-to-end visual semantic pose estimation method, equipment and medium for automatic driving, so as to solve or mitigate at least one of the above technical problems. In order to achieve the above object, the present invention provides the following solutions. An end-to-end visual semantic pose estimation method for automatic driving comprises the following steps: the automatic driving carrier carries a visual sensor that captures a forward scene view of the current pose, wherein the forward scene view is a 3-channel RGB image; the visual feature extraction module performs feature extraction on the forward scene view and outputs a high-dimensional feature map, and comprises an image feature extraction neural network and a channel compression unit, wherein the image feature extraction neural network adopts a ResNet convolutional network or a Swin Transformer network; the high-dimensional feature map is input into the visual semantic tag extraction module for semantic analysis, which respectively outputs the category tag data, the position coordinates and the detection frame size data of the static or dynamic entities of the target area in the forward scene view; the visual semantic tag extraction module adopts a visual target detection network and comprises a first position encoder, a second position encoder, a Transformer decoder, an initial value generator for static or dynamic entity detection in the forward scene view, a first feedforward neural network and a second feedforward neural network; the category tag data, the position coordinates and the detection frame size data of the static or dynamic entities of the target area in the forward scene view are respectively input into the pose prediction module; the pose prediction module comprises an improved GATv network module, a category label embedding unit, a target pose view target point category label embedding unit, a space mapping unit, a target pose view target point position and detection frame size embedding unit, a global average pooling unit and a third feedforward neural network, wherein the improved GATv network is composed of a plurality of attention mechanism network layers and ReLU layers; in the pose prediction module, the category label data of the static or dynamic entities of the target area in the forward scene view are sequentially subjected to a category label embedding operation, a target pose view target point category label embedding operation and a space mapping operation, the operation result is input as node data into the improved GATv network, the position coordinates and detection frame size data are subjected to a target pose view target point position and detection frame size embedding operation, the operation result is input as edge data into the improved GATv network, the impr