CN-116152765-B - Semantic representation processing method and device, electronic equipment and computer storage medium

Abstract

The embodiment of the invention provides a semantic representation processing method, a semantic representation processing device, electronic equipment and a computer readable storage medium, which relate to the field of automatic driving. The method comprises the steps of: extracting features from an image to obtain a plane view feature, and extracting features from laser radar data to obtain radar features; carrying out geometric attention fusion on the plane view feature and the radar features to obtain a fused plane view feature and a radar aerial view feature; cutting the fused plane view feature into blocks to obtain a plurality of plane view feature blocks, and marking the plane view feature blocks as semantic marks; cutting the radar aerial view feature into blocks to obtain a plurality of radar aerial view feature blocks, and marking the radar aerial view feature blocks as semantic marks; carrying out semantic fusion on the plane view feature blocks and the radar aerial view feature blocks based on the semantic marks to obtain a semantic representation; and carrying out semantic reconstruction on the semantic representation based on a mask to obtain a reconstructed semantic representation. The embodiment of the invention combines the advantages of geometric fusion and semantic fusion.

Inventors

DUAN YIQUN
GUO XIANDA
ZHU ZHENG

Assignees

北京鉴智科技有限公司

Dates

Publication Date: 20260508
Application Date: 20230202

Claims (10)

1. A method of processing a semantic representation, the method comprising: extracting features from an image to obtain a plane view feature, and extracting features from laser radar data to obtain radar features; performing geometric attention fusion on the plane view features and the radar features to obtain fused plane view features and radar aerial view features; cutting the fused plane view features into blocks to obtain a plurality of plane view feature blocks, marking the plurality of plane view feature blocks as semantic marks, cutting the radar aerial view features into blocks to obtain a plurality of radar aerial view feature blocks, and marking the plurality of radar aerial view feature blocks as semantic marks; performing semantic fusion on the plane view feature blocks and the radar aerial view feature blocks based on the semantic marks to obtain a semantic representation; and performing semantic reconstruction on the semantic representation based on a mask to obtain a reconstructed semantic representation; wherein performing geometric attention fusion on the plane view features and the radar features to obtain the fused plane view features and the radar aerial view features comprises: performing geometric attention fusion on the plane view features and the radar features by adopting monotone to aerial view conversion attention, to obtain the fused plane view features and the radar aerial view features.
2. The method for processing a semantic representation according to claim 1, wherein performing geometric attention fusion on the plane view features and the radar features by adopting monotone to aerial view conversion attention to obtain the fused plane view features and the radar aerial view features comprises: cutting the plane view feature to obtain a plurality of plane view feature vectors, and cutting the radar feature to obtain a plurality of radar feature vectors; geometrically fusing the plurality of plane view feature vectors and the plurality of radar feature vectors to obtain fused feature vectors; and mapping the fused feature vectors into the fused plane view features and the radar aerial view features through an attention mechanism.
3. The method for processing a semantic representation according to claim 1, wherein performing semantic fusion on the plane view feature blocks and the radar aerial view feature blocks based on the semantic marks to obtain the semantic representation comprises: performing spatial feature alignment on the semantic marks of the plane view feature blocks and the radar aerial view feature blocks in a preset semantic space by adopting an encoder with position embedding, to obtain an aligned semantic representation.
4. The method for processing a semantic representation according to claim 1, wherein performing semantic reconstruction on the semantic representation based on the mask to obtain the reconstructed semantic representation comprises: masking the semantic representation according to a preset proportion by using a mask to obtain a semantic representation comprising a masked part; and reconstructing the masked part to obtain the reconstructed semantic representation.
5. A processing apparatus for a semantic representation, the apparatus comprising: an extraction module, used for extracting features from an image to obtain a plane view feature, and extracting features from laser radar data to obtain radar features; a conversion module, used for performing geometric attention fusion on the plane view features and the radar features to obtain fused plane view features and radar aerial view features; a marking module, used for cutting the fused plane view features into blocks to obtain a plurality of plane view feature blocks, marking the plurality of plane view feature blocks as semantic marks, cutting the radar aerial view features into blocks to obtain a plurality of radar aerial view feature blocks, and marking the plurality of radar aerial view feature blocks as semantic marks; an alignment module, used for performing semantic fusion on the plane view feature blocks and the radar aerial view feature blocks based on the semantic marks to obtain a semantic representation; and a reconstruction module, used for performing semantic reconstruction on the semantic representation based on a mask to obtain a reconstructed semantic representation; wherein the conversion module is specifically configured to: perform geometric attention fusion on the plane view features and the radar features by adopting monotone to aerial view conversion attention, to obtain the fused plane view features and the radar aerial view features.
6. The processing apparatus for a semantic representation according to claim 5, wherein the conversion module is specifically configured to: cut the plane view feature to obtain a plurality of plane view feature vectors, and cut the radar feature to obtain a plurality of radar feature vectors; geometrically fuse the plurality of plane view feature vectors and the plurality of radar feature vectors to obtain fused feature vectors; and map the fused feature vectors into the fused plane view features and the radar aerial view features through an attention mechanism.
7. The processing apparatus for a semantic representation according to claim 5, wherein the alignment module is specifically configured to: perform spatial feature alignment on the semantic marks of the plane view feature blocks and the radar aerial view feature blocks in a preset semantic space by adopting an encoder with position embedding, to obtain an aligned semantic representation.
8. The processing apparatus for a semantic representation according to claim 5, wherein the reconstruction module is specifically configured to: mask the semantic representation according to a preset proportion by using a mask to obtain a semantic representation comprising a masked part; and reconstruct the masked part to obtain the reconstructed semantic representation.
9. An electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for processing a semantic representation according to any one of claims 1-4.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for processing a semantic representation according to any one of claims 1-4.
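As an illustrative, non-limiting sketch of the block-cutting step of claim 1 and the proportional masking of claim 4, the following NumPy code cuts two feature maps into blocks, treats each flattened block as one semantic mark (read here as a token, which is an interpretation), and hides a preset proportion of them. The function names `patchify`, `random_mask` and `masked_mse`, the feature shapes, the block size of 4 and the proportion of 0.5 are all assumptions made for this example and are not taken from the claims. The learned reconstruction network is not shown; `masked_mse` only indicates how a reconstruction could be scored on the hidden positions.

```python
import numpy as np

def patchify(feature: np.ndarray, patch: int) -> np.ndarray:
    """Cut a (C, H, W) feature map into non-overlapping patch x patch blocks.

    Returns an array of shape (num_blocks, C * patch * patch), one row per block.
    """
    c, h, w = feature.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    blocks = feature.reshape(c, gh, patch, gw, patch)
    blocks = blocks.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, patch, patch)
    return blocks.reshape(gh * gw, c * patch * patch)

def random_mask(tokens: np.ndarray, ratio: float, rng: np.random.Generator):
    """Hide a fixed proportion of tokens by zeroing them.

    Returns the masked tokens and a boolean vector that is True where a
    token was hidden.
    """
    n = tokens.shape[0]
    n_masked = int(round(n * ratio))
    hidden = np.zeros(n, dtype=bool)
    hidden[rng.choice(n, size=n_masked, replace=False)] = True
    masked = tokens.copy()
    masked[hidden] = 0.0
    return masked, hidden

def masked_mse(prediction: np.ndarray, target: np.ndarray, hidden: np.ndarray) -> float:
    """Mean squared error computed on the hidden tokens only."""
    return float(np.mean((prediction[hidden] - target[hidden]) ** 2))

rng = np.random.default_rng(0)
image_feature = rng.standard_normal((8, 16, 16))  # hypothetical fused plane view feature
bev_feature = rng.standard_normal((8, 16, 16))    # hypothetical radar aerial view feature

# Blocks from both modalities are placed in one joint token sequence.
tokens = np.concatenate([patchify(image_feature, 4), patchify(bev_feature, 4)])
masked_tokens, hidden = random_mask(tokens, ratio=0.5, rng=rng)
```

In this sketch a reconstruction would be scored with `masked_mse(reconstruction, tokens, hidden)`, so that only the hidden blocks contribute to the error.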

Description

Semantic representation processing method and device, electronic equipment and computer storage medium

Technical Field

The present invention relates to the field of autonomous driving technology, and in particular to a semantic representation processing method, a semantic representation processing apparatus, an electronic device, and a computer readable storage medium.

Background

Recent approaches to autonomous driving tasks can be categorized into pipeline and end-to-end paradigms. The pipeline paradigm breaks driving into sequential module tasks, mainly including positioning, scene reconstruction, path planning, driving control, and the like. The end-to-end driving paradigm mostly applies state-action imitation learning or reinforcement learning, using an intermediate feature representation of the state to teach agents to act properly in a given driving environment.
However, pipeline paradigms are often designed for specific perception tasks rather than for an integrated, comprehensive state representation. Current 3D driving feature fusion methods of this kind are not well suited for comprehensive end-to-end driving tasks. In addition, the end-to-end driving paradigm requires separate branches for the different modalities, with a final fusion done by attention. This fusion is purely geometric and can hinder the performance of complex urban end-to-end driving, because geometric transformations and network downsampling can lose information that matters for autonomous driving, such as traffic lights at a distance.

Disclosure of Invention

In view of the above, embodiments of the present invention provide a processing method of a semantic representation, a processing apparatus of a semantic representation, an electronic device and a computer readable storage medium that overcome or at least partially solve the above problems.
To solve the above problems, an embodiment of the present invention discloses a semantic representation processing method, the method including:
extracting features from an image to obtain a plane view feature, and extracting features from laser radar data to obtain radar features;
performing geometric attention fusion on the plane view feature and the radar features to obtain a fused plane view feature and a radar aerial view feature;
cutting the fused plane view feature into blocks to obtain a plurality of plane view feature blocks, marking the plurality of plane view feature blocks as semantic marks, cutting the radar aerial view feature into blocks to obtain a plurality of radar aerial view feature blocks, and marking the plurality of radar aerial view feature blocks as semantic marks;
performing semantic fusion on the plane view feature blocks and the radar aerial view feature blocks based on the semantic marks to obtain a semantic representation;
and performing semantic reconstruction on the semantic representation based on a mask to obtain a reconstructed semantic representation.
In one or more embodiments, performing geometric attention fusion on the plane view feature and the radar features to obtain the fused plane view feature and the radar aerial view feature includes: performing geometric attention fusion on the plane view feature and the radar features by adopting monotone to aerial view conversion attention, to obtain the fused plane view feature and the radar aerial view feature.
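By way of a non-limiting illustration of attention-based fusion between the two modalities, the sketch below applies plain scaled dot-product cross-attention in both directions between image feature vectors and radar feature vectors. It is not the monotone to aerial view conversion attention of the embodiment, whose internal details are not reproduced here: there are no learned projections, no camera geometry, and the residual sum used to combine the features is an assumption. The names `softmax`, `cross_attention` and `to_vectors` and all shapes are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray):
    """Scaled dot-product attention.

    Each query row is replaced by a weighted average of the value rows;
    the weights are returned alongside the result.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

def to_vectors(feature: np.ndarray) -> np.ndarray:
    """Flatten a (C, H, W) feature map into H*W vectors of length C."""
    c = feature.shape[0]
    return feature.reshape(c, -1).T

rng = np.random.default_rng(0)
image_vecs = to_vectors(rng.standard_normal((32, 6, 10)))  # 60 plane view feature vectors
radar_vecs = to_vectors(rng.standard_normal((32, 8, 8)))   # 64 radar feature vectors

# Radar cells attend to the image, and image cells attend to the radar grid.
bev_from_image, w_bev = cross_attention(radar_vecs, image_vecs, image_vecs)
image_from_bev, w_img = cross_attention(image_vecs, radar_vecs, radar_vecs)

# Residual sum as the fusion rule: an assumption made for this sketch only.
fused_bev = radar_vecs + bev_from_image
fused_image = image_vecs + image_from_bev
```

Each row of `w_bev` sums to one, so every radar cell receives a convex combination of image vectors before the residual is added.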
In one or more embodiments, performing geometric attention fusion on the plane view feature and the radar features by adopting monotone to aerial view conversion attention to obtain the fused plane view feature and the radar aerial view feature includes:
cutting the plane view feature to obtain a plurality of plane view feature vectors, and cutting the radar feature to obtain a plurality of radar feature vectors;
geometrically fusing the plurality of plane view feature vectors and the plurality of radar feature vectors to obtain fused feature vectors;
and mapping the fused feature vectors into the fused plane view feature and the radar aerial view feature through an attention mechanism.
In one or more embodiments, performing semantic fusion on the plane view feature blocks and the radar aerial view feature blocks based on the semantic marks to obtain a semantic representation includes: performing spatial feature alignment on the semantic marks of the plane view feature blocks and the radar aerial view feature blocks in a preset semantic space by adopting an encoder with position embedding, to obtain an aligned semantic representation.
In one or more embodiments, performing semantic reconstruction on the semantic representation based on the mask to obtain a reconstructed semantic representation includes: masking the semantic representation according to a preset proportion by using a mask to obtain a semantic representation comprising a masked part; and reconstructing the masked part to obtain the reconstructed semantic representation.
Correspondingly, the embodiment of the invention also discloses a semantic represen