
CN-121999452-A - Multi-modal vectorization composition method and computing device


Abstract

A multi-modal vectorization composition method comprises the steps of segmenting a global map to obtain a local map, wherein the global map is composed of multi-modal data, extracting multi-modal characteristics of the local map, obtaining local instance vectors according to the multi-modal characteristics, wherein the local instance vectors are data sequences of geometric semantic information of the local map, and obtaining the global vectorization map according to the local instance vectors. Therefore, the embodiment of the application outputs the local instance vector according to the multi-mode characteristics, can automatically extract and analyze the geometric and semantic information of the map without marking, and can realize the rolling update of the intelligent driving map, thereby helping the intelligent driving vehicle to better identify the surrounding environment.

Inventors

  • GAO BIN
  • JIN HUAN
  • CAI YINGJIE
  • ZHOU KAIQIANG
  • LIU BINGBING
  • YE AIXUE
  • JIANG LIHUI
  • ZHANG HONGBO

Assignees

  • Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date
2026-05-08
Application Date
2024-11-01

Claims (20)

  1. A multi-modal vectorization composition method, the method comprising: segmenting a global map to obtain a local map, wherein the global map is composed of multi-modal data; extracting multi-modal features of the local map; obtaining a local instance vector according to the multi-modal features, wherein the local instance vector is a data sequence of geometric semantic information of the local map; and obtaining a global vectorization map according to the local instance vector, wherein the global vectorization map is an image of the geometric semantic information of the global map represented in vectorized form.
  2. The method of claim 1, wherein the obtaining a local instance vector according to the multi-modal features comprises: fusing the multi-modal features to obtain a first fusion feature; performing deep supervised learning on the first fusion feature to obtain a second fusion feature; and decoding the first fusion feature and the second fusion feature to obtain the local instance vector.
  3. The method of claim 2, further comprising: acquiring multi-modal features of surrounding images of the local map, wherein the surrounding images comprise the images above, below, to the left of, and to the right of the local map; and wherein the fusing the multi-modal features to obtain a first fusion feature comprises: performing interactive learning on the multi-modal features of the local map and the multi-modal features of the surrounding images by using a first cross-attention mechanism to obtain optimized multi-modal features; and fusing the optimized multi-modal features to obtain the first fusion feature.
  4. The method of claim 2, further comprising: performing hyperparameter learning on the multi-modal features to obtain weighting weights, wherein the weighting weights are adaptively updated according to scene changes on the local map; and wherein the fusing the multi-modal features to obtain a first fusion feature comprises: merging and encoding the multi-modal features according to the weighting weights to obtain the first fusion feature.
  5. The method of any one of claims 2-4, further comprising: performing dense task learning on the local map to obtain dense features, wherein the dense features comprise semantic segmentation features and/or depth features of the local map; and wherein the performing deep supervised learning on the first fusion feature to obtain a second fusion feature comprises: performing interactive learning on the first fusion feature and the dense features by using a second cross-attention mechanism to obtain the second fusion feature.
  6. The method of any one of claims 2-5, wherein the decoding the first fusion feature and the second fusion feature to obtain the local instance vector comprises: performing a self-attention self-learning update on the first fusion feature; and performing interactive learning on the second fusion feature and the self-learning-updated first fusion feature by using a third cross-attention mechanism to obtain the local instance vector.
  7. The method of claim 6, further comprising, after the interactive learning on the second fusion feature and the self-learning-updated first fusion feature using the third cross-attention mechanism: performing interactive learning on a prior feature and an output of the third cross-attention mechanism by using a fourth cross-attention mechanism to obtain the local instance vector, wherein the prior feature is a vectorized feature of the local map.
  8. The method of any one of claims 1-7, wherein the multi-modal features comprise scalar features, and the extracting the multi-modal features of the local map comprises: partitioning the local map to obtain a plurality of partitioned images; extracting multi-scale semantic information of each partitioned image by using an image feature extraction network of a plurality of transformers; and superimposing the multi-scale semantic information of each partitioned image to obtain the scalar features of the local map.
  9. The method of claim 8, wherein the multi-modal features further comprise vectorized features, and the extracting the multi-modal features of the local map further comprises: sampling N points of the local map and determining the coordinates of the N points; determining the categories of the N points; splicing the coordinates and the categories of the N points to obtain a vectorized instance; and performing feature extraction on the vectorized instance to obtain the vectorized features of the local map.
  10. The method of claim 9, wherein the vectorized features comprise prior features, and the performing feature extraction on the vectorized instance to obtain the vectorized features of the local map comprises: extracting coordinate features and category features of the vectorized instance through a position encoder and a category encoder; and splicing the coordinate features and the category features to obtain the prior features.
  11. The method of any one of claims 1-10, wherein the number of local instance vectors is plural, and the obtaining a global vectorization map according to the local instance vectors comprises: splicing a plurality of local instance vectors to obtain a global instance vector; and performing sparse sampling on the point coordinates of the global instance vector to obtain the global vectorization map.
  12. An apparatus for multi-modal vectorization composition, the apparatus comprising: a map segmentation module, configured to segment a global map to obtain a local map, wherein the global map is composed of multi-modal data; a feature extraction module, configured to extract multi-modal features of the local map; an instance vector prediction model, configured to obtain a local instance vector according to the multi-modal features, wherein the local instance vector is a data sequence of geometric semantic information of the local map; and a vector stitching module, configured to obtain a global vectorization map according to the local instance vector, wherein the global vectorization map is an image of the geometric semantic information of the global map represented in vectorized form.
  13. The apparatus of claim 12, wherein the instance vector prediction model comprises: a multi-modal feature fusion module, configured to fuse the multi-modal features to obtain a first fusion feature; a deep supervision module, configured to perform deep supervised learning on the first fusion feature to obtain a second fusion feature; and a sparse instance decoder, configured to decode the first fusion feature and the second fusion feature to obtain the local instance vector.
  14. The apparatus of claim 13, further comprising: a pseudo-temporal feature extraction model, configured to acquire multi-modal features of surrounding images of the local map, wherein the surrounding images comprise the images above, below, to the left of, and to the right of the local map; and wherein the multi-modal feature fusion module is further configured to: perform interactive learning on the multi-modal features of the local map and the multi-modal features of the surrounding images by using a first cross-attention mechanism to obtain optimized multi-modal features; and fuse the optimized multi-modal features to obtain the first fusion feature.
  15. The apparatus of claim 13, further comprising: a weighting model, configured to perform hyperparameter learning on the multi-modal features to obtain weighting weights, wherein the weighting weights are adaptively updated according to scene changes on the local map; and wherein the multi-modal feature fusion module is configured to merge and encode the multi-modal features according to the weighting weights to obtain the first fusion feature.
  16. The apparatus of any one of claims 13-15, further comprising: a dense task learning module, configured to perform dense task learning according to the first fusion feature to determine dense features of the local map, wherein the dense features comprise semantic segmentation features and/or depth features of the local map; and wherein the deep supervision module is configured to: perform interactive learning on the first fusion feature and the dense features by using a second cross-attention mechanism to obtain the second fusion feature.
  17. The apparatus of any one of claims 13-16, wherein the sparse instance decoder is configured to: perform a self-attention self-learning update on the first fusion feature; and perform interactive learning on the second fusion feature and the self-learning-updated first fusion feature by using a third cross-attention mechanism to obtain the local instance vector.
  18. The apparatus of claim 17, wherein the sparse instance decoder is further configured to perform interactive learning on a prior feature and an output of the third cross-attention mechanism by using a fourth cross-attention mechanism to obtain the local instance vector, wherein the prior feature is a vectorized feature of the local map.
  19. The apparatus of any one of claims 12-18, wherein the multi-modal features comprise scalar features, and the feature extraction module comprises: an image feature extractor, configured to partition the local map to obtain a plurality of partitioned images, extract multi-scale semantic information of each partitioned image by using an image feature extraction network of a plurality of transformers, and superimpose the multi-scale semantic information of each partitioned image to obtain the scalar features of the local map.
  20. The apparatus of any one of claims 12-19, wherein the multi-modal features further comprise vectorized features, and the feature extraction module is further configured to: sample N points of the local map and determine the coordinates of the N points; determine the categories of the N points; splice the coordinates and the categories of the N points to obtain a vectorized instance; and perform feature extraction on the vectorized instance to obtain the vectorized features of the local map.
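To make the vectorized-instance construction in claims 9-11 concrete, here is a minimal plain-Python sketch: sample N points from a map element's point sequence, splice coordinates with a category label into a vectorized instance, and sparsely sample point coordinates when assembling the global map. The uniform sampling rule, the integer category encoding, and the stride-based sparsification are illustrative assumptions of this sketch, not the claimed implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def sample_points(polyline: List[Point], n: int) -> List[Point]:
    """Sample n points (roughly uniformly by index) from a map element (cf. claim 9)."""
    if len(polyline) <= n:
        return list(polyline)
    step = (len(polyline) - 1) / (n - 1)
    return [polyline[round(i * step)] for i in range(n)]

def splice_instance(points: List[Point], category: int) -> List[Tuple[float, float, int]]:
    """Splice coordinates and a category label into one vectorized instance (cf. claim 9)."""
    return [(x, y, category) for x, y in points]

def sparse_sample(instance: list, stride: int) -> list:
    """Sparsely sample point coordinates when building the global map (cf. claim 11)."""
    return instance[::stride]
```

For example, a 10-point lane-line polyline sampled down to N=5 points, spliced with a category label, can then be thinned again by a stride of 2 when the global instance vector is assembled.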

Description

Multi-modal vectorization composition method and computing device

Technical Field

The present application relates to the technical field of intelligent driving, and in particular to a multi-modal vectorization composition method and a computing device.

Background

In today's era of rapid technological development, intelligent driving is steadily becoming a practical reality. The extraction of geometric semantic information is a particularly important step in intelligent driving map making. Geometric semantic information extraction refers to extracting specific geographic information from a map, including elements such as lane lines, intersection faces, and road edge boundaries; the intelligent driving vehicle performs positioning, path planning, and control according to this geographic information. At present, extracting geometric semantic information requires labeling entity categories, and rolling updates of the geometric semantic information cannot be realized.

Disclosure of Invention

With the multi-modal vectorization composition method and computing device of the present application, the geometric semantic information of a map is extracted automatically, no entity-category labeling of map pixels is needed, rolling updates of the intelligent driving map can be achieved, and the intelligent driving vehicle is helped to better identify its surroundings.
In a first aspect, an embodiment of the present application provides a multi-modal vectorization composition method. The method includes: segmenting a global map to obtain a local map, where the global map is composed of multi-modal data; extracting multi-modal features of the local map; obtaining a local instance vector according to the multi-modal features, where the local instance vector is a data sequence of geometric semantic information of the local map; and obtaining a global vectorization map according to the local instance vector, where the global vectorization map is an image of the geometric semantic information of the global map represented in vectorized form. In this way, the method obtains the local instance vector representing the geometric semantic information from the multi-modal features of the local map, so the geometric semantic information can be extracted without entity-category labeling of map pixels. Moreover, because the local instance vector carries little data and is computationally cheap, the global vectorization map can be obtained quickly from it, realizing rolling updates of the global map's geometric semantic information and helping the intelligent driving vehicle better identify its surroundings. In some embodiments, obtaining the local instance vector according to the multi-modal features includes fusing the multi-modal features to obtain a first fusion feature, performing deep supervised learning on the first fusion feature to obtain a second fusion feature, and decoding the first fusion feature and the second fusion feature to obtain the local instance vector.
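The segment-then-stitch skeleton of this first aspect can be sketched as follows. This is a minimal illustration under assumed data types (instance vectors as coordinate-tuple sequences, axis-aligned tiles); the tile size, the per-tile prediction, and the vector format are placeholders of this sketch, not the claimed implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def segment_global_map(width: int, height: int, tile: int) -> List[Tuple[int, int, int, int]]:
    """Divide a width x height global map into local-map tiles (x0, y0, x1, y1)."""
    tiles = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            tiles.append((x, y, min(x + tile, width), min(y + tile, height)))
    return tiles

def to_global(local_vector: List[Point], origin: Tuple[int, int]) -> List[Point]:
    """Shift a local instance vector (a point sequence) into global coordinates."""
    ox, oy = origin
    return [(x + ox, y + oy) for x, y in local_vector]

def stitch(local_vectors: List[List[Point]], origins: List[Tuple[int, int]]) -> List[Point]:
    """Splice per-tile local instance vectors into one global instance vector."""
    merged: List[Point] = []
    for vec, origin in zip(local_vectors, origins):
        merged.extend(to_global(vec, origin))
    return merged
```

For example, a 200x100 global map with a tile size of 100 yields two local maps, and a local point (3, 4) predicted in the second tile maps to (103, 4) in the global vectorization map.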
Thus, by fusing, supervising, and decoding the multi-modal features, the embodiment of the present application realizes automatic extraction and analysis of the map's geometric semantic information. In some embodiments, the method further includes acquiring multi-modal features of surrounding images of the local map, where the surrounding images comprise the images above, below, to the left of, and to the right of the local map; fusing the multi-modal features to obtain a first fusion feature then includes performing interactive learning on the multi-modal features of the local map and the multi-modal features of the surrounding images by using a first cross-attention mechanism to obtain optimized multi-modal features, and fusing the optimized multi-modal features. Through this fusion of pseudo-temporal features, the embodiment can enlarge the prediction field of view, strengthen the model's ability to extract semantic and geometric features, improve the network recall rate, and reproduce roads robustly. In some embodiments, the method further includes performing hyperparameter learning on the multi-modal features to obtain weighting weights that are adaptively updated according to scene changes on the local map; fusing the multi-modal features to obtain the first fusion feature then includes merging and encoding the multi-modal features according to the weighting weights. Thus the embodiment uses the weighting weights between the global features and the local features to dynamically learn