
CN-121980175-A - Multi-modal condition-driven perception and generation closed-loop optimization method, device and medium

CN121980175A

Abstract

The invention discloses a multi-modal condition-driven perception and generation closed-loop optimization method, device and medium, applied to long-tail scenes in autonomous driving. Multi-modal conditions such as text descriptions, semantic maps and 3D layouts are uniformly represented, and a causal inference network is introduced to explicitly infer the causal structure among scene elements from the multi-modal input, producing structured causal embeddings that make the generation conditions causally aware. A causal consistency mechanism is deeply integrated into the condition control and denoising process of the diffusion generation stage: through counterfactual conditions constructed by the causal inference network, the diffusion model effectively suppresses unreasonable or common-sense-violating scene generation, significantly improving the semantic rationality, spatial consistency and dynamic reliability of the long-tail data. The method not only alleviates the scarcity of long-tail scene data, but also introduces causal constraints at the generation source, markedly improving the reliability, robustness and cross-domain generalization capability of the autonomous driving system in extreme scenarios.

Inventors

  • PAN CONG
  • YANG MO

Assignees

  • 南京航空航天大学 (Nanjing University of Aeronautics and Astronautics)

Dates

Publication Date
2026-05-05
Application Date
2026-01-08

Claims (9)

  1. A multi-modal condition-driven perception and generation closed-loop optimization method, applied to long-tail scenes in autonomous driving, characterized by comprising the following steps: (1) uniformly characterizing the multi-modal conditions and fusing them into a unified condition vector; (2) introducing a scene causal inference network (CIN), embedding scene causal graph information into the multi-modal condition control, and constructing a causally aware condition control space as the input of the generation model; (3) constructing a causal consistency constrained diffusion noise scheduler (CANS), introducing causally modulated noise parameters into the denoising network, and realizing dynamic alignment with the scene causal structure during the diffusion process; (4) in the diffusion generation process, introducing the causal consistency noise scheduling and condition control mechanism to generate, in a targeted manner, long-tail scenes that conform to the multi-modal conditions and causal structure constraints; (5) analyzing the feature differences and semantic consistency of the generated results, introducing a semantic consistency discrimination mechanism, carrying out consistency checks and semantic completion on the generated scenes in combination with the causal graph and text conditions, constructing automatically labeled long-tail scene training samples, and performing data augmentation and updating of the perception model; (6) testing the model on cross-scene datasets, optimizing the generation model through difference detection, problem localization and feedback, and realizing dynamic causal calibration of the generation process.
  2. The multi-modal condition-driven perception and generation closed-loop optimization method of claim 1, wherein step (1) comprises the steps of: (11) inputting the multi-view images into a pre-trained Segment Anything Model (SAM) to obtain semantic segmentation masks, and obtaining the semantic coding features of the n views through an encoder; (12) taking a predefined BEV map semantic segmentation and 3D bounding boxes as the 3D layout conditions, projecting them onto the two-dimensional image plane through the camera parameters, and obtaining the 3D layout coding features through an encoder; (13) obtaining text encoding features using a pre-trained CLIP model; (14) mapping all multi-modal condition controls to a d-dimensional feature space and concatenating the image-related features to obtain the condition controls.
  3. The multi-modal condition-driven perception and generation closed-loop optimization method of claim 1, wherein step (2) comprises the steps of: (21) extracting an entity set based on entity detection and the semantic masks; inferring an initial causal relationship candidate set from the text description information and the three-dimensional layout conditions; the scene causal inference network CIN, with learnable parameters, outputs a weighted edge set to obtain a causal graph, which is then mapped into a vector of fixed dimension; (22) fusing the image condition control, the text condition control and the causal embedding through an attention mechanism to obtain the final condition control c.
  4. The multi-modal condition-driven perception and generation closed-loop optimization method of claim 1, wherein said step (3) comprises the steps of: (31) constructing the causal consistency constrained diffusion noise scheduler (CANS): the causal embedding vector is passed through a noise-weight mapping network to generate causally aware noise adjustment coefficients, which dynamically adjust the noise weights in the diffusion process, wherein the noise scheduling parameters are adjusted for causal consistency; (32) introducing the causally adjusted noise parameters into the UNet denoising function and, combined with a dynamic condition gating module, generating time-dependent weights corresponding respectively to the image condition, the text condition and the causal graph condition, thereby constructing the final time-step condition control vector, which is used as the control signal in the denoising network.
  5. The multi-modal condition-driven perception and generation closed-loop optimization method of claim 1, wherein said step (4) comprises the steps of: (41) a group of multi-view images x is input and first compressed by an encoder network into a low-dimensional latent representation; using the causal consistency noise scheduler in the diffusion process, a causally aware forward diffusion distribution is constructed, and a noise map is obtained through the diffusion process; (42) to ensure that the forward diffusion noise scheduling is consistent with the structure of the scene causal graph, a causal consistency constraint term is constructed based on the causal embedding vector, in which the condition vector with the causal embedding removed serves as the counterfactual reference; (43) the noise map is passed through the UNet denoising network and the decoder module, and reverse denoising is carried out in combination with the causally adjusted noise parameters to obtain the generated multi-view images; the conditional denoising function integrated in the UNet model allows a controlled synthesis process from various forms of input, with an objective function defined over the noise prediction under the condition control.
  6. The multi-modal condition-driven perception and generation closed-loop optimization method of claim 1, wherein said step (5) comprises the steps of: (51) feature difference analysis: the feature distributions of the generated domain and the target domain are compared through their respective feature means and covariance matrices; (52) semantic difference analysis: a long-tail-aware KL divergence loss term is constructed over the class distribution output by the perception model, wherein the weight coefficients are weighted by inverse class frequency; (53) to prevent the generation model from producing false samples inconsistent with the scene semantics, a semantic consistency discriminator is introduced to judge whether the multi-view images satisfy the text conditions; a high-level semantic prior is obtained by reasoning over the scene causal graph, and a semantic consistency loss is calculated between the semantic graph extracted from the generated images and the structured semantic prior graph inferred from the text and causal graph; the overall objective function combines these terms through hyperparameter weights together with a regularization term; (54) based on the resulting multi-view images, consistency checks and semantic completion are performed on the generated samples by combining the predicted semantic graphs with the semantic prior graphs; when a generated sample satisfies the semantic consistency constraint, pixel-level semantic labels and structured relation labels are automatically assigned to the generated long-tail scene sample, forming an automatically labeled long-tail scene training sample set that records the final semantic annotation results and the structured causal relation annotations explicitly recovered in each sample; (55) the automatically labeled long-tail scene sample set is mixed with the truly annotated dataset to construct an enhanced training set, and the perception model is trained or fine-tuned on the enhanced training set to obtain updated perception model parameters.
  7. The multi-modal condition-driven perception and generation closed-loop optimization method of claim 1, wherein said step (6) comprises the steps of: (61) constructing a differential sensitivity matrix for difference localization, computed over the semantic segmentation mask; (62) correcting the condition parameters of the generation model, with the adjustment quantity scaled by a meta-learning rate.
  8. A storage medium having stored thereon a computer program which, when executed by at least one processor, implements the steps of the multi-modal condition-driven perception and generation closed-loop optimization method of any one of claims 1 to 7.
  9. An electronic device comprising a memory and a processor, wherein: the memory is configured to store a computer program capable of running on the processor; and the processor is configured to perform the steps of the multi-modal condition-driven perception and generation closed-loop optimization method of any one of claims 1 to 7 when running the computer program.
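The six steps of claim 1 form a single closed loop. Its structure can be sketched as a skeleton with stub stages; every function below is a hypothetical stand-in for the corresponding claimed module, not the patent's implementation:

```python
# Closed-loop skeleton for claim 1; all stages are toy stand-ins.
def unify_conditions(inputs):      return {"c": sum(inputs)}          # step (1)
def causal_embed(cond):            cond["e"] = cond["c"] * 0.5; return cond  # step (2)
def schedule_noise(cond):          return 1.0 + 0.1 * cond["e"]      # step (3), CANS
def generate(cond, gamma):         return cond["c"] * gamma          # step (4)
def label_and_augment(sample):     return [(sample, "auto_label")]   # step (5)
def evaluate_and_feedback(data):   return len(data)                  # step (6)

cond   = causal_embed(unify_conditions([1.0, 2.0, 3.0]))
gamma  = schedule_noise(cond)
sample = generate(cond, gamma)
train  = label_and_augment(sample)
rounds = evaluate_and_feedback(train)
print(round(sample, 2), rounds)
```

The point of the sketch is only the data flow: conditions feed the causal embedding, the embedding modulates noise scheduling, generated samples are auto-labeled, and evaluation feeds back into generation.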
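Claim 2's step (1) reduces to encoding each modality and projecting everything into one d-dimensional space before concatenation. A minimal NumPy sketch, where the feature dimensions, the linear projections, and the random stand-ins for the SAM, BEV-layout and CLIP encoder outputs are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64          # shared feature dimension (assumption)
n_views = 6     # multi-view camera count (assumption)

def project(x, w):
    """Map a raw feature vector into the shared d-dim space."""
    return np.tanh(w @ x)

# Stand-ins for encoder outputs: SAM mask features per view,
# 3D-layout features, and CLIP text features (dims are assumptions).
view_feats  = [rng.normal(size=256) for _ in range(n_views)]
layout_feat = rng.normal(size=128)
text_feat   = rng.normal(size=512)

W_view, W_layout, W_text = (rng.normal(scale=0.05, size=(d, s))
                            for s in (256, 128, 512))

# Step (14): map every modality to d dims, then concatenate the
# image-related features into one unified condition control vector.
c_views  = [project(v, W_view) for v in view_feats]
c_layout = project(layout_feat, W_layout)
c_text   = project(text_feat, W_text)
condition = np.concatenate(c_views + [c_layout, c_text])
print(condition.shape)  # (n_views + 2) * d = 512
```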
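Claim 3's steps (21)-(22) amount to flattening a weighted causal edge set into a fixed-dimension embedding and fusing it with the other condition controls by attention. A toy NumPy sketch under assumed dimensions; the entity list, the random "learnable" weights, and the single-query attention are illustrative stand-ins for the CIN:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

entities = ["ego", "pedestrian", "wet_road"]  # illustrative entity set
n = len(entities)

# CIN stand-in: learnable weights yield a weighted causal edge set
# (zero diagonal: no self-causation).
edge_weights = rng.uniform(size=(n, n)) * (1 - np.eye(n))

# Map the causal graph to a fixed-dimension embedding.
W_g = rng.normal(scale=0.1, size=(d, n * n))
e_causal = np.tanh(W_g @ edge_weights.ravel())

# Step (22): fuse image, text and causal embeddings with attention.
c_img, c_txt = rng.normal(size=d), rng.normal(size=d)
stack = np.stack([c_img, c_txt, e_causal])
query = rng.normal(size=d)
scores = stack @ query / np.sqrt(d)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
c = attn @ stack                      # final condition control c
print(c.shape, float(attn.sum()))
```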
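The CANS idea in claim 4 can be illustrated as a scalar coefficient, produced from the causal embedding, that rescales a standard DDPM beta schedule, plus a softmax gate producing per-timestep weights for the three condition types. The network shapes, the (0.5, 1.5) coefficient range, and the gating logits below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 1000, 64

# Base linear beta schedule, as in standard DDPM.
betas = np.linspace(1e-4, 2e-2, T)

# Noise-weight mapping network (one hidden layer, an assumption):
# causal embedding -> scalar adjustment coefficient gamma in (0.5, 1.5).
e_causal = rng.normal(size=d)
W1 = rng.normal(scale=0.1, size=(32, d))
W2 = rng.normal(scale=0.1, size=32)
gamma = 0.5 + 1.0 / (1.0 + np.exp(-(W2 @ np.tanh(W1 @ e_causal))))

# Causally modulated schedule used by the denoising network.
betas_causal = np.clip(gamma * betas, 1e-5, 0.999)

# Dynamic condition gating: time-dependent weights for the image,
# text and causal-graph conditions at timestep t (toy logits).
def gate(t):
    logits = np.array([np.sin(t / T), np.cos(t / T), 1.0])
    w = np.exp(logits)
    return w / w.sum()

w_img, w_txt, w_causal = gate(t=500)
print(round(float(betas_causal[500]), 5), round(float(w_img + w_txt + w_causal), 6))
```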
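Claim 5's steps follow the usual latent diffusion recipe with a modulated schedule. A toy sketch of the causally scheduled forward process and a stand-in consistency term; the linear "denoiser" and the fixed coefficient gamma are placeholders for the real UNet and the CANS output:

```python
import numpy as np

rng = np.random.default_rng(3)
T, dim = 1000, 16

betas = np.linspace(1e-4, 2e-2, T)
gamma = 1.1                       # causal adjustment coefficient (assumption)
alphas_bar = np.cumprod(1.0 - np.clip(gamma * betas, 0, 0.999))

# Encoder stand-in: the "image" is already a low-dimensional latent z0.
z0 = rng.normal(size=dim)

def forward_diffuse(z0, t):
    """Causally scheduled forward diffusion q(z_t | z_0)."""
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1 - alphas_bar[t]) * eps
    return zt, eps

zt, eps = forward_diffuse(z0, t=400)

# Causal consistency term: penalise the gap between the denoiser's
# noise prediction with and without the causal embedding (the
# denoiser is a stand-in linear map here).
W = rng.normal(scale=0.1, size=(dim, dim))
eps_with, eps_without = W @ zt, (W @ zt) * 0.9
l_causal = float(np.mean((eps_with - eps_without) ** 2))
print(zt.shape, l_causal >= 0.0)
```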
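Claim 6's step (51), a mean-plus-covariance feature comparison, is the Fréchet-style distance familiar from FID, and step (52) is a KL divergence with inverse-frequency class weights. A sketch with random stand-in features and a diagonal-covariance simplification (an assumption made to keep the code dependency-free; the full form uses a matrix square root):

```python
import numpy as np

rng = np.random.default_rng(4)

# Features from the generated domain and the target domain
# (random stand-ins; real features would come from a perception model).
f_gen = rng.normal(size=(500, 8))
f_tgt = rng.normal(loc=0.3, size=(500, 8))

def frechet_diag(a, b):
    """Frechet distance between Gaussian fits with a
    diagonal-covariance simplification (an assumption)."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    var_a, var_b = a.var(0), b.var(0)
    return float(((mu_a - mu_b) ** 2).sum()
                 + (var_a + var_b - 2 * np.sqrt(var_a * var_b)).sum())

# Long-tail-aware KL divergence with inverse-frequency class weights.
counts = np.array([900, 80, 20])           # head vs tail class counts
w = 1.0 / counts
w /= w.sum()                               # inverse-frequency weights
p = counts / counts.sum()                  # model's class distribution
q = np.full(3, 1 / 3)                      # target (balanced) distribution
kl_lt = float(np.sum(w * p * np.log(p / q)))

print(frechet_diag(f_gen, f_tgt) >= 0.0, round(kl_lt, 4))
```

Note how the tail classes receive weights orders of magnitude larger than the head class, which is the mechanism the claim uses to keep rare scenes from being ignored.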
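Claim 7 can be read as: localize error per semantic class via the segmentation mask, then nudge the generator's condition parameters by a step scaled by the meta-learning rate. A toy sketch on a 4x4 "image"; the per-class mean-error sensitivity and the proportional update rule are assumptions standing in for the patent's unspecified formulas:

```python
import numpy as np

rng = np.random.default_rng(5)
H, W, C = 4, 4, 3   # tiny map and class count for illustration

# Per-pixel feature error between generated and target domains,
# localised by the semantic segmentation mask.
err  = rng.uniform(size=(H, W))
mask = rng.integers(0, C, size=(H, W))     # semantic segmentation mask

# Differential sensitivity per class: mean error inside each class region.
S = np.array([err[mask == k].mean() if (mask == k).any() else 0.0
              for k in range(C)])

# Condition-parameter correction scaled by a meta-learning rate
# (assumption: parameters shrink in proportion to class sensitivity).
theta = np.ones(C)                         # per-class condition parameters
eta_meta = 0.1
theta_new = theta - eta_meta * S
print(S.shape, bool(np.all(theta_new <= theta)))
```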

Description

Multi-modal condition-driven perception and generation closed-loop optimization method, device and medium

Technical Field

The invention relates to the technical fields of computer vision, generative models, artificial intelligence and autonomous driving, and in particular to a multi-modal condition-driven perception and generation closed-loop optimization method, device and medium.

Background

Generative models, typified by diffusion models, have opened up a new technical path for autonomous driving perception tasks in dynamic, complex environments. Such models can efficiently model complex data distributions and synthesize highly realistic visual data covering diverse traffic scenes, thereby providing more comprehensive and better-covering training data for perception models; they are particularly suited to key tasks such as long-tail scene generation and high-fidelity simulation. In recent years, diffusion-model-based methods for autonomous driving multi-source perception and controllable generation have drawn growing attention from academia and industry, and related research has achieved initial results. Through the implicit representation learning and probability-driven generation mechanisms of diffusion models, research has preliminarily realized joint modeling of multi-modal data such as LiDAR point clouds, visual images, semantic maps and text descriptions, further promoting the synthesis and application of dynamic scenes. However, a core challenge of autonomous driving remains coping with the complexity and diversity of the real world, in particular the long-tail problem of low-frequency but high-risk edge scenes and the cross-scene generalization problem of adapting perception models to different environments.
Therefore, how to construct controllable long-tail scene generation driven by coupled multi-modal conditions, so as to solve the domain adaptation problem of perception models caused by the scarcity of difficult samples, is the key scientific problem addressed by this work. The invention controls long-tail scene generation through multi-modal conditions and designs a framework for perception-generation closed-loop optimization, further enhancing the cross-scene generalization capability of the perception model on difficult samples.

Disclosure of Invention

Aiming at the problem of weak model generalization caused by the scarcity of long-tail data in autonomous driving scenes, the invention aims to design a multi-modal condition-driven perception and generation closed-loop optimization framework, to study how to control and optimize the perception and generation processes using multi-modal conditions under conditions of data scarcity and uneven distribution, and to achieve cooperative optimization of the perception and generation models through closed-loop iteration, finally enhancing the cross-scene generalization capability of the autonomous driving system.
The multi-modal condition-driven perception and generation closed-loop optimization method is applied to long-tail scenes in autonomous driving and comprises the following steps: (1) uniformly characterizing the multi-modal conditions and fusing them into a unified condition vector; (2) introducing a scene causal inference network (CIN), embedding scene causal graph information into the multi-modal condition control, and constructing a causally aware condition control space as the input of the generation model; (3) constructing a causal consistency constrained diffusion noise scheduler (CANS), introducing causally modulated noise parameters into the denoising network, and realizing dynamic alignment with the scene causal structure during the diffusion process; (4) in the diffusion generation process, introducing the causal consistency noise scheduling and condition control mechanism to generate, in a targeted manner, long-tail scenes that conform to the multi-modal conditions and causal structure constraints; (5) analyzing the feature differences and semantic consistency of the generated results, introducing a semantic consistency discrimination mechanism, carrying out consistency checks and semantic completion on the generated scenes in combination with the causal graph and text conditions, constructing automatically labeled long-tail scene training samples, and performing data augmentation and updating of the perception model; (6) testing the model on cross-scene datasets, optimizing the generation model through difference detection, problem localization and feedback, and realizing dynamic causal calibration of the generation process. Further, the step (1) includes the steps of: (11) inputting the multi-view images into a pre-trained Segment Anything Model (SAM) to obtain semantic segmentation masks, and obtaining the semantic coding features of the n views through an encoder