
CN-121980509-A - Multi-modal integrated sensing and communication method based on an artificial intelligence large model

CN 121980509 A

Abstract

The invention discloses a multi-modal integrated sensing and communication (ISAC) method based on an artificial intelligence large model, comprising the following steps: acquiring target-scene data to construct a multi-modal ISAC data input stream; extracting heterogeneous features with modality-specific encoders; performing cross-modal dynamic fusion with a mixture-of-experts framework; feeding the fused, unified features to task-specific output heads and outputting prediction results for downstream tasks; and multi-task joint training combining a task loss function with a load-balancing loss. First, features are extracted accurately and mapped to a unified dimension. Second, through a sparse activation strategy, the mixture-of-experts mechanism adapts flexibly to dynamic conditions such as degraded vision at night while greatly reducing computation cost, ensuring robust perception. Finally, the lightweight decoding heads, combined with the load-balancing training strategy, effectively prevent expert collapse and support diverse high-accuracy, low-latency ISAC tasks.

Inventors

  • XIANG LUPING
  • PENG YUBO
  • YANG KUN
  • ZHANG GUANGYE

Assignees

  • Nanjing University (南京大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-26

Claims (6)

  1. A multi-modal integrated sensing and communication method based on an artificial intelligence large model, characterized by comprising the following steps: S1, acquiring target-scene data to construct a multi-modal integrated sensing and communication data input stream; S2, extracting heterogeneous features with modality-specific encoders; S3, performing cross-modal dynamic fusion with a mixture-of-experts framework; S4, feeding the fused, unified features to a task-specific output head and outputting the prediction result of a downstream task; S5, multi-task joint training combining the task loss function and the load-balancing loss.
  2. The multi-modal integrated sensing and communication method based on the artificial intelligence large model according to claim 1, wherein in step S2 the feature extraction process comprises the following sub-steps according to the physical characteristics of the data (the complex convolution is sketched in code after the claims): S21, for visual image data, constructing an encoder based on a residual network: the input image is processed by convolution and residual blocks, the original fully connected classification layer is removed, and features are mapped into embedding vectors through a linear projection layer: $f_v = W_v \cdot \mathrm{ResNet}(x_v)$, where $x_v$ denotes the input visual image data; $\mathrm{ResNet}(\cdot)$ denotes the residual-network backbone extraction operation without the fully connected layer; $W_v$ is the weight matrix of the linear projection layer; $\cdot$ denotes matrix multiplication; and $f_v \in \mathbb{R}^D$ is the output visual feature vector, $D$ being the unified feature dimension and $\mathbb{R}$ the real number field; S22, for LiDAR point cloud data, constructing an encoder based on a point-wise multi-layer perceptron geometric feature extraction network: let the input point cloud be the point set $P = \{p_1, p_2, \dots, p_M\}$, where $M$ is the total number of points and each point $p_m \in \mathbb{R}^3$; features are extracted from each point independently by a multi-layer perceptron and a max-pooling operation is then applied so that the features are invariant to permutations of the input point order, yielding the point cloud feature vector $f_p = \mathrm{MaxPool}\big(\{\mathrm{MLP}(p_m)\}_{m=1}^{M}\big)$, where $\mathrm{MLP}$ denotes the multi-layer perceptron operation and $\mathrm{MaxPool}$ denotes the max-pooling operation; the extracted point cloud features are adjusted to the unified dimension $D$ through a fully connected layer; S23, for radio-frequency signal data, constructing a complex-valued convolutional neural network encoder comprising several complex residual blocks: let the input signal tensor be $X = X_{\mathrm{re}} + jX_{\mathrm{im}}$, where $j$ is the imaginary unit; the output $Y$ of a complex convolution layer is defined as $Y = (X_{\mathrm{re}} * W_{\mathrm{re}} - X_{\mathrm{im}} * W_{\mathrm{im}}) + j(X_{\mathrm{re}} * W_{\mathrm{im}} + X_{\mathrm{im}} * W_{\mathrm{re}}) + b$, where $X_{\mathrm{re}}$ and $X_{\mathrm{im}}$ denote the real and imaginary parts of the input complex tensor, $W_{\mathrm{re}}$ and $W_{\mathrm{im}}$ the real and imaginary parts of the complex convolution kernel, $b$ is the bias, and $*$ denotes the convolution operation; the result is then processed in turn by complex batch normalization and a nonlinear activation function, and finally the processed complex features are flattened and mapped by a projection layer to obtain the radio-frequency feature vector $f_{rf}$.
  3. The multi-modal integrated sensing and communication method based on the artificial intelligence large model according to claim 2, wherein before entering S3 a feature aggregation operation is performed: the single-modality feature vectors $f_v$, $f_p$, $f_{rf}$ obtained in steps S21 to S23 are concatenated or stacked along the feature dimension to construct the aggregated modality input feature $F$ used for the gating decision.
  4. The multi-modal integrated sensing and communication method based on the artificial intelligence large model according to claim 3, wherein step S3 comprises the following sub-steps (a gating and fusion sketch follows the claims): S31, computing expert activation probabilities through a gating network: the aggregated modality input feature $F$, which combines the information of all input modalities, is fed into the gating network to compute the gating score of each expert $i$ ($i = 1, \dots, N$); to realize sparse activation, only the $k$ experts with the highest scores are kept and the weights of the remaining experts are set to zero: $g = \mathrm{Softmax}\big(\mathrm{TopK}(F \cdot W_g)\big)$, where $F$ is the aggregated modality input feature; $W_g$ is the learnable weight matrix of the gating network; $\cdot$ denotes matrix multiplication; $\mathrm{TopK}$ denotes the operation of keeping the $k$ largest elements and setting the others to negative infinity or zero; and $\mathrm{Softmax}$ is the normalized exponential function that converts the output into a probability distribution; S32, performing the cross-attention computation inside each activated expert network: the input features are first linearly projected to obtain the attention components; letting the expert-network input be $F$, the query, key, and value vectors are computed through the projection matrices $W_Q$, $W_K$, $W_V$: $Q = FW_Q$, $K = FW_K$, $V = FW_V$; next, the cross-attention output $A$ is computed: $A = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$, where $Q$, $K$, $V$ denote the query, key, and value matrices; $T$ denotes matrix transposition; $\sqrt{d_k}$ is a scaling factor; and $\mathrm{Softmax}$ is the normalized exponential function; subsequently, the attention output $A$ is further processed through a feed-forward network and layer normalization, and a residual connection is introduced to obtain the final output feature $E_i$ of the expert network: $E_i = \mathrm{LayerNorm}\big(A + \mathrm{FFN}(A)\big)$, where $\mathrm{FFN}$ denotes a feed-forward network operation comprising linear layers and a nonlinear activation function, and $\mathrm{LayerNorm}$ denotes the layer normalization operation; S33, weighting and summing the outputs of all activated experts to obtain the final multi-modal fused feature $y$: $y = \sum_{i=1}^{N} g_i \cdot E_i(F)$, where $N$ is the total number of experts; $i$ is the expert index; $\cdot$ denotes scalar-vector multiplication; and $F$ is the aggregated modality input feature defined in step S31.
  5. The multi-modal integrated sensing and communication method based on the artificial intelligence large model according to claim 4, wherein step S4 specifically comprises: S41, applying a global average pooling or flattening operation to the multi-modal fused feature $y$ output in step S33 to obtain a one-dimensional feature vector $z$: $z = \mathrm{Flatten}(y)$, where $\mathrm{Flatten}$ denotes the operation of flattening a multi-dimensional tensor into a one-dimensional vector; S42, mapping the feature vector to the dimension of the target categories through a fully connected layer; assuming the beam prediction task has $C$ candidate beams in total, the mapping is expressed as $o = W_t z + b_t$, where $o \in \mathbb{R}^C$; $C$ is the total number of categories of the classification task; $W_t$ is the weight matrix of the task head; and $b_t$ is the bias vector; S43, selecting the activation function according to the task type: for a multi-class task, the $\mathrm{Softmax}$ function outputs the predictive probability distribution $\hat{p}$, and the predicted probability $\hat{p}_u$ of the $u$-th beam index is computed as $\hat{p}_u = \frac{e^{o_u}}{\sum_{j=1}^{C} e^{o_j}}$, where $u$ and $j$ are both beam-class indices, $u, j \in \{1, \dots, C\}$; $o_u$ denotes the $u$-th element of the vector $o$; and $e$ is the natural constant; the beam index with the maximum probability is thereby determined as the final prediction result.
  6. The multi-modal integrated sensing and communication method based on the artificial intelligence large model according to claim 5, wherein step S5 specifically comprises (a training-loss sketch follows the claims): S51, measuring the difference between the model prediction and the true label with a cross-entropy loss function: $\mathcal{L}_{\mathrm{task}} = -\sum_{c=1}^{C} y_c \log \hat{p}_c$, where $C$ is the total number of categories and $c$ is the category index; $y_c$ is the $c$-th element of the one-hot encoding vector of the true label; and $\hat{p}_c$ is the $c$-th element of the prediction probability output in step S43; S52, computing a load-balancing term, based on a coefficient of variation or an auxiliary loss, to prevent expert collapse: define $P_i$ as the frequency with which the $i$-th expert is selected in the current batch and $\bar{g}_i$ as the average gating probability of that expert; the load-balancing loss is intended to minimize the difference between these two distributions: $\mathcal{L}_{\mathrm{balance}} = N \sum_{i=1}^{N} P_i \, \bar{g}_i$, where $N$ is the total number of experts; the loss reaches its minimum when all experts are selected uniformly and with equal probabilities; S53, weighting and summing the two losses to construct the total objective function $\mathcal{L}$: $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda \mathcal{L}_{\mathrm{balance}}$, where $\lambda$ is a hyper-parameter adjusting the importance of load balancing; the gradient of $\mathcal{L}$ with respect to the network parameters is computed and the model is updated with the Adam optimizer until the loss function converges.
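As an illustration of the complex convolution in claim 2 (S23), the operation can be realized with two real-valued convolutions over separate real and imaginary tensors. A minimal PyTorch sketch; the tensor layout, kernel size, and layer names are illustrative assumptions, not fixed by the patent:

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution per claim 2 (S23):
    Y = (X_re*W_re - X_im*W_im) + j*(X_re*W_im + X_im*W_re) + b.
    Real and imaginary parts are carried as two real tensors."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.w_re = nn.Conv2d(in_ch, out_ch, k, padding=padding, bias=False)
        self.w_im = nn.Conv2d(in_ch, out_ch, k, padding=padding, bias=False)
        self.b_re = nn.Parameter(torch.zeros(out_ch))  # complex bias b, real part
        self.b_im = nn.Parameter(torch.zeros(out_ch))  # complex bias b, imaginary part

    def forward(self, x_re, x_im):
        # Real part of Y: X_re*W_re - X_im*W_im + Re(b)
        y_re = self.w_re(x_re) - self.w_im(x_im) + self.b_re.view(1, -1, 1, 1)
        # Imaginary part of Y: X_re*W_im + X_im*W_re + Im(b)
        y_im = self.w_im(x_re) + self.w_re(x_im) + self.b_im.view(1, -1, 1, 1)
        return y_re, y_im
```

Per the claim, such layers would be stacked into complex residual blocks, followed by complex batch normalization, a nonlinear activation, flattening, and a projection to $f_{rf}$.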
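Claim 4's sparse gating and expert fusion (S31 to S33) can be sketched as follows. The expert uses PyTorch's MultiheadAttention with Q, K, V projected from the same expert input, as the claim describes; the dimensions, head count, and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class Expert(nn.Module):
    """One expert: attention output A, then LayerNorm(A + FFN(A)), per S32."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                     # x: (B, D) aggregated feature
        t = x.unsqueeze(1)                    # (B, 1, D) length-1 token sequence
        a, _ = self.attn(t, t, t)             # Q, K, V projected from the same input
        a = a.squeeze(1)
        return self.norm(a + self.ffn(a))     # residual connection + LayerNorm

def moe_fuse(F_agg, W_g, experts, k=2):
    """Sparse mixture-of-experts fusion per S31/S33.
    F_agg: (B, D) aggregated modality feature (claim 3); W_g: (D, N) gating matrix;
    experts: list of N modules mapping (B, D) -> (B, D)."""
    scores = F_agg @ W_g                      # S31: gating scores, shape (B, N)
    topv, topi = scores.topk(k, dim=-1)       # keep the k highest-scoring experts
    masked = torch.full_like(scores, float('-inf'))
    masked.scatter_(-1, topi, topv)           # others -> -inf, i.e. zero after softmax
    g = Fn.softmax(masked, dim=-1)            # sparse gate weights g_i
    y = torch.zeros_like(F_agg)
    for i, expert in enumerate(experts):      # S33: weighted sum over active experts
        gi = g[:, i:i + 1]                    # (B, 1) gate for expert i
        active = gi.squeeze(1) > 0
        if active.any():                      # sparse activation: skip unselected experts
            y[active] += gi[active] * expert(F_agg[active])
    return y, g, topi
```

For example, `moe_fuse(F_agg, W_g, [Expert(256) for _ in range(8)], k=2)` activates only 2 of 8 experts per sample, which is the source of the computation savings the abstract claims.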
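Claim 6's joint objective (S51 to S53) can be sketched as below. The claim does not fix the exact form of the load-balancing term; this sketch assumes the common Switch-Transformer-style auxiliary loss, which matches the claim's description of a term minimized when experts are selected uniformly with equal probability:

```python
import torch

def total_loss(logits, labels, g, topk_idx, lam=0.01):
    """Joint objective per claim 6: L = L_task + lambda * L_balance.
    logits:   (B, C) task-head outputs (S42); cross_entropy applies log-softmax
              internally, matching S43's softmax followed by S51's cross-entropy.
    labels:   (B,)   integer class labels (true beam indices)
    g:        (B, N) gating probabilities from S31
    topk_idx: (B, k) indices of the activated experts"""
    # S51: cross-entropy between prediction and one-hot label
    l_task = torch.nn.functional.cross_entropy(logits, labels)

    # S52: auxiliary load-balancing loss N * sum_i P_i * gbar_i (assumed form),
    # P_i = selection frequency of expert i in the batch, gbar_i = mean gate prob.
    B, N = g.shape
    routed = torch.zeros_like(g)
    routed.scatter_(1, topk_idx, 1.0)            # 1 where an expert was selected
    P = routed.mean(dim=0) / topk_idx.shape[1]   # per-expert selection frequency
    gbar = g.mean(dim=0)                         # per-expert mean gating probability
    l_balance = N * torch.sum(P * gbar)

    # S53: weighted sum; lam is the hyper-parameter lambda
    return l_task + lam * l_balance
```

Training then proceeds as S53 describes: call `loss.backward()` and step a `torch.optim.Adam` optimizer until the loss converges.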

Description

Multi-modal integrated sensing and communication method based on an artificial intelligence large model

Technical Field

The invention relates to the technical field of multi-modal sensing and communication, and in particular to a multi-modal integrated sensing and communication (ISAC) method based on an artificial intelligence large model.

Background

With the development of the Internet of Everything, multi-modal ISAC has become a key enabling technology. Conventional single-modality sensing or communication methods have inherent limitations: visual perception is susceptible to illumination and occlusion, while wireless perception is often inadequate in resolution. By jointly exploiting heterogeneous modalities such as radio frequency (RF), radar, vision, and laser radar (LiDAR), a multi-modal ISAC system can use the redundancy and complementarity of information to significantly enhance perception accuracy and communication robustness in complex scenes, which is very important for safety-critical applications such as autonomous driving, smart cities, and disaster response.

Although multi-modal fusion has great potential, existing fusion schemes generally rely on static modality combinations and can hardly adapt to the dynamically changing Internet-of-Everything environment. Meanwhile, artificial intelligence large models, with their strong feature abstraction and generalization capability, offer a new opportunity to solve these problems. However, directly applying existing multi-modal large models to an ISAC system faces serious challenges: most existing multi-modal large models are optimized for traditional data modalities (such as natural language, RGB images, audio, and video) and lack native support for 6G-related modalities (such as RF signals and radar point clouds), and the high inference complexity brought by the huge parameter volume of large models conflicts with the strict low-latency, low-energy requirements of an ISAC system.

Based on the above analysis, the core challenges faced by the prior art can be summarized as follows: (1) Insufficient support for heterogeneous modalities: data in 6G scenarios (such as RF signals and LiDAR point clouds) have unique physical characteristics and spatio-temporal resolutions, and existing multi-modal foundation models (such as CLIP and GPT-4V) cannot directly process such non-traditional modality data, making feature extraction and alignment difficult. (2) Poor adaptation to dynamic environments: in realistic deployments the sensor configuration is highly dynamic in space and time (e.g., visual failure due to day-night alternation, uneven sensor distribution across regions), and existing fixed fusion strategies cannot flexibly handle arbitrary, time-varying modality combinations as input. (3) The contradiction between computation cost and resource constraints: large models perform excellently, but full-parameter inference is extremely expensive; in edge computing or real-time communication scenarios it is difficult to achieve efficient inference within a limited resource budget, which limits practical deployment. Therefore, there is a need to develop a multi-modal ISAC method based on artificial intelligence large models to solve the above problems.
Disclosure of Invention

The invention aims to solve the above problems and designs a multi-modal integrated sensing and communication method based on an artificial intelligence large model. The invention realizes this purpose through the following technical scheme.

A multi-modal integrated sensing and communication method based on an artificial intelligence large model comprises the following steps: S1, acquiring target-scene data to construct a multi-modal ISAC data input stream; S2, extracting heterogeneous features with modality-specific encoders; S3, performing cross-modal dynamic fusion with a mixture-of-experts framework; S4, feeding the fused, unified features to a task-specific output head and outputting the prediction result of a downstream task; S5, multi-task joint training combining the task loss function and the load-balancing loss.

Specifically, in step S2, the feature extraction process comprises the following sub-steps according to the physical characteristics of the data (the first two encoders are sketched in code below): S21, for visual image data, constructing an encoder based on a residual network: the input image is processed by convolution and residual blocks, the original fully connected classification layer is removed, and features are mapped into embedding vectors through a linear projection layer: $f_v = W_v \cdot \mathrm{ResNet}(x_v)$, where $x_v$ denotes the input visual image data, $\mathrm{ResNet}(\cdot)$ denotes the residual-network backbone extraction operation without the fully connected layer, and $W_v$ is the weight matrix of the linear projection layer.
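As a sketch of sub-steps S21 and S22, the visual and point cloud encoders might look as follows in PyTorch; the resnet18 backbone, hidden sizes, and unified dimension D = 256 are illustrative assumptions rather than values fixed by the disclosure:

```python
import torch
import torch.nn as nn
import torchvision

class VisualEncoder(nn.Module):
    """S21: ResNet backbone with its FC classifier removed, plus a linear
    projection W_v mapping features to the unified dimension D."""
    def __init__(self, d=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop fc layer
        self.proj = nn.Linear(512, d)      # W_v; 512 is resnet18's feature width

    def forward(self, x_v):                # x_v: (B, 3, H, W) input image
        h = self.backbone(x_v).flatten(1)  # (B, 512) backbone features
        return self.proj(h)                # f_v in R^D

class PointCloudEncoder(nn.Module):
    """S22: per-point MLP followed by max-pooling (permutation invariant),
    then a fully connected layer adjusting to the unified dimension D."""
    def __init__(self, d=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 256), nn.ReLU())
        self.fc = nn.Linear(256, d)

    def forward(self, pts):                # pts: (B, M, 3) point set P
        h = self.mlp(pts)                  # independent per-point features (B, M, 256)
        h = h.max(dim=1).values            # max-pool over points: order-invariant
        return self.fc(h)                  # f_p in R^D
```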