CN-121980156-A - Multi-modal data intelligent analysis and decision-making system and method based on generative AI

Abstract

The invention discloses a multi-modal data intelligent analysis and decision-making system and method based on generative AI, relating to the technical field of multi-modal data processing. The system comprises a pre-alignment feature extraction module, a spatio-temporal encoder, a semantic tag binding module and a data processing module. The spatio-temporal encoder is configured to perform timestamp synchronization on input multi-modal data comprising images, text and audio. OCR recognition errors in blurred scanned documents are repaired using a residual compensation mechanism, features are extracted, and a dimension-normalized group of feature vectors is output. A dual-channel adversarial projection network is constructed, in which the main channel performs feature-space mapping through a cross-attention weight matrix and the auxiliary channel introduces a modality difference loss function to calibrate the feature distribution. A generative decision engine integrates a multi-modal large language model and a causal reasoning layer to generate a decision scheme from the fused features, and a credibility verification unit backtracks through a knowledge graph to check the consistency of the decision logic chain. The invention reduces modality bias in heterogeneous data fusion and improves decision efficiency and reliability.

Inventors

ZHANG JINGSHENG

Assignees

杭州盟爵科技有限公司

Dates

Publication Date: 20260505
Application Date: 20251127

Claims (10)

1. A multi-modal data intelligent analysis and decision-making system based on generative AI, comprising: a pre-alignment feature extraction module, which configures a spatio-temporal encoder, performs timestamp synchronization and semantic tag binding on input multi-modal image, text and audio data, repairs OCR recognition errors in blurred scanned documents using a residual compensation mechanism, extracts features, and outputs a dimension-normalized group of feature vectors; a heterogeneous projection module, which constructs a dual-channel adversarial projection network in which the main channel performs feature-space mapping through a cross-attention weight matrix; and a generative decision engine, which integrates a multi-modal large language model and a causal reasoning layer, generates a decision scheme based on the fused features, and deploys a dynamic credibility verification unit that backtracks through a knowledge graph to verify the consistency of the decision logic chain.
2. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 1, wherein the pre-alignment feature extraction module specifically comprises: timestamp parsing and alignment, in which the time metadata of the input data stream is parsed: for images, the capture timestamp is extracted from the EXIF information; for text, the ISO 8601 time code in the log header is parsed; for audio, frames are marked according to the PTS in the audio frame header; and, with the audio stream as the reference time axis, the dynamic offset of the image and text streams is calculated using a dynamic time warping algorithm; and semantic tag binding, in which a pre-trained semantic classifier is called to label images with scene tags, text with entity types and audio with event tags, and, for cross-modal tag association, a mapping relation is created through a shared tag pool to generate a unique semantic identifier.
3. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 2, wherein the pre-alignment feature extraction module further comprises: the residual compensation mechanism, comprising blurred text region detection and conditional generative adversarial network reconstruction; blurred text region detection: curvature mutation points of the character skeleton map are calculated using an improved stroke continuity analysis algorithm, and a region is marked as blurred when a plurality of consecutive pixel points are marked; conditional generative adversarial network reconstruction: the generator G comprises an encoder, a decoder and a TransposedCNN, with local features extracted by 7 convolutional layers and residual skip connections introduced; the discriminator D takes as input an image that concatenates the generated text region with the original document layout, and adversarial training is stabilized through spectral normalization; the loss function comprises a reconstruction loss, an adversarial loss and a layout consistency loss.
4. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 3, wherein the pre-alignment feature extraction module further comprises: feature extraction, comprising an image branch, a text branch and an audio branch, wherein the image branch uses an EfficientNet-B5 backbone network to output a multi-dimensional feature vector F_v, and the audio branch feeds a Mel spectrogram into ConvNeXt-Tiny and outputs a multi-dimensional vector F_a through global average pooling; and dimension normalization, in which dynamic weight projection outputs a normalized feature vector group F whose vectors are unified to 512 dimensions with an L2 norm of 1.
5. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 4, wherein the heterogeneous projection module specifically comprises: constructing the dual-channel adversarial projection network, in which the main channel performs feature-space mapping through a cross-attention weight matrix; and modal feature decoupling: the normalized feature group output by the pre-alignment module is received, and each modality feature is decomposed by a decoupling encoder into a public component, which carries cross-modal shared semantic information, and a private component, which preserves modality-specific details; under the constraint of a decoupling loss function, the components are recombined to construct a public feature matrix and a concatenation of the private features.
6. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 5, wherein the heterogeneous projection module further comprises: the main channel, which for feature-space mapping computes query-key-value pairs to generate a cross-attention matrix, performs weighted fusion of the public components, and performs dimension-reduction aggregation; the auxiliary channel, built as a modality classifier, designed as a three-layer MLP classifier that takes the public components as input and outputs modality class probabilities; and dual-channel joint optimization, comprising feature concatenation and normalization, multi-objective joint training with calculation of a total loss function, and a gradient reversal layer that inverts the sign of the gradient of the adversarial loss during back-propagation through the auxiliary channel.
7. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 6, wherein the generative decision engine specifically comprises: fused feature decoding: prompt engineering encapsulation, which receives the fused feature vector output by the heterogeneous projection module and constructs a structured prompt template that states the current multi-modal feature encoding and requests a decision scheme based on that feature, the scheme being required to include a key event description, an execution action sequence and an expected impact analysis; and multi-modal large language model inference, which adopts an MLLM with a mixture-of-experts (MoE) architecture whose base model is LLaVA-1.5 13B with image-text alignment pre-training, wherein expert routing dynamically activates experts according to the modality weights of the fused feature vector, the experts comprising a visual expert that processes spatial relationship descriptions, a logic expert that generates action sequences and a risk expert that evaluates decision results, and wherein the temperature coefficient is adaptively adjusted.
8. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 7, wherein the generative decision engine further comprises: a causal reasoning layer: the decision scheme is parsed into a causal graph, in which a semantic parser extracts the causal elements of the decision scheme, comprising cause nodes, effect nodes and intermediate variables, and a directed acyclic graph is constructed; do-calculus is used for executability verification, intervention effects are calculated, counterfactual reasoning is performed, and a corrected decision is output.
9. The multi-modal data intelligent analysis and decision-making system based on generative AI of claim 8, wherein the generative decision engine further comprises: dynamic credibility verification, comprising: knowledge graph triple alignment, in which triples are extracted from the decision scheme, each comprising a subject (device number or personnel ID), a predicate (action or state) and an object (parameter value); graph neural network matching and verification, in which sub-graph matching is executed in a domain knowledge graph, consistency is scored and feedback correction is performed; and adversarial example defense, in which random perturbation is added to the input fused feature vector and the inference chain is judged to be fragile if the change rate of the decision scheme exceeds a preset proportion.
10. A multi-modal data intelligent analysis and decision-making method based on generative AI, comprising the following steps: inputting image, text and audio multi-modal data streams, aligning them to a common time axis through a dynamic time warping algorithm, generating a cross-modal tag group using a semantic binding model, and performing stroke-curvature-based OCR residual compensation on the text data; decomposing each modality feature into a public component and a private component, the main channel performing weighted fusion of the public components through a cross-attention matrix and the auxiliary channel minimizing the modality classification loss through a gradient reversal layer; and inputting the fused features into a mixture-of-experts model to generate a draft decision, performing dynamic backtracking verification by extracting decision triples and matching them against knowledge graph subgraphs, and, if the similarity is less than a preset threshold, injecting counterfactual conditions to regenerate the decision.
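
The knowledge-graph check in the final step of claim 10 can be illustrated with a minimal sketch. It assumes the decision triples have already been extracted, and it replaces the graph-neural-network sub-graph matching of claim 9 with a plain exact-match lookup, so it shows only the thresholding logic. The names `consistency_score` and `needs_regeneration`, the default threshold of 0.8 and the treatment of an empty triple list are hypothetical and do not come from the patent.

```python
from typing import Iterable, Set, Tuple

# A triple is (subject, predicate, object), e.g. (device number, state, value).
Triple = Tuple[str, str, str]


def consistency_score(decision_triples: Iterable[Triple],
                      knowledge_graph: Set[Triple]) -> float:
    """Fraction of decision triples that also appear in the knowledge graph.

    Assumption: exact set membership stands in for sub-graph matching.
    An empty decision is scored 0.0 because nothing can be verified.
    """
    triples = list(decision_triples)
    if not triples:
        return 0.0
    matched = sum(1 for t in triples if t in knowledge_graph)
    return matched / len(triples)


def needs_regeneration(decision_triples: Iterable[Triple],
                       knowledge_graph: Set[Triple],
                       threshold: float = 0.8) -> bool:
    """True when the similarity is below the preset threshold.

    The value 0.8 is an illustrative default; the patent only says "preset".
    """
    return consistency_score(decision_triples, knowledge_graph) < threshold
```

With a graph holding `("pump-07", "action", "shutdown")` and `("pump-07", "state", "overheated")`, a draft that asserts the shutdown but also `("pump-07", "state", "normal")` scores 0.5 and would be sent back for regeneration under the assumed threshold.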

Description

Multi-modal data intelligent analysis and decision-making system and method based on generative AI

Technical Field

The invention relates to the technical field of multi-modal data processing, and in particular to a multi-modal data intelligent analysis and decision-making system and method based on generative AI.

Background

The popularity of the Transformer architecture has driven a leap forward in multi-modal content generation capability. In recent years, multi-modal large models have achieved joint image-text modeling through cross-modal alignment learning, but still suffer from core problems such as hallucination. Scenarios such as the industrial Internet and smart cities produce massive heterogeneous data, and traditional single-modality analysis cannot meet complex decision-making requirements. For example, the military field needs to integrate intelligence texts, satellite images and radar signals for comprehensive assessment. The development of domestic GPUs and distributed training frameworks has made it possible to train and deploy models with billions of parameters. Advances in edge computing have brought real-time decision-making systems into practical deployment.

A multi-modal data intelligent analysis and decision-making system and method based on generative AI is therefore needed. Such a system faces deep bottlenecks in multi-modal data fusion. First, heterogeneous features differ greatly: the feature spaces of images, text and audio differ markedly, and direct concatenation easily causes semantic confusion. Second, data alignment is difficult: temporal and semantic misalignment of multi-modal data is common; for example, OCR recognition errors in blurred scanned documents may interfere with knowledge graph construction.
Disclosure of Invention

To solve the above technical problems, the present technical scheme provides a multi-modal data intelligent analysis and decision-making system and method based on generative AI, which addresses the large differences between heterogeneous features and the difficulty of data alignment.

To achieve the above purpose, the invention adopts the following technical scheme:

A multi-modal data intelligent analysis and decision-making system based on generative AI, comprising: a pre-alignment feature extraction module, which configures a spatio-temporal encoder, performs timestamp synchronization and semantic tag binding on input multi-modal image, text and audio data, repairs OCR recognition errors in blurred scanned documents using a residual compensation mechanism, extracts features, and outputs a dimension-normalized group of feature vectors; a heterogeneous projection module, which constructs a dual-channel adversarial projection network in which the main channel performs feature-space mapping through a cross-attention weight matrix; and a generative decision engine, which integrates a multi-modal large language model and a causal reasoning layer, generates a decision scheme based on the fused features, and deploys a dynamic credibility verification unit that backtracks through a knowledge graph to verify the consistency of the decision logic chain.
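
The timestamp synchronization performed by the pre-alignment feature extraction module is specified in claim 2 as dynamic time warping with the audio stream as the reference time axis. A minimal sketch of classic DTW over two timestamp lists follows. The function names `dtw_path` and `offsets`, the absolute-difference local cost and the sample values are assumptions for illustration, not details given in the patent.

```python
from math import inf


def dtw_path(reference, stream):
    """Align `stream` to `reference` with classic dynamic time warping.

    Both arguments are non-empty lists of timestamps in seconds. Returns
    (total_cost, path), where path is a list of (reference_index,
    stream_index) pairs. The local cost is the absolute time difference
    (an assumption for this sketch).
    """
    n, m = len(reference), len(stream)
    # cost[i][j]: minimal cumulative cost aligning reference[:i] with stream[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(reference[i - 1] - stream[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def offsets(reference, stream, path):
    """Offset (reference time minus stream time) for each aligned pair."""
    return [reference[i] - stream[j] for i, j in path]
```

For example, aligning image timestamps `[0.2, 1.2, 2.2, 3.2]` to audio timestamps `[0.0, 1.0, 2.0, 3.0]` yields the diagonal path and an offset of about -0.2 s for every frame, which is the per-frame correction that would be applied to the image stream.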
Preferably, the pre-alignment feature extraction module specifically comprises: timestamp parsing and alignment, in which the time metadata of the input data stream is parsed: for images, the capture timestamp is extracted from the EXIF information; for text, the ISO 8601 time code in the log header is parsed; for audio, frames are marked according to the PTS in the audio frame header; and, with the audio stream as the reference time axis, the dynamic offset of the image and text streams is calculated using a dynamic time warping algorithm; and semantic tag binding, in which a pre-trained semantic classifier is called to label images with scene tags, text with entity types and audio with event tags, and, for cross-modal tag association, a mapping relation is created through a shared tag pool to generate a unique semantic identifier.

Preferably, the pre-alignment feature extraction module further comprises: the residual compensation mechanism, comprising blurred text region detection and conditional generative adversarial network reconstruction; blurred text region detection: curvature mutation points of the character skeleton map are calculated using an improved stroke continuity analysis algorithm, and a region is marked as blurred when a plurality of consecutive pixel points are marked; conditional generative adversarial network reconstruction: the generator G comprises an encoder, a decoder and a TransposedCNN, with local features extracted by 7 convolutional layers and residual skip connections introduced; the discriminator D takes as input an image that concatenates the generated text region with the original document layout, and adversarial training is stabilized through spectral normalization; the loss func