CN-121982464-A - Multi-mode large-model driven complex scene image data processing method and system

CN 121982464 A

Abstract

The invention relates to the technical field of artificial intelligence and image processing, and discloses a multi-modal large-model-driven complex scene image data processing method and system. In the method, target image data is acquired, multi-scale semantic features are extracted, feature encoding is performed based on a scene context dependency graph, an optimized feature representation is generated through attention-directed feature fusion and multi-level semantic reasoning, and a target processing operation is executed. The invention can effectively improve the accuracy of complex scene image understanding, enhance feature expression capability, improve model performance in spatial relation reasoning and semantic analysis, and is suitable for complex visual scene analysis and decision-making.

Inventors

  • LIANG JINE

Assignees

  • Beijing Huasheng Henghui Technology Co., Ltd. (北京华盛恒辉科技有限公司)

Dates

Publication Date
2026-05-05
Application Date
2026-01-26

Claims (10)

  1. A multi-modal large-model-driven complex scene image data processing method, characterized by comprising the following steps: acquiring target image data to be processed, and performing cross-modal feature extraction and feature analysis on the target image data to obtain a multi-scale semantic feature representation; determining a scene context dependency graph based on the multi-scale semantic feature representation, and explicitly encoding spatial association relationships and semantic dependency relationships in the multi-scale semantic feature representation based on the scene context dependency graph to obtain a context-enhanced feature representation; performing weighted aggregation on the context-enhanced feature representation through attention-directed feature fusion according to the context-enhanced feature representation and a preset task constraint condition, to generate a fused feature vector; performing multi-level semantic reasoning on the fused feature vector, and iteratively adjusting weights of different semantic dimensions in the fused feature vector through dynamic feature recalibration, to obtain a task-oriented optimized feature representation; and executing a target processing operation based on the task-oriented optimized feature representation to obtain a processing result, and adaptively updating the scene context dependency graph using quality-evaluation feedback information of the processing result.
  2. The method of claim 1, wherein acquiring target image data to be processed and performing cross-modal feature extraction and feature analysis on the target image data to obtain a multi-scale semantic feature representation comprises: performing multi-modal information separation on the target image data, decomposing the target image data into a visual-spatial information component and a semantic content information component; respectively performing feature encoding on the visual-spatial information component and the semantic content information component, and performing feature extraction at different spatial resolution levels through cross-scale feature sampling to obtain a visual-spatial feature sequence and a semantic content feature sequence; determining a cross-modal semantic correspondence between the visual-spatial feature sequence and the semantic content feature sequence, and aligning spatial position information in the visual-spatial feature sequence with semantic category information in the semantic content feature sequence through bidirectional semantic mapping to obtain cross-modal alignment features; and performing multi-scale feature fusion on the cross-modal alignment features based on the cross-modal semantic correspondence, aggregating the cross-modal alignment features of different spatial resolution levels according to scale-level relations to generate the multi-scale semantic feature representation.
  3. The method of claim 1, wherein determining a scene context dependency graph based on the multi-scale semantic feature representation and explicitly encoding spatial association relationships and semantic dependency relationships in the multi-scale semantic feature representation based on the scene context dependency graph to obtain a context-enhanced feature representation comprises: extracting a semantic node set from the multi-scale semantic feature representation, determining semantic association strengths between semantic nodes in the semantic node set through semantic similarity calculation, and establishing connection relationships between the semantic nodes based on the semantic association strengths to obtain an initial scene graph structure; performing spatial position analysis on the semantic node set in the initial scene graph structure, and determining spatial association strengths between semantic nodes through spatial proximity calculation; adding the spatial association strengths as edge weights to the connection relationships of the initial scene graph structure to obtain the scene context dependency graph; performing graph convolution encoding on the multi-scale semantic feature representation based on the connection relationships in the scene context dependency graph, and performing feature aggregation using the semantic association strength and spatial association strength of each semantic node in the scene context dependency graph as propagation weights, to obtain graph-encoded features; and adding the spatial association relationships and semantic dependency relationships encoded in the graph-encoded features to the multi-scale semantic feature representation through a residual connection, to generate the context-enhanced feature representation.
  4. The method of claim 3, wherein extracting a semantic node set from the multi-scale semantic feature representation, determining semantic association strengths between semantic nodes in the semantic node set through semantic similarity calculation, and establishing connection relationships between semantic nodes based on the semantic association strengths to obtain an initial scene graph structure comprises: performing semantic cluster analysis on the multi-scale semantic feature representation, merging feature vectors with common semantic attributes in the multi-scale semantic feature representation into semantic clusters, and taking the central feature vector of each semantic cluster as a candidate semantic node, to obtain a candidate semantic node set; performing semantic saliency evaluation on the candidate semantic node set, determining a saliency score for each candidate semantic node, and screening the candidate semantic node set based on the saliency scores to obtain the semantic node set; performing pairwise semantic association calculation on the semantic nodes in the semantic node set, determining a correspondence of each semantic node in a semantic subspace through feature-space vector projection, and determining the semantic association strength based on the projection distance and direction consistency of the correspondence; and setting a connection threshold based on the semantic association strengths, establishing directed edge connections for semantic node pairs whose semantic association strength exceeds the connection threshold, and taking the semantic association strength as the edge weight of each directed edge connection, to obtain the initial scene graph structure.
  5. The method of claim 1, wherein performing weighted aggregation on the context-enhanced feature representation through attention-directed feature fusion according to the context-enhanced feature representation and a preset task constraint condition to generate a fused feature vector comprises: performing semantic analysis on the preset task constraint condition, extracting key constraint elements from the task constraint condition, and mapping the key constraint elements into constraint feature codes to obtain a task constraint feature representation; performing cross-attention computation on the context-enhanced feature representation based on the task constraint feature representation, generating an attention weight distribution by computing the semantic matching degree between each feature component in the context-enhanced feature representation and the task constraint feature representation, and determining the contribution of each feature component in the feature fusion process according to the attention weight distribution; performing weighted aggregation on the feature components in the context-enhanced feature representation based on the attention weight distribution, taking the contributions as weighting coefficients in a weighted summation over the feature components, to obtain a weighted aggregation feature; and performing a nonlinear transformation on the weighted aggregation feature, projecting it into a target feature space, to generate the fused feature vector.
  6. The method of claim 1, wherein performing multi-level semantic reasoning on the fused feature vector and iteratively adjusting weights of different semantic dimensions in the fused feature vector through dynamic feature recalibration to obtain a task-oriented optimized feature representation comprises: performing semantic dimension decomposition on the fused feature vector, dividing the fused feature vector into a plurality of semantic subspaces along the semantic dimension, determining a reasoning path for each semantic subspace, and performing multi-level semantic reasoning on the fused feature vector through parallel processing of the plurality of reasoning paths, to obtain a multi-path reasoning result; performing semantic consistency assessment on the multi-path reasoning result, determining a reliability score for each reasoning path by calculating the degree of semantic consistency among the output results of the reasoning paths, and generating an initial weight vector based on the reliability scores; recalibrating different semantic dimensions in the fused feature vector based on the initial weight vector, to generate a recalibrated feature vector; performing residual feedback on the recalibrated feature vector and the multi-path reasoning result to update the initial weight vector, to obtain a target weight vector; and constructing a semantic dimension reorganization mapping based on the target weight vector, selectively enhancing and suppressing the feature components of each semantic dimension in the fused feature vector through the semantic dimension reorganization mapping, to generate the task-oriented optimized feature representation.
  7. The method of claim 6, wherein performing semantic dimension decomposition on the fused feature vector, dividing the fused feature vector into a plurality of semantic subspaces along the semantic dimension, determining a reasoning path for each semantic subspace, and performing multi-level semantic reasoning on the fused feature vector through parallel processing of the plurality of reasoning paths to obtain a multi-path reasoning result comprises: performing semantic attribute analysis on the fused feature vector, and determining feature components with different semantic attributes in the fused feature vector; decomposing the fused feature vector into a plurality of semantic subspaces along the semantic dimension according to the semantic attribute types of the feature components, wherein each semantic subspace corresponds to one semantic attribute type; analyzing the semantic dependency relationships and logical transfer relationships among the feature components in each semantic subspace, and determining the network topology of a semantic reasoning network; determining a reasoning path for each semantic subspace based on the network topology of the semantic reasoning network, wherein each reasoning path corresponds to a semantic propagation link from input features to reasoning output; performing parallel reasoning on the reasoning paths, performing semantic propagation and logical reasoning on the feature components in the semantic subspace corresponding to each reasoning path, to obtain the reasoning output of each path; and performing semantic fusion on the reasoning outputs of the paths, aggregating them at the semantic level, to generate the multi-path reasoning result.
  8. A multi-modal large-model-driven complex scene image data processing system for implementing the method of any one of claims 1-7, comprising: a first unit, configured to acquire target image data to be processed, and perform cross-modal feature extraction and feature analysis on the target image data to obtain a multi-scale semantic feature representation; a second unit, configured to determine a scene context dependency graph based on the multi-scale semantic feature representation, and explicitly encode spatial association relationships and semantic dependency relationships in the multi-scale semantic feature representation based on the scene context dependency graph, to obtain a context-enhanced feature representation; a third unit, configured to perform weighted aggregation on the context-enhanced feature representation through attention-directed feature fusion according to the context-enhanced feature representation and a preset task constraint condition, to generate a fused feature vector; a fourth unit, configured to perform multi-level semantic reasoning on the fused feature vector, and iteratively adjust weights of different semantic dimensions in the fused feature vector through dynamic feature recalibration, to obtain a task-oriented optimized feature representation; and a fifth unit, configured to execute a target processing operation based on the task-oriented optimized feature representation to obtain a processing result, and adaptively update the scene context dependency graph using quality-evaluation feedback information of the processing result.
  9. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any one of claims 1 to 7.
  10. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
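The graph construction and encoding recited in claims 3 and 4 can be illustrated with a minimal sketch. This is not the patented implementation: the cosine-similarity threshold, the Gaussian proximity kernel, and the single round of aggregation are illustrative assumptions standing in for the learned components the claims describe.

```python
import numpy as np

def build_scene_graph(node_feats, node_pos, sem_thresh=0.5):
    """Scene context dependency graph: edges keep pairwise cosine
    similarity (semantic association strength) above a threshold,
    scaled by spatial proximity (spatial association strength)."""
    norm = node_feats / np.linalg.norm(node_feats, axis=1, keepdims=True)
    sem = norm @ norm.T                              # semantic association
    dist = np.linalg.norm(node_pos[:, None] - node_pos[None, :], axis=-1)
    spat = np.exp(-dist)                             # spatial association
    adj = np.where(sem > sem_thresh, sem * spat, 0.0)
    np.fill_diagonal(adj, 0.0)                       # no self-loops
    return adj

def graph_encode(node_feats, adj):
    """One graph-convolution-style aggregation round with a residual
    connection, yielding the context-enhanced feature representation."""
    deg = adj.sum(axis=1, keepdims=True) + 1e-8      # avoid divide-by-zero
    prop = (adj / deg) @ node_feats                  # weighted neighbour mean
    return node_feats + prop                         # residual connection

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))      # five semantic nodes, 8-dim features
pos = rng.uniform(size=(5, 2))       # their 2-D spatial positions
adj = build_scene_graph(feats, pos)
enhanced = graph_encode(feats, adj)
```

In a real system the node features would come from the clustering and saliency screening of claim 4, and the aggregation weights would be learned rather than fixed kernels.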
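The attention-directed fusion of claim 5 reduces to a cross-attention pattern: score each context feature component against a task-constraint code, normalize to an attention distribution, weight-sum, then apply a nonlinear projection. In this sketch the projection matrix is random, standing in for the learned mapping to the target feature space.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

def attention_fuse(context_feats, task_code, out_dim=4, seed=0):
    """Attention-directed feature fusion per claim 5: matching scores ->
    attention weights -> weighted aggregation -> nonlinear projection."""
    scores = context_feats @ task_code        # semantic matching degree
    weights = softmax(scores)                 # attention weight distribution
    agg = weights @ context_feats             # weighted aggregation feature
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(context_feats.shape[1], out_dim))  # stand-in weights
    return np.tanh(agg @ W), weights          # fused feature vector

rng = np.random.default_rng(1)
ctx = rng.normal(size=(6, 8))    # six context-enhanced feature components
task = rng.normal(size=8)        # task constraint feature code
fused, w = attention_fuse(ctx, task)
```

The attention weights directly expose each component's contribution, which is what the claim uses to condition fusion on the task constraint.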
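The dynamic feature recalibration of claim 6 can likewise be sketched: split the fused vector into semantic subspaces, run a toy "reasoning" transform per path, score each path by agreement with the consensus, and rescale its dimensions accordingly. The tanh transform and the consensus-distance reliability score are illustrative assumptions, not the claimed reasoning network.

```python
import numpy as np

def recalibrate(fused, n_subspaces=4):
    """Dynamic feature recalibration per claim 6: per-subspace reasoning
    paths, consistency-based reliability scores, and weight-scaled
    enhancement/suppression of each semantic dimension group."""
    paths = np.split(fused, n_subspaces)          # semantic subspaces
    outs = [np.tanh(p) for p in paths]            # per-path reasoning output
    consensus = np.mean([o.mean() for o in outs]) # cross-path consensus
    # reliability: closeness of each path's output to the consensus
    rel = np.array([1.0 / (1.0 + abs(o.mean() - consensus)) for o in outs])
    weights = rel / rel.sum()                     # initial weight vector
    scaled = [w * p for w, p in zip(weights, paths)]  # enhance / suppress
    return np.concatenate(scaled), weights

fused = np.linspace(-1.0, 1.0, 8)    # toy fused feature vector
opt, w = recalibrate(fused)
```

The claim's residual-feedback update of the weight vector would iterate this scoring; a single pass is shown here for brevity.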

Description

Multi-mode large-model driven complex scene image data processing method and system

Technical Field

The invention relates to the technical field of artificial intelligence and image processing, and in particular to a multi-modal large-model-driven complex scene image data processing method and system.

Background

With the rapid development of computer vision technology, complex scene image data processing has become an important research direction in the field of artificial intelligence. Traditional image processing methods rely mainly on single-modal feature extraction and cannot effectively handle the rich semantic information in complex scenes. In recent years, progress in deep learning, and in particular the emergence of multi-modal large models, has provided a new technical path for complex scene image data processing. A multi-modal large model can simultaneously process information of multiple modalities, such as vision and text, and achieves more comprehensive scene understanding and analysis through cross-modal feature fusion.

Many deep learning methods exist in the current image processing field, including convolutional neural networks, the Transformer architecture, and various attention mechanisms. These methods have achieved remarkable results in tasks such as image classification, object detection, and semantic segmentation. However, the prior art still faces many challenges in image processing for complex scenes, especially in multi-scale semantic understanding, contextual reasoning, and feature fusion. The main defects and shortcomings of the prior art are as follows: traditional image processing methods struggle to effectively model context dependency relations in complex scenes, in particular the spatial associations and semantic dependencies among different visual elements, and therefore perform poorly on images rich in scene information.
Existing feature fusion techniques lack dynamic adaptive capability and cannot flexibly adjust feature weights according to different task demands, which limits the adaptability and accuracy of models across application scenarios. Most existing methods also lack an effective feedback mechanism, cannot adaptively adjust model parameters based on processing results, and are difficult to optimize continuously during complex scene processing.

Disclosure of Invention

The embodiments of the invention provide a multi-modal large-model-driven complex scene image data processing method and system, which can solve the problems in the prior art. In a first aspect of an embodiment of the invention, a multi-modal large-model-driven complex scene image data processing method is provided, comprising: acquiring target image data to be processed, and performing cross-modal feature extraction and feature analysis on the target image data to obtain a multi-scale semantic feature representation; determining a scene context dependency graph based on the multi-scale semantic feature representation, and explicitly encoding spatial association relationships and semantic dependency relationships in the multi-scale semantic feature representation based on the scene context dependency graph, to obtain a context-enhanced feature representation; performing weighted aggregation on the context-enhanced feature representation through attention-directed feature fusion according to the context-enhanced feature representation and a preset task constraint condition, to generate a fused feature vector; performing multi-level semantic reasoning on the fused feature vector, and iteratively adjusting weights of different semantic dimensions in the fused feature vector through dynamic feature recalibration, to obtain a task-oriented optimized feature representation; and executing a target processing operation based on the task-oriented optimized feature representation to obtain a processing result, and adaptively updating the scene context dependency graph using quality-evaluation feedback information of the processing result.

Acquiring target image data to be processed and performing cross-modal feature extraction and feature analysis on the target image data to obtain a multi-scale semantic feature representation comprises: performing multi-modal information separation on the target image data, decomposing the target image data into a visual-spatial information component and a semantic content information component; and respectively performing feature encoding on the visual-spatial information component and the semantic content information component, and performing feature extraction at different spatial resolution levels through cross-scale feature sampling to obtain a visual-spatial feature sequence and a semantic content feature sequence