CN-121999087-A - Joint method and system for multi-modal image data reconstruction and multi-task recognition
Abstract
The application discloses a joint method and system for multi-modal image data reconstruction and multi-task recognition, belonging to the field of image processing. The method comprises: performing data preprocessing on acquired multi-modal images of the same scene to obtain preprocessed multi-modal image data; extracting features from the preprocessed multi-modal images through a shared feature encoder, and mining complementary information among modalities in combination with a cross-modal attention mechanism to generate shared features; performing hierarchical structured reconstruction based on the shared features to generate a reconstruction result; performing multi-task recognition using the shared features and the reconstruction result to construct a multi-task model; and constructing a joint loss function from the reconstruction result and the multi-task recognition results. Through a deep collaboration mechanism, the application accurately mines the internal associations among images of different modalities, dynamically balances modality differences and complementarity, and fully exploits the value of each modality's data.
Inventors
- ZHANG BING
- WANG SHUI
- QIAN XIAOPAN
- LI HAO
- LU GUANGSHI
Assignees
- 安徽美图信息科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-28
Claims (10)
- 1. A joint multi-modal image data reconstruction and multi-task recognition method, the method comprising: performing data preprocessing on acquired multi-modal images of the same scene to obtain preprocessed multi-modal image data; extracting features from the preprocessed multi-modal images through a shared feature encoder, and mining complementary information among modalities in combination with a cross-modal attention mechanism to generate shared features; performing hierarchical structured reconstruction based on the shared features to generate a reconstruction result; performing multi-task recognition using the shared features and the reconstruction result to construct a multi-task model; constructing a joint loss function according to the reconstruction result and the multi-task recognition result; and updating parameters of the multi-task model using the joint loss function, thereby realizing collaborative optimization among shared feature extraction, hierarchical reconstruction, and multi-task recognition.
- 2. The method of claim 1, wherein the data preprocessing comprises: data alignment and calibration, noise removal and quality enhancement, region-of-interest extraction, modality feature normalization, and data cleaning and anomaly handling.
- 3. The method of claim 1, wherein extracting features from the preprocessed multi-modal images through the shared feature encoder comprises: using a convolutional neural network with shared parameters as the shared feature encoder to perform preliminary feature extraction on the image of each modality, obtaining primary feature maps of the different modalities; and fusing the primary feature maps of the different modalities to generate a unified primary shared feature representation.
- 4. The method of claim 3, wherein mining complementary information among modalities in combination with the cross-modal attention mechanism to generate the shared features comprises: processing the primary shared feature maps of the different modalities to obtain multi-scale feature maps; and fusing the multi-scale feature maps using the cross-modal attention mechanism to generate enhanced cross-modal features, thereby obtaining the shared features.
- 5. The method of claim 1, wherein the hierarchical structured reconstruction comprises pixel-level texture reconstruction, structure-level reconstruction, and semantic-level structured reconstruction.
- 6. The method of claim 1, wherein the multi-task model comprises a target detection unit, a semantic segmentation unit, and an attribute recognition unit; the target detection unit optimizes candidate-box generation and classification using structure-level contour information; the semantic segmentation unit optimizes segmentation boundaries using pixel-level texture consistency and structure-level edge constraints; and the attribute recognition unit performs multi-dimensional attribute inference by combining pixel-level textures, structure-level contours, and semantic-level labels.
- 7. The method of claim 1, wherein constructing the joint loss function according to the reconstruction result and the multi-task recognition result comprises: obtaining a target detection loss, a semantic segmentation loss, an attribute recognition loss, and a structural constraint loss according to the reconstruction result and the multi-task recognition result; and determining the joint loss function from the target detection loss, the semantic segmentation loss, the attribute recognition loss, and the structural constraint loss.
- 8. The method of claim 1, wherein updating the multi-task model parameters using the joint loss function comprises: dividing the multi-task model parameters into shared parameters, task parameters, and structuring parameters; based on the joint loss function, assigning a weight to each loss term through a dynamic weight adjustment mechanism and computing the corresponding gradients; and updating the different parameter types according to the gradients.
- 9. The method of claim 8, wherein assigning a weight to each loss term through the dynamic weight adjustment mechanism comprises: dynamically adjusting the weight coefficient of each loss term in the joint loss function according to training progress or task performance.
- 10. A joint multi-modal image data reconstruction and multi-task recognition system, the system comprising: a preprocessing module for preprocessing acquired multi-modal images of the same scene to obtain preprocessed multi-modal image data; a feature module for extracting features from the preprocessed multi-modal images through a shared feature encoder and mining complementary information among modalities in combination with a cross-modal attention mechanism to generate shared features; a reconstruction module for performing hierarchical structured reconstruction based on the shared features to generate a reconstruction result; a recognition module for performing multi-task recognition using the shared features and the reconstruction result to construct a multi-task model; a construction module for constructing a joint loss function according to the reconstruction result and the multi-task recognition result; and an updating module for updating parameters of the multi-task model using the joint loss function, realizing collaborative optimization among shared feature extraction, hierarchical reconstruction, and multi-task recognition.
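As an illustrative sketch of the fusion step in claims 3 and 4 (not part of the claims themselves), the cross-modal attention mechanism can be approximated as single-head scaled dot-product attention in which one modality queries the other, so complementary information is pulled into the shared representation. The two-modality setup, the token/channel shapes, and the random projection matrices standing in for learned weights are all assumptions for illustration; the patent does not specify a concrete architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(feat_a, feat_b, d_k=32, seed=0):
    """Fuse two modality feature maps (N tokens x C channels):
    modality A provides the queries, modality B the keys/values,
    so complementary information in B is attended into A.
    Projection matrices are random stand-ins for learned weights."""
    rng = np.random.default_rng(seed)
    c = feat_a.shape[1]
    w_q = rng.standard_normal((c, d_k)) / np.sqrt(c)
    w_k = rng.standard_normal((c, d_k)) / np.sqrt(c)
    w_v = rng.standard_normal((c, c)) / np.sqrt(c)
    q, k, v = feat_a @ w_q, feat_b @ w_k, feat_b @ w_v
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (N_a, N_b) attention map
    return feat_a + attn @ v                # residual fusion

# Toy "primary feature maps": 64 spatial tokens, 16 channels each,
# standing in for flattened visible-light and infrared feature maps.
rgb = np.random.default_rng(1).standard_normal((64, 16))
ir = np.random.default_rng(2).standard_normal((64, 16))
shared = cross_modal_attention(rgb, ir)
print(shared.shape)  # (64, 16)
```

In a full model this fusion would be applied at each scale of the multi-scale feature maps mentioned in claim 4, and symmetrically in both directions (A queries B and B queries A) before merging.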
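The joint loss and dynamic weight adjustment of claims 7 through 9 can be sketched as follows. The four loss terms match claim 7; the specific weighting rule (a softmax over per-task loss-descent ratios, in the spirit of dynamic weight averaging) is an assumption for illustration, since the patent only requires that weights adapt to training progress or task performance.

```python
import numpy as np

def dynamic_weights(prev_losses, curr_losses, temperature=2.0):
    """Assign a weight to each loss term from its descent ratio:
    tasks whose loss is falling slowly (ratio near 1) get larger
    weights than tasks improving quickly. The softmax keeps the
    weights positive and summing to the number of tasks."""
    ratios = np.asarray(curr_losses) / np.asarray(prev_losses)
    e = np.exp(ratios / temperature)
    return len(ratios) * e / e.sum()

def joint_loss(losses, weights):
    """Weighted sum of the detection, segmentation, attribute, and
    structural-constraint losses (claim 7)."""
    return float(np.dot(weights, losses))

# Hypothetical losses at two consecutive epochs, ordered as
# [detection, segmentation, attribute, structural constraint].
prev = [1.00, 0.80, 0.60, 0.40]
curr = [0.90, 0.50, 0.58, 0.39]
w = dynamic_weights(prev, curr)
total = joint_loss(curr, w)
print(np.round(w, 3), round(total, 3))
```

Here segmentation, whose loss dropped fastest, receives the smallest weight, while the slowly improving structural constraint receives the largest, which is the behavior claim 9 asks of the dynamic adjustment mechanism. Gradients of `total` would then be routed to the shared, task, and structuring parameter groups of claim 8.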
Description
Joint method and system for multi-modal image data reconstruction and multi-task recognition
Technical Field
The application belongs to the field of image processing, and in particular relates to a joint method and system for multi-modal image data reconstruction and multi-task recognition.
Background
With the rapid development of artificial intelligence technology, image data serves as an important carrier of information, and the demand for processing and understanding it is increasingly prominent across industries. From medical image diagnosis to autonomous-driving environment perception, from remote sensing to security monitoring, the scale and complexity of image data continue to grow, and traditional image processing and recognition techniques increasingly struggle to meet the high-precision, high-efficiency requirements of practical applications. Against this background, deep learning has become a core driving force for technical innovation in the image field by virtue of its strong capability for automatic feature learning. As an important branch of machine learning, deep learning builds multi-level neural network models that break free of the traditional reliance on hand-designed features: conventional image recognition often requires experts to design feature extraction rules (edges, textures, and so on) from experience, which is time-consuming, labor-intensive, and prone to errors from subjective judgment, whereas a deep learning model can autonomously learn complex feature patterns directly from massive data, with particularly clear advantages on high-dimensional, nonlinear image data. Among deep models, the advent of Convolutional Neural Networks (CNNs) brought a breakthrough to the image field.
Modeled loosely on the biological visual system, a CNN captures local image features through multi-layer convolution operations, achieves feature dimensionality reduction and spatial invariance through pooling, and can automatically mine multi-level features from low-level pixels up to high-level semantics. For example, in an image classification task a CNN can learn, step by step from raw pixels, feature representations of edges, contours, and parts up to the complete target; in a target detection task it can precisely locate a target's position and identify its class. By virtue of excellent feature extraction and generalization performance, CNNs have become the mainstream method in image recognition and are widely applied in scenarios such as animal and plant recognition, face recognition, and industrial defect detection, effectively addressing the insufficient accuracy of traditional methods in complex scenes. In practical applications, however, image data often exhibits a "multi-modal" characteristic: the same object or scene may yield several types of images from different sensors (e.g., a visible-light camera and an infrared sensor) and different imaging modes (e.g., RGB images and infrared images). These modal data differ in feature distribution, resolution, and information emphasis (e.g., infrared images excel at capturing temperature information, while visible-light images emphasize color and texture), yet they carry complementary correlations (e.g., cross-modal correspondence of the same object).
Traditional deep learning methods are designed mainly for single-modality images or simply concatenate multi-modal data, making it difficult to fully mine collaborative information among modalities. Meanwhile, image reconstruction (e.g., deblurring, completing missing regions, fusing multi-modal information) and recognition (e.g., detection, segmentation, attribute judgment) are usually carried out step by step as independent tasks, so information is lost during transmission, the tasks cannot guide and optimize one another, and bottlenecks arise in processing efficiency and accuracy. How to realize efficient fusion of multi-modal image data together with collaborative optimization of hierarchical structured reconstruction and multi-task recognition on the basis of deep learning has therefore become a key direction for image understanding in complex scenes. A defect of the prior art is that image processing and image recognition are performed independently rather than synchronously, which lowers the speed and efficiency of image processing. (1) Multi-modal data processing remains superficial and collaborative information mining is insufficient. The prior art handles multi-modal image data in a relatively simple and coarse way. On the one hand, most methods design models only for single-modality images, such as processing a visible-light or infrared image alone, completely ignoring compl