CN-121997989-A - Visual model construction method and system based on modularized neural network architecture

Abstract

The invention relates to the technical field of neural network architecture and discloses a visual model construction method and system based on a modularized neural network architecture. The method comprises: receiving task requirements and performance standards input by a client, and parsing out a set of subtasks to be executed through a semantic understanding module; for each subtask, automatically constructing and training a corresponding specialized neural network module, wherein each module uses a lightweight convolution structure and is trained independently with a specific loss function to generate a customized pre-trained model for that module; evaluating the performance of each customized pre-trained model and optimizing it according to the evaluation result; and integrating the optimized customized pre-trained models that meet the standard into a central fusion model that satisfies the client's requirements, and outputting a final processing result for the input task requirements.

Inventors

LI JIAJUN
LI JIAZHENG
LI JIAHUI

Assignees

佳美惠通科技(深圳)有限公司

Dates

Publication Date: 20260508
Application Date: 20250911

Claims (10)

1. A visual model construction method based on a modularized neural network architecture, characterized by comprising the following steps: S1, receiving task requirements and performance standards input by a client, and parsing out a set of subtasks to be executed through a semantic understanding module, wherein the subtasks include, but are not limited to, edge detection, color analysis and shape recognition; S2, for each subtask, automatically constructing and training a corresponding specialized neural network module, wherein each module uses a lightweight convolution structure and undergoes independent, customized training with a specific loss function so as to generate a customized pre-trained model corresponding to each specialized neural network module; S3, evaluating the performance of each customized pre-trained model, and optimizing the customized pre-trained models according to the evaluation results; and S4, integrating the optimized customized pre-trained models to form a central fusion model that meets the requirements of the client, and outputting a final processing result for the input task requirements.
2. The method for constructing a visual model based on a modular neural network architecture according to claim 1, wherein in step S2, the training process of the specialized neural network module includes: automatically generating subtask-related synthetic data using a generative adversarial network to enhance the generalization capability of the module; independently configuring the network depth, activation function and loss function of each module according to the characteristics of its subtask; and training with a dynamically adjusted learning rate so that each specialized neural network module achieves optimal performance on its specific feature extraction task.
3. The method of claim 2, wherein the configuration of the loss function further comprises: for the module corresponding to the edge detection task, implementing the module as a convolutional neural network with an encoder-decoder structure and training it with a hybrid loss function combining cross-entropy loss and Dice loss; for the module corresponding to the color analysis task, implementing the module as a fully convolutional network and training it with a loss function based on the difference between color distributions; and for the module corresponding to the shape recognition task, implementing the module as a deep convolutional neural network and training it with a multi-class cross-entropy loss function.
4. The method of claim 1, wherein in step S3, performing performance evaluation on each customized pre-trained model further comprises: S31, evaluating the current performance of each customized prediction model on a held-out validation dataset; S32, if the performance of a model does not reach the standard, automatically adjusting its training hyperparameters or supplementing new training data for it, and retraining; and S33, repeating steps S31-S32 until the performance of all the customized prediction models meets the preset standard, wherein the preset standard is jointly defined by the performance standard and the efficiency index proposed by the client.
5. The method for constructing a visual model based on a modular neural network architecture according to claim 1, wherein in step S4, the integration of the central fusion model specifically comprises: performing model fusion by attention-based feature fusion, wherein the attention weight of each customized pre-trained model is computed from the feature vector output by the i-th customized pre-trained model, a learnable parameter matrix, a bias vector and an attention weight vector, and the resulting attention weight of the i-th customized pre-trained model is used in a weighted sum to obtain the fused feature.
6. The method of claim 5, wherein the integrating of the central fusion model further comprises: establishing a shared latent-space mapping function that maps the heterogeneous feature vectors output by each customized pre-trained model into the same metric space; and jointly optimizing the mapping function with the central fusion model through an adversarial training process, so that semantically related features from different customized pre-trained models are close together in the latent space and unrelated features are far apart, thereby achieving effective alignment and concatenation of cross-modal features.
7. The method for constructing a visual model based on a modular neural network architecture according to claim 1, wherein in step S4, a corresponding interpretable report is output together with the final processing result, the report tracing back to the decision interpretation information of each specialized neural network module and specifically including: the contribution weight of each specialized neural network module to the current decision, a heat map visualizing the basic visual features extracted by each module on the input image, and a visual marker for any sub-module with abnormal output.
8. A modular neural network architecture-based vision model building system, comprising: a task receiving and analyzing module for receiving task requirements and performance standards input by the client, and parsing out a set of subtasks to be executed through the semantic understanding module, wherein the subtasks include, but are not limited to, edge detection, color analysis and shape recognition; a model building and training module for automatically constructing and training a corresponding specialized neural network module for each subtask, wherein each module uses a lightweight convolution structure and undergoes independent, customized training with a specific loss function so as to generate a customized pre-trained model corresponding to each specialized neural network module; a performance analysis and optimization module for evaluating the performance of each customized pre-trained model and optimizing the customized pre-trained models according to the evaluation results; and a model integration and interaction module for integrating the optimized customized pre-trained models that meet the standard to form a central fusion model that meets the requirements of the client, and outputting a final processing result for the input task requirements.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements a modular neural network architecture-based vision model building method according to any one of claims 1-8.
10. An electronic device comprising one or more processors and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the modular neural network architecture-based vision model building method of any of claims 1-8.
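
Two of the computations named in the claims, the hybrid cross-entropy plus Dice loss of claim 3 and the attention-weighted fusion of claim 5, can be illustrated with a minimal numerical sketch. This is not the applicant's implementation. The formula of claim 5 does not survive in this text, so the sketch assumes one common form that uses exactly the listed quantities (additive attention): a score e_i = v · tanh(W h_i + b), weights α_i = softmax(e_i), and a fused feature Σ α_i h_i. The tanh nonlinearity, the softmax normalization, the mixing coefficient `ce_weight`, and the function names are all assumptions made for illustration. The feature vectors are assumed to share one dimension already, which claim 6 obtains through the shared latent-space mapping.

```python
import math


def hybrid_edge_loss(probs, targets, ce_weight=0.5, eps=1e-7):
    """Binary cross-entropy plus Dice loss over flattened edge-map pixels.

    probs: predicted edge probabilities in [0, 1]; targets: 0/1 labels.
    ce_weight is a hypothetical mixing coefficient, not taken from the claims.
    """
    n = len(probs)
    ce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(probs, targets)
    ) / n
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1 - (2 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return ce_weight * ce + (1 - ce_weight) * dice


def attention_fuse(features, W, b, v):
    """Assumed additive-attention fusion of per-module feature vectors.

    features: list of vectors h_i, all of length d
    W: k x d parameter matrix, b: length-k bias, v: length-k weight vector
    Returns (alphas, fused): the attention weights and the weighted sum.
    """
    scores = []
    for h in features:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(features[0])
    fused = [sum(a * h[j] for a, h in zip(alphas, features)) for j in range(dim)]
    return alphas, fused
```

In a trained system `W`, `b` and `v` would be learned jointly with the fusion model; here they are plain lists supplied by the caller. The returned `alphas` are the kind of per-module contribution weights that the interpretable report of claim 7 refers to.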

Description

Visual model construction method and system based on modularized neural network architecture

Technical Field

The invention relates to the technical field of neural network architecture, in particular to a visual model construction method and system based on a modularized neural network architecture.

Background

In recent years, deep-learning-based computer vision has made significant progress, and single, monolithic convolutional neural network (CNN) models trained end to end (e.g., VGG, ResNet) have become the dominant implementation. Such models are typically pre-trained on large-scale generic datasets (such as ImageNet), learn to extract hierarchical features from low level to high level, and are transferred to specific tasks through fine-tuning.

However, this conventional monolithic architecture has several inherent drawbacks. First, it is computationally expensive: training and deploying a massive monolithic network requires significant computing resources and time, and such a network is difficult to run efficiently on resource-constrained edge devices. Second, the models are poorly interpretable: their decision process resembles a "black box", making it difficult to trace which input features are responsible for a particular output, which is a serious barrier to application in high-risk fields such as medicine and autonomous driving. Third, the models depend heavily on large amounts of high-quality annotated data and overfit easily when data is scarce, resulting in insufficient generalization. In addition, such models have poor fault tolerance: failure or performance degradation of a component in the network may directly cause the whole system to fail.

To overcome these limitations, the prior art attempts to employ transfer learning, i.e., adapting models pre-trained on a generic dataset to new fields by fine-tuning. This approach also faces significant challenges: the high-level features learned by the pre-trained model depend strongly on the source data domain (e.g., natural images), and when the target domain differs significantly from the source domain (e.g., medical images, remote sensing images), the features in the deep filters are nearly useless. Fine-tuning then essentially spends significant computation overwriting the model's original, irrelevant knowledge, which is inefficient and nearly equivalent to retraining.

At present, building a specialized, modular model system still depends heavily on algorithm engineers for manual task decomposition, module design and joint debugging. This manually driven workflow is inefficient and hard to scale, and the system architecture cannot be constructed and optimized dynamically according to task requirements. There is therefore a great need in the art for a solution that can automatically design, train, integrate and optimize a series of specialized modules and ultimately form an efficient, robust and interpretable vision processing system.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provides a visual model construction method and system based on a modularized neural network architecture, which automatically divides an instruction into a plurality of tasks, automatically trains a specialized, smaller neural network module for each task, and integrates the outputs of these modules to realize efficient, interpretable and robust computer vision processing.
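
The overall flow stated above (split the instruction into tasks, train one small module per task, retrain until the performance standard is met, then integrate the qualifying modules) can be illustrated with a toy control-flow sketch. It is not the disclosed implementation: keyword matching stands in for the semantic understanding module, a made-up score update stands in for real training and validation, and the names `Module`, `parse_subtasks`, `train`, `build` and `max_rounds` are hypothetical.

```python
from dataclasses import dataclass

# Subtask names taken from the text; the list is "not limited to" these.
KNOWN_SUBTASKS = ("edge detection", "color analysis", "shape recognition")


@dataclass
class Module:
    subtask: str
    score: float = 0.0  # toy validation score in [0, 1]
    rounds: int = 0     # number of training rounds so far


def parse_subtasks(requirement):
    """S1 stand-in: keyword matching instead of a semantic understanding module."""
    text = requirement.lower()
    return [name for name in KNOWN_SUBTASKS if name.split()[0] in text]


def train(module):
    """S2 stand-in: each round moves the toy score halfway toward 1.0."""
    module.rounds += 1
    module.score += (1.0 - module.score) * 0.5


def build(requirement, target, max_rounds=10):
    """Run S1 to S4 and return the modules that qualify for integration."""
    modules = [Module(name) for name in parse_subtasks(requirement)]
    for m in modules:
        train(m)  # S2: initial independent training
        while m.score < target and m.rounds < max_rounds:
            train(m)  # S3: evaluate, then retrain if below the standard
    # S4: only modules that meet the standard are passed on to fusion
    return [m for m in modules if m.score >= target]


result = build("Detect edge defects and check color", target=0.9)
print([(m.subtask, m.rounds) for m in result])
# prints [('edge detection', 4), ('color analysis', 4)]
```

The `max_rounds` cap is a design choice of the sketch so that the retraining loop always terminates; the text itself only requires repeating evaluation and retraining until the preset standard is met.
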
In one aspect, a method for constructing a visual model based on a modularized neural network architecture is provided, including the following steps: S1, receiving task requirements and performance standards input by a client, and parsing out a set of subtasks to be executed through a semantic understanding module, wherein the subtasks include, but are not limited to, edge detection, color analysis and shape recognition; S2, for each subtask, automatically constructing and training a corresponding specialized neural network module, wherein each module uses a lightweight convolution structure and undergoes independent, customized training with a specific loss function so as to generate a customized pre-trained model corresponding to each specialized neural network module; S3, evaluating the performance of each customized pre-trained model, and optimizing the customized pre-trained models according to the evaluation results; and S4, integrating the optimized customized pre-trained models to form a central fusion model that meets the requirements of the client, and outputting a final processing result for the input task requirements. Further, in step S2, the training process of the specialized neural network module includes: automatically generating subtask-related synthetic data using an anta