CN-121982478-A - Model optimization method, device, equipment and medium based on multi-scale attention fusion
Abstract
The application provides a model optimization method based on multi-scale attention fusion. Multi-scale feature fusion comprehensively exploits visual information at different scales and, combined with learnable latent variables, dynamically optimizes the feature representation so that the model can effectively distinguish visually similar structures. A cross-modal attention mechanism is further introduced to associate the text prompt with the multi-scale features, and optimization ensures that the model's attention accurately matches the annotated region. Through multi-scale feature extraction, learnable latent variable optimization and cross-modal attention fusion, the method effectively mitigates recognition bias among similar regions of medical images, improves the match between the region annotated by the visual prompt and the model's attention, and thereby improves the accuracy of the model in identifying image regions and in its image analysis results. The method can be applied to a vehicle damage assessment function in the financial field or a medical image analysis function in the medical field, improving the accuracy of the corresponding vehicle damage assessment or medical image analysis results.
Inventors
- WANG JIANZONG
- ZHANG XULONG
- SHI JIAQI
Assignees
- 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (10)
- 1. A model optimization method based on multi-scale attention fusion, characterized by comprising the following steps: acquiring an image to be processed, together with a domain text prompt and a visual prompt related to the image to be processed; performing multi-scale feature extraction on the image to be processed based on different network layers of a multi-scale visual encoder to generate a visual feature map at each scale, one per resolution; performing learnable latent variable optimization on the visual feature map of each scale based on the visual prompt to generate an enhanced visual feature representation for each scale; performing multi-scale attention fusion on the enhanced visual feature representations of all scales based on the domain text prompt to generate a fused attention map, wherein the fused attention map represents the degree of association between the domain text prompt and the affected region at different scales; optimizing the degree of matching between the fused attention map and the visual prompt to adjust the learnable latent variables, such that the fused attention map matches the region annotated by the visual prompt; and optimizing a multi-modal large language model based on the optimized enhanced visual feature representations, and generating response information corresponding to the domain text prompt and the visual prompt, thereby completing the optimization of the multi-modal large language model.
- 2. The model optimization method according to claim 1, wherein the visual feature maps at the respective resolutions comprise a high-resolution visual feature map, a medium-resolution visual feature map and a low-resolution visual feature map, and wherein performing multi-scale feature extraction on the image to be processed based on the multi-scale visual encoder to generate the visual feature map at each scale comprises: extracting the high-resolution visual feature map, containing detail information, from the image to be processed based on a shallow network of the multi-scale visual encoder; extracting the medium-resolution visual feature map, containing structural information, based on a middle-layer network of the multi-scale visual encoder; and extracting the low-resolution visual feature map, containing semantic information, based on a deep network of the multi-scale visual encoder (see the encoder sketch following the claims).
- 3. The model optimization method of claim 1, wherein performing learnable latent variable optimization on the visual feature map of each scale based on the visual prompt to generate the enhanced visual feature representation for each scale comprises: creating a learnable latent variable for the visual feature map of each scale, wherein the initial value of each learnable latent variable is a zero tensor; and adding the learnable latent variable of each scale element-wise to its corresponding visual feature map to generate the enhanced visual feature representation for that scale (see the latent-variable sketch following the claims).
- 4. The model optimization method of claim 1, wherein performing multi-scale attention fusion on the enhanced visual feature representations of all scales based on the domain text prompt to generate the fused attention map comprises: computing a context representation of the domain text prompt, and computing an attention response map between the context representation and the enhanced visual feature representation of each scale, respectively, to obtain each single-scale attention map; and fusing all the single-scale attention maps via a preset learnable weight matrix to generate the fused attention map (see the fusion sketch following the claims).
- 5. The model optimization method of claim 4, wherein fusing each single-scale attention map by weighted summation with the learnable weight matrix to generate the fused attention map comprises: assigning a learnable fusion weight to each single-scale attention map; performing a weighted summation of the single-scale attention maps based on their corresponding learnable fusion weights to generate an initial fused attention map; and regularizing the initial fused attention map based on medical prior knowledge to generate the fused attention map (also covered by the fusion sketch following the claims).
- 6. The model optimization method of claim 1, wherein optimizing the degree of matching between the fused attention map and the visual prompt to adjust the learnable latent variables comprises: normalizing the fused attention map, and converting the normalized fused attention map into a probability map representing the probability distribution of spatial attention; determining the region covered by the visual prompt in the image to be processed, and generating, from the pixel values of that region, a binary mask map of the same size as the fused attention map; calculating an energy function value based on the probability map and the binary mask map, wherein the energy function value represents the degree of difference between the distribution of the probability map and the ideal distribution indicated by the binary mask map; and calculating the gradient of the energy function value with respect to the learnable latent variables, and iteratively adjusting the values of the learnable latent variables according to the gradient direction and magnitude so as to reduce the energy function value, thereby completing the adjustment of the learnable latent variables (see the energy-function sketch following the claims).
- 7. The model optimization method according to any one of claims 1-6, wherein optimizing the multi-modal large language model based on the optimized enhanced visual feature representations, generating the response information corresponding to the domain text prompt and the visual prompt, and completing the optimization of the multi-modal large language model comprises: performing feature fusion and serialization on the optimized multi-scale enhanced visual feature representations to generate a visual token sequence, and tokenizing the domain text prompt to generate a text token sequence; splicing the visual token sequence and the text token sequence to generate a multi-modal input sequence; and inputting the multi-modal input sequence into a pre-trained multi-modal large language model for forward inference, and detokenizing the logits tensor output by the multi-modal large language model to generate the response information (see the tokenization sketch following the claims).
- 8. A model optimization device based on multi-scale attention fusion, the model optimization device comprising: an optimization parameter acquisition module for acquiring an image to be processed, together with a domain text prompt and a visual prompt related to the image to be processed; a visual feature extraction module for performing multi-scale feature extraction on the image to be processed based on different network layers of a multi-scale visual encoder to generate a visual feature map at each scale, one per resolution; a latent variable optimization module for performing learnable latent variable optimization on the visual feature map of each scale based on the visual prompt to generate an enhanced visual feature representation for each scale; an attention map fusion module for performing multi-scale attention fusion on the enhanced visual feature representations of all scales based on the domain text prompt to generate a fused attention map, wherein the fused attention map represents the degree of association between the domain text prompt and the affected region at different scales; a visual feature optimization module for optimizing the degree of matching between the fused attention map and the visual prompt to adjust the learnable latent variables, such that the fused attention map matches the region annotated by the visual prompt; and a language model optimization module for optimizing a multi-modal large language model based on the optimized enhanced visual feature representations, generating response information corresponding to the domain text prompt and the visual prompt, and completing the optimization of the multi-modal large language model.
- 9. A computer device comprising a processor, a memory, and a multi-scale attention fusion based model optimization program stored on the memory and executable by the processor, wherein the multi-scale attention fusion based model optimization program, when executed by the processor, implements the steps of the multi-scale attention fusion based model optimization method of any one of claims 1 to 7.
- 10. A computer-readable storage medium, on which a model optimization program based on multi-scale attention fusion is stored, wherein the model optimization program based on multi-scale attention fusion, when executed by a processor, implements the steps of the model optimization method based on multi-scale attention fusion as claimed in any one of claims 1 to 7.
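The sketches below illustrate the claimed steps in PyTorch. The patent names no framework, architecture or hyperparameters, so every module name, shape and value is an illustrative assumption, not the patent's implementation. The encoder sketch maps claim 2's shallow, middle and deep layers to high-, medium- and low-resolution feature maps using a toy convolutional backbone:

```python
import torch
import torch.nn as nn

class MultiScaleVisualEncoder(nn.Module):
    """Toy stand-in for the multi-scale visual encoder of claim 2 (assumed
    architecture): the shallow stage keeps full resolution (detail), the
    middle stage halves it (structure), the deep stage halves it again
    (semantics)."""

    def __init__(self, in_ch: int = 3, dim: int = 64):
        super().__init__()
        self.shallow = nn.Sequential(  # high-resolution / detail information
            nn.Conv2d(in_ch, dim, 3, stride=1, padding=1), nn.ReLU())
        self.middle = nn.Sequential(   # medium-resolution / structural information
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.deep = nn.Sequential(     # low-resolution / semantic information
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x: torch.Tensor):
        f_high = self.shallow(x)       # (B, C, H,   W)
        f_mid = self.middle(f_high)    # (B, C, H/2, W/2)
        f_low = self.deep(f_mid)       # (B, C, H/4, W/4)
        return [f_high, f_mid, f_low]

feats = MultiScaleVisualEncoder()(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])
```

A real system would likely tap intermediate layers of a pretrained backbone rather than train a fresh one; the three-stage toy keeps the resolution hierarchy visible.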
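The latent-variable sketch covers claim 3: one zero-initialized learnable tensor per scale, added element-wise, so the enhanced representation initially equals the original features. `LatentEnhancer` and its shapes are assumptions:

```python
import torch
import torch.nn as nn

class LatentEnhancer(nn.Module):
    """Per-scale learnable latent variables (claim 3): each latent is a
    zero-initialized tensor shaped like its feature map (batch dimension
    broadcast), added element-wise, so optimization starts from the
    original features unchanged."""

    def __init__(self, shapes):
        super().__init__()
        self.latents = nn.ParameterList(
            [nn.Parameter(torch.zeros(*s)) for s in shapes])

    def forward(self, feature_maps):
        # enhanced_i = features_i + latent_i (element-wise addition)
        return [f + z for f, z in zip(feature_maps, self.latents)]

# Shapes must match the encoder outputs; these values are assumptions.
enhancer = LatentEnhancer([(64, 224, 224), (64, 112, 112), (64, 56, 56)])
```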
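The fusion sketch covers claims 4-5, assuming the domain text prompt has already been encoded into a single context vector; the medical-prior regularization of claim 5 is omitted. `MultiScaleAttentionFusion` and all dimensions are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttentionFusion(nn.Module):
    """Cross-modal attention fusion sketch (claims 4-5): the dot product
    between a projected text context vector and every spatial feature gives
    one attention response map per scale; the maps are resized to a common
    resolution and combined by softmax-normalized learnable fusion weights."""

    def __init__(self, num_scales: int = 3, dim: int = 64, txt_dim: int = 64):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, dim)                # align text to visual dim
        self.fusion_w = nn.Parameter(torch.zeros(num_scales))  # learnable fusion weights

    def forward(self, enhanced_feats, text_context, out_size):
        q = self.txt_proj(text_context)                        # (B, C) context representation
        maps = []
        for f in enhanced_feats:                               # f: (B, C, H_s, W_s)
            attn = torch.einsum('bc,bchw->bhw', q, f)          # single-scale response map
            attn = F.interpolate(attn.unsqueeze(1), size=out_size,
                                 mode='bilinear', align_corners=False)
            maps.append(attn)                                  # (B, 1, H, W)
        w = torch.softmax(self.fusion_w, dim=0)                # normalized weights
        fused = sum(wi * m for wi, m in zip(w, maps))          # weighted summation
        return fused.squeeze(1)                                # fused attention map (B, H, W)
```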
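The energy-function sketch covers claim 6. The patent does not fix the exact energy form; a cross-entropy between the attention probability map and a uniform distribution over the masked region is one plausible choice. The `energy` helper and the commented loop (reusing names from the sketches above) are hypothetical:

```python
import torch

def energy(fused_attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Claim 6 sketch: energy measuring the gap between the attention
    distribution and the ideal distribution given by the visual-prompt mask.
    fused_attn: (B, H, W) raw fused attention map.
    mask:       (B, H, W) binary mask, 1 inside the annotated region."""
    # Normalize the fused attention map into a spatial probability map.
    prob = torch.softmax(fused_attn.flatten(1), dim=1)
    # Ideal distribution: uniform over the annotated pixels.
    target = mask.flatten(1).float()
    target = target / target.sum(dim=1, keepdim=True).clamp_min(1e-8)
    # Cross-entropy form (one possible choice; the patent does not fix it).
    return -(target * prob.clamp_min(1e-8).log()).sum(dim=1).mean()

# Gradient-based adjustment of the latent variables only (names reuse the
# hypothetical sketches above):
# opt = torch.optim.SGD(enhancer.parameters(), lr=0.1)
# for _ in range(num_steps):
#     fused = fusion(enhancer(encoder(image)), text_ctx, out_size=mask.shape[-2:])
#     loss = energy(fused, mask)
#     opt.zero_grad(); loss.backward(); opt.step()
```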
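The tokenization sketch covers claim 7: serializing the multi-scale enhanced features into a visual token sequence that can be spliced with text tokens. The LLM, tokenizer and embedder in the commented usage are placeholders, not a real API:

```python
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Claim 7 sketch: fuse and serialize the multi-scale enhanced features
    into a visual token sequence for the multi-modal large language model."""

    def __init__(self, dim: int = 64, llm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, llm_dim)  # map visual channels to the LLM width

    def forward(self, enhanced_feats):
        tokens = [f.flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, H*W, C)
                  for f in enhanced_feats]
        seq = torch.cat(tokens, dim=1)          # serialize across scales
        return self.proj(seq)                   # visual tokens (B, N_vis, llm_dim)

# Splicing and forward inference (llm, tokenizer and embed are placeholders):
# vis = VisualTokenizer()(enhanced_feats)             # visual token sequence
# txt = embed(tokenizer(domain_text_prompt))          # text token sequence
# logits = llm(inputs_embeds=torch.cat([vis, txt], dim=1))
# response = tokenizer.decode(logits.argmax(-1))      # detokenization
```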
Description
Model optimization method, device, equipment and medium based on multi-scale attention fusion
Technical Field
The present application relates to the field of image processing, and in particular to a model optimization method, apparatus, computer device and computer-readable storage medium based on multi-scale attention fusion.
Background
Current multi-modal large models have made significant progress on general visual-language understanding tasks and can broadly describe and answer questions about natural images. However, when applied to highly specialized fields with extremely high accuracy requirements, such as medical image analysis and financial document auditing, their recognition accuracy falls short. Images in these professional fields (such as medical CT (computed tomography) scans and vehicle insurance damage photographs) are highly structured and specialized; general models lack domain knowledge, and their analysis is typically global and descriptive, failing to meet the requirements of accurate localization and professional judgment, so the accuracy of image analysis results in these fields is low.
For example, training-free visual prompt learning may be applied in the medical field to assist doctors in medical image analysis. When a doctor needs to analyze a complex CT scan, the region of interest can be marked by scribbling or clicking, and targeted questions can then be posed to the model. However, when analyzing medical images with such models, a problem arises: medical images often contain multiple similar organs or tissue structures that can be visually very close, making it difficult for the model to accurately distinguish the particular region marked by the user from surrounding similar regions. In an abdominal CT scan, for instance, the liver, spleen and kidneys may have similar densities and textures; when a physician marks one organ and asks about it, the model may erroneously shift attention to a neighboring similar organ, affecting the accuracy of the output medical image analysis results.
In the financial field, particularly in car insurance claims, large numbers of accident-scene photographs and damage detail images are processed to evaluate losses and determine claim amounts. A car insurance photograph usually contains the whole vehicle, various parts and a complex background. When a damage assessor asks about a particular part (e.g., whether there is a crack in the left headlight cover), the model may confuse the left and right headlights or misinterpret non-critical scratches as significant damage, affecting the accuracy of the output damage assessment results.
Therefore, how to improve the accuracy of a model's image analysis results has become a pressing technical problem.
Disclosure of Invention
The application mainly aims to provide a model optimization method, apparatus, computer device and computer-readable storage medium based on multi-scale attention fusion, with the aim of improving the accuracy of a model's medical image analysis results.
In order to achieve the above object, the present application provides a model optimization method based on multi-scale attention fusion, the method comprising the steps of: acquiring an image to be processed, together with a domain text prompt and a visual prompt related to the image to be processed; performing multi-scale feature extraction on the image to be processed based on different network layers of a multi-scale visual encoder to generate a visual feature map at each scale, one per resolution; performing learnable latent variable optimization on the visual feature map of each scale based on the visual prompt to generate an enhanced visual feature representation for each scale; performing multi-scale attention fusion on the enhanced visual feature representations of all scales based on the domain text prompt to generate a fused attention map, wherein the fused attention map represents the degree of association between the domain text prompt and the affected region at different scales; optimizing the degree of matching between the fused attention map and the visual prompt to adjust the learnable latent variables, such that the fused attention map matches the region annotated by the visual prompt; and optimizing a multi-modal large language model based on the optimized enhanced visual feature representations, and generating response information corresponding to the domain text prompt and the visual prompt, thereby completing the optimization of the multi-modal large language model (wired together end-to-end in the sketch below). In addition, in order to achieve the above object, the present application further provides a model optimization device based on multi-scale attention fusion.
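For illustration only, the hypothetical pieces defined in the sketches after the claims can be wired together end-to-end as follows; the classes, shapes and step count are all assumptions carried over from those sketches, not the patent's implementation:

```python
import torch

# Hypothetical end-to-end wiring of the sketches defined after the claims.
encoder = MultiScaleVisualEncoder()
enhancer = LatentEnhancer([(64, 224, 224), (64, 112, 112), (64, 56, 56)])
fusion = MultiScaleAttentionFusion()

image = torch.randn(1, 3, 224, 224)  # image to be processed
text_ctx = torch.randn(1, 64)        # domain text prompt context (assumed precomputed)
mask = torch.zeros(1, 224, 224)      # binary mask from the visual prompt region
mask[:, 60:120, 60:120] = 1

opt = torch.optim.SGD(enhancer.parameters(), lr=0.1)  # adjust latents only
for _ in range(20):
    fused = fusion(enhancer(encoder(image)), text_ctx, out_size=(224, 224))
    loss = energy(fused, mask)
    opt.zero_grad(); loss.backward(); opt.step()
# The optimized enhanced features would then be serialized by VisualTokenizer
# and spliced with the text tokens for the multi-modal large language model.
```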