CN-122024025-A - Method and computing equipment for analyzing images based on multi-mode large model


Abstract

Embodiments of this specification provide a method and a computing device for analyzing an image based on a multimodal large model. The method comprises: acquiring an image to be analyzed; and, by the multimodal large model, determining abnormal visual information based on pixel information of the image, determining abnormal logical information from the image, performing cross-verification based on the abnormal visual information and the abnormal logical information, and generating an analysis result that explains whether an abnormal region exists in the image.

Inventors

ZENG FANWEI
LI JIANSHU
YAO WEIBIN

Assignees

蚂蚁区块链科技(上海)有限公司

Dates

Publication Date: 20260512
Application Date: 20260120

Claims (10)

1. A method of analyzing an image based on a multimodal large model, the method comprising: acquiring an image to be analyzed; and determining, by the multimodal large model, abnormal visual information based on pixel information of the image, determining abnormal logical information based on semantic conflicts in the image, and performing cross-verification based on the abnormal visual information and the abnormal logical information to generate an analysis result, wherein the analysis result is used to explain whether an abnormal region exists in the image.
2. The method of claim 1, wherein the analysis result includes a detection conclusion indicating whether an abnormal region exists in the image, location information indicating a location of the abnormal region, and an explanatory reason explaining why the abnormal region exists in the image.
3. The method of claim 1, wherein the determining, by the multimodal large model, abnormal visual information based on pixel information of the image, determining abnormal logical information based on semantic conflicts in the image, and performing cross-verification based on the abnormal visual information and the abnormal logical information to generate an analysis result comprises: generating, by the multimodal large model, a chain of thought based on the image, the chain of thought comprising statements of a plurality of execution steps and an execution result of each of the execution steps, the plurality of execution steps comprising determining abnormal visual information based on pixel information of the image, determining abnormal logical information based on semantic conflicts in the image, performing cross-verification based on the abnormal visual information and the abnormal logical information, and determining a location of an abnormal region in the image; and generating, by the multimodal large model, the analysis result based on the execution result of each execution step.
4. The method of claim 1, wherein the determining, by the multimodal large model, abnormal visual information based on pixel information of the image, determining abnormal logical information based on semantic conflicts in the image, and performing cross-verification based on the abnormal visual information and the abnormal logical information to generate an analysis result comprises: encoding the image to obtain an embedding sequence; and inputting the embedding sequence into the multimodal large model, the multimodal large model determining abnormal visual information based on pixel information of the image, determining abnormal logical information based on semantic conflicts in the image, and performing cross-verification based on the abnormal visual information and the abnormal logical information to generate the analysis result.
5. The method of claim 1, wherein the multimodal large model is obtained by supervised learning training on first sample data, the first sample data comprising a first sample image and a training label, the training label comprising a label chain of thought and a label analysis result.
6. The method of claim 5, wherein the label chain of thought includes statements of a plurality of label execution steps and label execution results of the respective label execution steps.
7. The method of claim 5, wherein after the supervised learning training of the multimodal large model on the first sample data, the method further comprises: obtaining second sample data, wherein the second sample data comprises a second sample image and a ground-truth label, and the ground-truth label comprises a ground-truth conclusion, a ground-truth location, and a ground-truth reason; generating, by the multimodal large model, a second analysis result based on the second sample image, the second analysis result including a predicted conclusion, a predicted location, and a predicted reason; calculating a reward score based on the second analysis result and the ground-truth label; and adjusting network parameters of the multimodal large model based on the reward score by a reinforcement learning algorithm.
8. The method of claim 7, wherein the calculating a reward score based on the second analysis result and the ground-truth label comprises: comparing the predicted location with the ground-truth location to obtain a first reward score; calculating a similarity between the predicted reason and the ground-truth reason to obtain a second reward score; and determining the reward score according to the first reward score and the second reward score.
9. The method of claim 8, wherein the method further comprises: performing format detection on the second analysis result to obtain a third reward score; and wherein the determining the reward score according to the first reward score and the second reward score includes: determining the reward score according to the first reward score, the second reward score, and the third reward score.
10. A computing device comprising a memory and a processor, the memory having executable code stored therein, wherein the processor, when executing the executable code, implements the method of any of claims 1-9.
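
As an illustration only, the scoring used for reinforcement learning in claims 7 to 9 can be sketched as follows. The claims do not say how the location comparison, the reason similarity, or the format detection are computed, nor how the three scores are combined. The sketch therefore assumes box intersection over union for the location, a character-level ratio from Python's difflib for the reason, a JSON object with three hypothetical field names for the format, and a weighted sum for the combination. None of these choices comes from the source.

```python
import json
from difflib import SequenceMatcher

# Hypothetical output schema: the source names a conclusion, a location and a
# reason, but does not fix field names or a serialization format.
REQUIRED_KEYS = {"conclusion", "location", "reason"}


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes (assumed metric)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def reason_similarity(predicted, truth):
    """Character-level similarity in [0, 1]; a stand-in for any text similarity."""
    return SequenceMatcher(None, predicted, truth).ratio()


def format_score(raw_output):
    """1.0 if the output parses as a JSON object with the required keys, else 0.0."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed):
        return 1.0
    return 0.0


def total_reward(raw_output, truth, weights=(1.0, 1.0, 1.0)):
    """Combine location, reason and format scores (weighted sum is an assumption)."""
    third = format_score(raw_output)
    if third == 0.0:
        return 0.0  # nothing can be scored from malformed output
    predicted = json.loads(raw_output)
    first = box_iou(predicted["location"], truth["location"])
    second = reason_similarity(predicted["reason"], truth["reason"])
    return weights[0] * first + weights[1] * second + weights[2] * third
```

Calling `total_reward(model_output_text, truth_label)` yields a single scalar that a reinforcement learning algorithm could consume. The equal default weights are arbitrary and would need tuning in practice.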

Description

Method and computing equipment for analyzing images based on multi-mode large model

Technical Field

The embodiments of this specification belong to the technical field of image processing, and in particular relate to a method and a computing device for analyzing images based on a multimodal large model.

Background

With the rapid development of image processing, analyzing whether an abnormal region exists in an image is widely applied in scenarios such as medical image-assisted diagnosis, industrial defect detection, intelligent security, and image tampering forensics. An abnormal region is a local region in an image that deviates significantly from the normal pattern, expected structure, or background context, and that generally does not conform, in appearance, texture, shape, semantics, or statistical characteristics, to the distribution of normal samples in the scene. For example, in medical imaging, it is necessary to accurately analyze and locate the spatial coordinates of the abnormal region where a lesion or abnormal tissue is located; in image forgery analysis, it is necessary to detect tampering traces and mark the modified abnormal region. In addition, to improve the reliability of the detection result, the analysis system generally needs to provide an understandable explanation of the reason for and the position of the abnormal region, to help the user understand the basis for the judgment and perform manual review.

Currently, many image analysis methods rely mainly on detecting low-level visual cues (e.g., noise distribution, compression artifacts) to analyze whether abnormal regions exist in an image. A common practice is to divide the image into local blocks or sliding windows, extract noise residual statistics and compression artifact features in each local region, compare them with the "normal" statistical pattern of the neighborhood or of the whole image, and calculate the degree of difference. When certain regions deviate significantly from the background or from the normal distribution in these low-level statistics, they are determined to be likely abnormal regions, and corresponding anomaly heat maps or region localization results are produced.

However, such methods tend to be of limited effectiveness in scenes where visual traces are not apparent. For example, when an amount in an image is tampered from "11.0" to "10.0", the edited area is usually small and may have undergone refinement such as font matching, edge smoothing, antialiasing, and recompression, so that the tampered area is highly consistent with the surrounding background in low-level visual characteristics such as noise statistics and compression distortion, and it is difficult to form a recognizable abnormal signal. If the detection logic depends excessively on whether a significant visual difference exists, then once the forgery traces are deliberately weakened, or the abnormal operation itself introduces no significant inconsistency, the risk of missed detection by the model increases significantly, which limits the accuracy and robustness of the overall analysis.

Therefore, there is a need for a method that can accurately analyze whether an abnormal region exists in an image.

Disclosure of Invention

The invention aims to provide a method and a computing device for analyzing an image based on a multimodal large model, so as to accurately analyze whether an abnormal region exists in the image.
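
For illustration, the block-based low-level detection described in the Background can be sketched as follows. This is a minimal sketch of the conventional approach, not of the method of this specification. It assumes a grayscale image given as a list of rows, a 3x3 mean filter as the noise residual extractor, the per-block residual standard deviation as the only statistic, and a robust z-score (median and median absolute deviation across blocks) as the degree of difference. The function names, block size, and threshold are hypothetical.

```python
from statistics import median, pstdev


def noise_residual(img):
    """Each pixel minus the mean of its 3x3 neighbourhood (clipped at the borders)."""
    h, w = len(img), len(img[0])
    res = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            res[y][x] = img[y][x] - sum(neigh) / len(neigh)
    return res


def block_scores(img, block=8):
    """Map (block_row, block_col) to a robust z-score of the residual spread."""
    res = noise_residual(img)
    h, w = len(img), len(img[0])
    spread = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            vals = [
                res[y][x]
                for y in range(by, by + block)
                for x in range(bx, bx + block)
            ]
            spread[(by // block, bx // block)] = pstdev(vals)
    med = median(spread.values())
    # Median absolute deviation; fall back to a tiny value to avoid dividing by zero.
    mad = median(abs(v - med) for v in spread.values()) or 1e-9
    return {key: abs(v - med) / mad for key, v in spread.items()}


def abnormal_blocks(img, block=8, threshold=3.5):
    """Blocks whose residual statistics deviate strongly from the image-wide pattern."""
    return [key for key, score in block_scores(img, block).items() if score > threshold]
```

The dictionary returned by `block_scores` plays the role of a coarse anomaly heat map. A carefully refined edit, such as the "11.0" to "10.0" example, would leave these residual statistics almost unchanged, so a detector of this kind would likely miss it, which is the limitation the Background describes.
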
A first aspect of this specification provides a method for analyzing an image based on a multimodal large model, comprising: acquiring an image to be analyzed; and determining, by the multimodal large model, abnormal visual information based on pixel information of the image, determining abnormal logical information based on semantic conflicts in the image, and performing cross-verification based on the abnormal visual information and the abnormal logical information to generate an analysis result, wherein the analysis result is used to explain whether an abnormal region exists in the image.

A second aspect of this specification provides a computing device comprising a memory and a processor, the memory having executable code stored therein, wherein the processor, when executing the executable code, implements the method of the first aspect.

In the scheme provided by this specification, image analysis determines whether an abnormal region exists in the image by extracting abnormal visual information and abnormal logical information from the image and cross-verifying the two during reasoning. The mutual verification of the abnormal visual information and the abnormal logical information improves the accuracy of abnormal region detection, and meanw