CN-122021902-A - Image-text correlation detection method and device


Abstract

The disclosure provides an image-text correlation detection method and device, relating to the technical field of artificial intelligence, and in particular to the technical fields of image processing, computer vision, natural language processing, large models and deep learning. The method comprises: performing multidimensional correlation detection on image context information and document structure data to generate an image-text correlation score; in response to determining that the image-text correlation score is not greater than a first preset score threshold and not less than a second preset score threshold, generating context abstract information based on the image context information and the document structure data; generating a prompt word based on domain knowledge, similar examples and the context abstract information; and inputting the image context information, the document structure data and the prompt word into a large language model, and outputting an image-text correlation detection result.

Inventors

HAN XINYING

Assignees

Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-29

Claims (13)

1. An image-text correlation detection method, comprising: performing multidimensional correlation detection on image context information and document structure data to generate an image-text correlation score; in response to determining that the image-text correlation score is not greater than a first preset score threshold and not less than a second preset score threshold, generating context abstract information based on the image context information and the document structure data; generating a prompt word based on domain knowledge, similar examples and the context abstract information; and inputting the image context information, the document structure data and the prompt word into a large language model, and outputting an image-text correlation detection result.
2. The method of claim 1, wherein performing multidimensional correlation detection on the image context information and the document structure data to generate the image-text correlation score comprises: performing correlation detection in at least two dimensions on the image context information and the document structure data, and generating similarity scores corresponding to the at least two dimensions, wherein the correlation detection comprises at least two of position adjacency, explicit reference detection, chapter consistency detection, semantic density detection and drawing information detection; and performing weighted summation on the similarity scores corresponding to the at least two dimensions to generate the image-text correlation score.
3. The method of claim 2, wherein performing correlation detection in at least two dimensions on the image context information and the document structure data and generating similarity scores corresponding to the at least two dimensions comprises at least two of: determining a similarity score corresponding to the position adjacency based on the distance between the image and the preceding and following paragraphs and on the paragraph length; detecting a preset reference pattern in the text through a regular expression, and determining a similarity score corresponding to the explicit reference detection; detecting preset keywords in chapter titles, and determining a similarity score corresponding to the chapter consistency detection; counting the number of preset professional terms per preset number of words of the image context information, and determining a similarity score corresponding to the semantic density detection; and determining a similarity score corresponding to the drawing information detection based on the drawing length and the effective information in the drawing.
4. The method of claim 1, wherein generating the prompt word based on the domain knowledge, the similar examples and the context abstract information comprises: loading a domain knowledge base and an example library; retrieving the similar examples from the example library based on the context abstract information; converting matching rule information of the multidimensional correlation detection into natural language prompt information; and filling the domain knowledge in the domain knowledge base, the similar examples, the natural language prompt information and the context abstract information into a prompt word template to generate the prompt word.
5. The method of claim 1, wherein inputting the image context information, the document structure data and the prompt word into the large language model and outputting the image-text correlation detection result comprises: initializing a configuration of a large language model application program interface (LLM API); inputting the prompt word, an image uniform resource locator and a reasoning depth to the LLM API; configuring a maximum token number and a temperature coefficient according to the reasoning depth; and calling the LLM API to reason over the image context information, the document structure data and the prompt word, so as to generate the image-text correlation detection result.
6. The method of claim 5, wherein inputting the image context information, the document structure data and the prompt word into the large language model and outputting the image-text correlation detection result further comprises: constructing a reverse verification prompt word corresponding to the prompt word; inputting the image context information, the document structure data and the reverse verification prompt word into the large language model, and outputting a reverse verification result; and adjusting the confidence of the image-text correlation detection result based on the reverse verification result.
7. The method of claim 4, wherein the method further comprises: acquiring manual feedback data corresponding to the image-text correlation detection result; and updating the example library and the prompt word template based on the manual feedback data.
8. The method of claim 7, wherein updating the example library and the prompt word template based on the manual feedback data comprises: calculating the accuracy of different prompt word templates based on the manual feedback data; selecting a target prompt word template based on the accuracy; in response to the data quantity of the manual feedback data reaching a preset data quantity threshold, analyzing the error types corresponding to the manual feedback data; extracting samples from the error cases based on the error types, and constructing a new example; and adding the new example to the example library.
9. The method of any one of claims 1-8, wherein the method further comprises: outputting an image-text match result in response to determining that the image-text correlation score is greater than the first preset score threshold; and outputting an image-text mismatch result in response to determining that the image-text correlation score is less than the second preset score threshold.
10. An image-text correlation detection device, comprising: a first detection module configured to perform multidimensional correlation detection on image context information and document structure data to generate an image-text correlation score; a first generation module configured to generate context abstract information based on the image context information and the document structure data in response to determining that the image-text correlation score is not greater than a first preset score threshold and not less than a second preset score threshold; a second generation module configured to generate a prompt word based on domain knowledge, similar examples and the context abstract information; and a second detection module configured to input the image context information, the document structure data and the prompt word into a large language model and output an image-text correlation detection result.
11. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
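
The multidimensional scoring recited in claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the regular expression, the keyword and term lists, the weights, the normalisation constants and the `ImageContext` fields are all hypothetical values chosen for illustration, and the drawing information dimension is omitted because the claims only require at least two dimensions.

```python
import re
from dataclasses import dataclass

# Illustrative assumptions: none of these patterns, keywords, terms or weights
# come from the disclosure.
REFERENCE_PATTERN = re.compile(r"\b(?:figure|fig\.)\s*\d+", re.IGNORECASE)
CHAPTER_KEYWORDS = ("architecture", "workflow", "results")
DOMAIN_TERMS = frozenset({"encoder", "latency", "throughput"})
WEIGHTS = {"position": 0.2, "reference": 0.4, "chapter": 0.2, "density": 0.2}


@dataclass
class ImageContext:
    preceding_paragraph: str
    following_paragraph: str
    chapter_title: str
    paragraphs_between: int  # distance from the image to the nearest body text

    @property
    def text(self) -> str:
        return f"{self.preceding_paragraph} {self.following_paragraph}"


def position_adjacency(ctx: ImageContext) -> float:
    """Closer images with longer neighbouring paragraphs score higher."""
    distance_score = 1.0 / (1 + ctx.paragraphs_between)
    length_score = min(len(ctx.text.strip()) / 200.0, 1.0)
    return 0.5 * distance_score + 0.5 * length_score


def explicit_reference(ctx: ImageContext) -> float:
    """1.0 if the surrounding text contains a reference such as 'Figure 2'."""
    return 1.0 if REFERENCE_PATTERN.search(ctx.text) else 0.0


def chapter_consistency(ctx: ImageContext) -> float:
    """1.0 if the chapter title contains any preset keyword."""
    title = ctx.chapter_title.lower()
    return 1.0 if any(keyword in title for keyword in CHAPTER_KEYWORDS) else 0.0


def semantic_density(ctx: ImageContext, per_words: int = 100,
                     target_hits: int = 5) -> float:
    """Preset terms per `per_words` words, capped at 1.0 once `target_hits` is reached."""
    words = [w.strip(".,;:()").lower() for w in ctx.text.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in DOMAIN_TERMS)
    return min((hits * per_words / len(words)) / target_hits, 1.0)


def correlation_score(ctx: ImageContext, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the per-dimension similarity scores, normalised by total weight."""
    scores = {
        "position": position_adjacency(ctx),
        "reference": explicit_reference(ctx),
        "chapter": chapter_consistency(ctx),
        "density": semantic_density(ctx),
    }
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total
```

Dividing by the total weight keeps the result in the range 0 to 1 even when only a subset of dimensions is weighted, which makes it easier to compare against fixed score thresholds.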

Description

Image-text correlation detection method and device

Technical Field

The present disclosure relates to the field of artificial intelligence, and more particularly to the fields of image processing, computer vision, natural language processing, large models, and deep learning.

Background

With the explosive growth of digital content, image-text correlation detection in rich text documents has become a core requirement for content quality control, information retrieval optimization and compliance auditing. As specialized content continues to emerge in many fields, image-text correlation detection must not only judge the degree of association between images and text accurately, but also adapt to the terminology systems, scene characteristics and document structures of different fields in order to meet the application requirements of diverse scenarios. The detection efficiency, adaptability and practicality of such methods have attracted wide attention in the industry.

At present, several technical approaches to image-text correlation detection have taken shape. The rule matching method judges image-text correlation through fixed rules such as regular expressions and keyword matching; it needs no machine learning support and can be applied quickly in simple scenarios. The traditional machine learning method adopts classical models such as SVM (Support Vector Machine) and CRF (Conditional Random Field), and relies on manually labeled features and a large amount of training data to construct a detection model, providing a basic technical scheme for image-text matching. The deep learning method builds an end-to-end image-text matching network based on deep learning models such as CNN (Convolutional Neural Network) and BERT (Bidirectional Encoder Representations from Transformers), and judges correlation from features that the model learns autonomously. The multimodal large model method directly invokes multimodal large models such as GPT (Generative Pre-trained Transformer), guides the model with simple prompt words, and makes full use of the cross-modal understanding capability of the large model. The hybrid method combines rules with simple models for preliminary filtering, and integrates the basic strengths of the two techniques to complete the detection work.

Disclosure of Invention

The embodiments of the present disclosure provide a method, a device, equipment, a storage medium and a program product for image-text correlation detection.

In a first aspect, an embodiment of the present disclosure provides an image-text correlation detection method, comprising: performing multidimensional correlation detection on image context information and document structure data to generate an image-text correlation score; in response to determining that the image-text correlation score is not greater than a first preset score threshold and not less than a second preset score threshold, generating context abstract information based on the image context information and the document structure data; generating a prompt word based on domain knowledge, similar examples and the context abstract information; and inputting the image context information, the document structure data and the prompt word into a large language model, and outputting an image-text correlation detection result.

In a second aspect, an embodiment of the present disclosure provides an image-text correlation detection device, comprising: a first detection module configured to perform multidimensional correlation detection on image context information and document structure data to generate an image-text correlation score; a first generation module configured to generate context abstract information based on the image context information and the document structure data in response to determining that the image-text correlation score is not greater than a first preset score threshold and not less than a second preset score threshold; a second generation module configured to generate a prompt word based on domain knowledge, similar examples and the context abstract information; and a second detection module configured to input the image context information, the document structure data and the prompt word into a large language model and output an image-text correlation detection result.

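
The control flow of the first-aspect method can be sketched in Python as follows. This is a hedged illustration only: the threshold values, the prompt template wording, the `summarize` helper and the `llm` callable are assumptions, not part of the disclosure. The two early-return branches correspond to the match and mismatch outputs recited in claim 9; only scores inside the band between the two thresholds reach the large language model.

```python
from typing import Callable, Sequence

# Assumed values; the disclosure only speaks of a first and a second preset
# score threshold, with the first being the upper bound of the band.
FIRST_THRESHOLD = 0.8
SECOND_THRESHOLD = 0.3

# Hypothetical prompt word template with slots for the three inputs.
PROMPT_TEMPLATE = (
    "Domain knowledge:\n{knowledge}\n\n"
    "Similar examples:\n{examples}\n\n"
    "Context abstract:\n{summary}\n\n"
    "Decide whether the image is related to the surrounding text."
)

# The model is injected as a callable taking (image context, document
# structure, prompt word), so no particular LLM API is assumed.
LLM = Callable[[str, str, str], str]


def summarize(image_context: str, document_structure: str) -> str:
    """Hypothetical stand-in for context abstract generation: truncate and join."""
    return f"{document_structure[:80]} | {image_context[:200]}"


def build_prompt(knowledge: str, examples: Sequence[str], summary: str) -> str:
    """Fill domain knowledge, similar examples and the abstract into the template."""
    return PROMPT_TEMPLATE.format(
        knowledge=knowledge, examples="\n".join(examples), summary=summary
    )


def detect(score: float, image_context: str, document_structure: str,
           knowledge: str, examples: Sequence[str], llm: LLM) -> str:
    """Route on the correlation score; call the model only for the middle band."""
    if score > FIRST_THRESHOLD:
        return "match"
    if score < SECOND_THRESHOLD:
        return "mismatch"
    # Not greater than the first threshold and not less than the second.
    summary = summarize(image_context, document_structure)
    prompt = build_prompt(knowledge, examples, summary)
    return llm(image_context, document_structure, prompt)
```

Passing the model in as a callable keeps the sketch testable with a stub, and leaves the choice of model interface and its configuration open.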
In a third aspect, an embodiment of the present disclosure provides an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect. In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in the first aspect. In a fifth aspect, embodiments of the present disclosure propose a computer program produ