CN-121999508-A - Processing method, equipment, storage medium and program product for graphic and text semantic association

CN121999508A

Abstract

The application provides a processing method, device, storage medium and program product for image-text semantic association. The method comprises: determining, for a document to be processed, a plurality of text elements, a plurality of picture elements, and the picture title and in-picture text corresponding to each picture element; for each picture element, acquiring at least one first text element within a preset range of the picture element; determining association weights between the picture element and each text element; determining first similarities between the in-picture text corresponding to the picture element and each first text element; determining second similarities between the picture title corresponding to the picture element and each second text element; determining, according to the at least one association weight, the at least one first similarity and the at least one second similarity, at least one target text element associated with the picture element; and associating the picture element with the at least one target text element to obtain a corresponding retrieval block. The method improves the processing efficiency of image-text documents.

Inventors

  • Ai Hongfeng
  • Zhang Hang
  • Pei Hongxiang
  • Zheng Guowei
  • Wang Lin
  • He Jianjiang
  • Deng Shaowen

Assignees

  • China United Network Communications Group Co., Ltd. (中国联合网络通信集团有限公司)

Dates

Publication Date
2026-05-08
Application Date
2024-11-06

Claims (14)

  1. A processing method for image-text semantic association, characterized by comprising the following steps: determining a plurality of text elements, a plurality of picture elements, and a picture title and in-picture text corresponding to each picture element of a document to be processed; for each picture element, acquiring at least one first text element within a preset range of the picture element; determining an association weight between the picture element and each text element, wherein the association weight indicates the degree of positional association between the picture element and each text element; determining a first similarity between the in-picture text corresponding to the picture element and each first text element, and determining a second similarity between the picture title corresponding to the picture element and each second text element, wherein a second text element is a text element in the at least one first text element other than the picture title; determining at least one target text element associated with the picture element according to the at least one association weight, the at least one first similarity and the at least one second similarity; and associating the picture element with the at least one target text element to obtain a retrieval block corresponding to the picture element.
  2. The method according to claim 1, wherein the method further comprises: determining, based on a pre-trained neural network model, a classification probability between the picture element and each first text element; correspondingly, determining the at least one target text element associated with the picture element according to the at least one association weight, the at least one first similarity and the at least one second similarity comprises: determining the at least one target text element associated with the picture element according to the at least one association weight, the at least one first similarity, the at least one second similarity and at least one classification probability.
  3. The method according to claim 1 or 2, wherein determining the association weight between the picture element and each text element comprises: determining a first association parameter according to the picture title corresponding to the picture element, wherein the first association parameter is any value from 0 to 1; and determining the association weight between the picture element and each text element according to the first association parameter, the picture position of the picture element and the text position of each text element.
  4. The method according to claim 3, characterized in that the method further comprises: if the picture title corresponding to the picture element is empty, determining a second association parameter, wherein the second association parameter is smaller than the first association parameter; and determining the association weight between the picture element and each text element according to the second association parameter, the picture position of the picture element and the text position of each text element.
  5. The method according to claim 1 or 2, wherein determining the first similarity between the in-picture text corresponding to the picture element and each first text element comprises: performing text vectorization on the in-picture text corresponding to the picture element to obtain an in-picture text feature; performing text vectorization on each first text element to obtain a corresponding first text element feature; and obtaining the first similarity according to the in-picture text feature and at least one first text element feature.
  6. The method according to claim 1 or 2, wherein determining the second similarity between the picture title corresponding to the picture element and each second text element comprises: determining the second text elements according to the picture title corresponding to the picture element and the first text elements; performing text vectorization on the picture title corresponding to the picture element to obtain a picture title feature; performing text vectorization on each second text element to obtain a corresponding second text element feature; and obtaining the second similarity according to the picture title feature and at least one second text element feature.
  7. The method according to claim 1 or 2, wherein determining a plurality of text elements, a plurality of picture elements, and the picture title and in-picture text corresponding to each picture element of the document to be processed comprises: performing element analysis on the document to be processed to obtain a plurality of text elements and a plurality of picture elements of the document to be processed; determining the text position and font attribute of each text element, and the picture position of each picture element; generating the in-picture text corresponding to each picture element; and, for each picture element, determining the picture title of the picture element according to the text position and font attribute of each text element and the picture position of the picture element.
  8. The method according to claim 7, wherein generating, for each picture element, the in-picture text corresponding to the picture element comprises: for each picture element, performing text recognition on the picture element to obtain the text word count of the picture element; and, if the word count meets a preset in-picture text word count requirement, generating the in-picture text.
  9. The method according to claim 7, wherein determining, for each picture element, the picture title of the picture element according to the text position and font attribute of each text element and the picture position of the picture element comprises: for each picture element, determining, according to the text position of each text element and the picture position of the picture element, a third text element that meets a preset positional relation with the picture element; if the font attribute of the third text element differs from the font attributes of the other text elements, determining that the third text element is the picture title of the picture element; and/or, if the third text element contains a preset element character, determining that the third text element is the picture title of the picture element; and/or, if the text word count of the third text element meets a preset word count requirement, determining that the third text element is the picture title of the picture element.
  10. The method according to claim 1 or 2, characterized in that the method further comprises: receiving a retrieval request, wherein the retrieval request comprises retrieval information; and obtaining a retrieval result according to the retrieval information and the retrieval blocks corresponding to the picture elements stored in an image-text database, and returning the retrieval result.
  11. A processing device for image-text semantic association, comprising: a first determining module, configured to determine a plurality of text elements, a plurality of picture elements, and a picture title and in-picture text corresponding to each picture element of a document to be processed; an acquisition module, configured to acquire, for each picture element, at least one first text element within a preset range of the picture element; a first processing module, configured to determine an association weight between the picture element and each text element, wherein the association weight indicates the degree of positional association between the picture element and each text element; a second processing module, configured to determine a first similarity between the in-picture text corresponding to the picture element and each first text element, and determine a second similarity between the picture title corresponding to the picture element and each second text element, wherein a second text element is a text element in the at least one first text element other than the picture title; a second determining module, configured to determine at least one target text element associated with the picture element according to at least one association weight, at least one first similarity and at least one second similarity; and an association module, configured to associate the picture element with the at least one target text element to obtain a retrieval block corresponding to the picture element.
  12. An electronic device, characterized by comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory to implement the method according to any one of claims 1 to 10.
  13. A computer-readable storage medium having computer-executable instructions stored therein, wherein the computer-executable instructions, when executed, are used to implement the method according to any one of claims 1 to 10.
  14. A computer program product comprising a computer program which, when executed, implements the method according to any one of claims 1 to 10.
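Claims 7 through 9 describe detecting a picture title from layout and typography: find a nearby "third text element", then accept it as the title if its font differs from the body font, it contains a preset element character, or its word count is small enough. The sketch below is a hypothetical illustration of those heuristics only; the one-dimensional positions, the marker strings, and every threshold are assumed values, not disclosed in the patent.

```python
# Hypothetical, simplified illustration of the title-detection heuristics
# in claims 7-9. Positions are vertical coordinates only, and all markers
# and thresholds are assumptions, not values from the patent.
PRESET_ELEMENT_MARKERS = ("Figure", "Fig.", "图")  # assumed title markers
MAX_TITLE_DISTANCE = 50.0   # assumed "preset positional relation"
MAX_TITLE_WORDS = 20        # assumed "preset word count requirement"

def find_picture_title(picture, text_elements, common_font):
    """Return the text judged to be the picture title, or None if no
    candidate satisfies any of the claimed conditions."""
    # Third text element: the nearest element within the preset distance.
    candidates = [t for t in text_elements
                  if abs(t["y"] - picture["y"]) <= MAX_TITLE_DISTANCE]
    if not candidates:
        return None
    third = min(candidates, key=lambda t: abs(t["y"] - picture["y"]))
    distinct_font = third["font"] != common_font          # claim 9, rule 1
    has_marker = third["text"].startswith(PRESET_ELEMENT_MARKERS)  # rule 2
    short_enough = len(third["text"].split()) <= MAX_TITLE_WORDS   # rule 3
    if distinct_font or has_marker or short_enough:
        return third["text"]
    return None

elements = [
    {"y": 120.0, "font": "bold-9pt", "text": "Figure 1: login flow"},
    {"y": 400.0, "font": "regular-11pt",
     "text": "A body paragraph located far from the picture."},
]
title = find_picture_title({"y": 100.0}, elements, common_font="regular-11pt")
```

Note that the three rules are joined by "and/or" in claim 9, so any one of them suffices; here the candidate matches both the font rule and the marker rule.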

Description

Processing method, equipment, storage medium and program product for graphic and text semantic association

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular to a processing method, device, storage medium and program product for image-text semantic association.

Background

Currently, knowledge question answering that combines retrieval-augmented generation with large language models is attracting industry attention. Such techniques excel at processing and generating natural-language text, particularly in scenarios where relevant information must be retrieved from vast amounts of data to give accurate answers. However, they are limited when handling mixed image-text knowledge, and perform poorly on documents that include images, such as user manuals and academic papers. To read, preprocess and process image-text knowledge files effectively, academia and industry mainly apply two strategies: first, extracting the text information in an image via optical character recognition (OCR) and combining it with a text block; second, adopting OpenAI's CLIP (Contrastive Language-Image Pretraining) model, an image-text multimodal model that, given an image and a group of text fragments, can find the text fragment matching the image. The two approaches differ in emphasis: OCR focuses on extracting clear text information from the image, whereas the CLIP model attaches more importance to the semantic information underlying the image. In practical application, however, the existing knowledge-file processing methods suffer from low retrieval efficiency.
Disclosure of Invention

The embodiments of the present application provide a processing method, device, storage medium and program product for image-text semantic association, which are used to solve the problem of low retrieval efficiency in existing knowledge-file processing methods.

In a first aspect, an embodiment of the present application provides a processing method for image-text semantic association, including: determining a plurality of text elements, a plurality of picture elements, and a picture title and in-picture text corresponding to each picture element of a document to be processed; for each picture element, acquiring at least one first text element within a preset range of the picture element; determining an association weight between the picture element and each text element, wherein the association weight indicates the degree of positional association between the picture element and each text element; determining a first similarity between the in-picture text corresponding to the picture element and each first text element, and determining a second similarity between the picture title corresponding to the picture element and each second text element, wherein a second text element is a text element in the at least one first text element other than the picture title; determining at least one target text element associated with the picture element according to the at least one association weight, the at least one first similarity and the at least one second similarity; and associating the picture element with the at least one target text element to obtain a retrieval block corresponding to the picture element.
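The steps of the first aspect can be sketched end to end. Everything below is an illustrative reconstruction: the patent does not disclose the text-vectorization model, the association-weight formula, or the score-fusion rule, so a toy bag-of-words vectorizer with cosine similarity, an exponential decay over vertical distance, and an equal-weight average with an assumed threshold stand in for them.

```python
from collections import Counter
from math import exp, sqrt

def vectorize(text):
    # Toy bag-of-words "text vectorization"; the real model is unspecified.
    return Counter(text.lower().replace(".", " ").replace(":", " ").split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def association_weight(pic_y, txt_y, alpha):
    # Positional association degree scaled by the first association
    # parameter; exponential distance decay is an assumption.
    return alpha * exp(-abs(pic_y - txt_y) / 100.0)

picture = {
    "y": 100.0,
    "title": "Figure 1: login flow",
    "in_picture_text": "username password submit login",  # e.g. via OCR
}
first_text_elements = [  # text elements within the preset range
    {"y": 90.0, "text": "Enter the username and password, then submit the login form."},
    {"y": 160.0, "text": "Figure 1: login flow"},  # the title itself
    {"y": 180.0, "text": "Billing codes are listed in Appendix A."},
]
alpha = 0.8  # first association parameter, any value in [0, 1]

scores = []
for t in first_text_elements:
    w = association_weight(picture["y"], t["y"], alpha)
    s1 = cosine(vectorize(picture["in_picture_text"]), vectorize(t["text"]))
    # Second similarity applies only to elements other than the title.
    s2 = (cosine(vectorize(picture["title"]), vectorize(t["text"]))
          if t["text"] != picture["title"] else 0.0)
    scores.append((w + s1 + s2) / 3.0)  # assumed equal-weight fusion

# Target text elements: candidates whose fused score clears a threshold.
targets = [t["text"] for t, s in zip(first_text_elements, scores) if s >= 0.4]
retrieval_block = {"picture": picture, "associated_text": targets}
```

With these toy values, only the caption-like sentence adjacent to the picture is selected, which is the intended effect of combining position with both similarities. The classification probability of the second aspect would simply enter the fusion as a fourth term.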
In one possible implementation, the method further comprises: determining, based on the pre-trained neural network model, a classification probability between the picture element and each first text element; correspondingly, determining the at least one target text element associated with the picture element according to the at least one association weight, the at least one first similarity and the at least one second similarity comprises: determining the at least one target text element associated with the picture element according to the at least one association weight, the at least one first similarity, the at least one second similarity and at least one classification probability.

In one possible implementation, determining the association weight between the picture element and each text element includes: determining a first association parameter according to the picture title corresponding to the picture element, wherein the first association parameter is any value from 0 to 1; and determining the association weight between the picture element and each text element according to the first association parameter, the picture position of the picture element and the text position of each text element.

In one possible implementation, the method further comprises: if the picture title corresponding to the picture element is empty, determining a second associa