CN-122019748-A - Data processing method and device suitable for retrieval enhancement generation

Abstract

Embodiments of this specification provide a data processing method and device suitable for retrieval-augmented generation (RAG). The method structures documents of various kinds into Markdown-format text, so that the result suits a variety of slicing modes as well as long texts. Specifically, a large language model first extracts the structure of the text to obtain initial structure information for the whole text; a large language model then refines that structure information, distinguishing the hierarchy levels more finely, to obtain corrected document structure information, referred to as target structure information. The target structure information is mapped to the corresponding positions in the initial text to obtain a finely structured text, which conventional techniques can then slice in various slicing modes. In this way, the risk of large language model hallucination is reduced, as are cost and processing time.
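
As an illustration of why a Markdown-structured text is convenient for slicing, the following is a minimal sketch of heading-based slicing. It is not taken from the specification: the function name `slice_by_heading`, the regular expression, and the sample document are assumptions made for this example, and the sketch ignores fenced code blocks and other Markdown constructs.

```python
import re
from typing import List, Tuple

# ATX-style Markdown heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def slice_by_heading(markdown: str) -> List[Tuple[int, str, str]]:
    """Split Markdown text into (level, title, body) chunks at each heading.

    Hypothetical helper for illustration; text before the first heading
    is returned as a chunk with level 0 and an empty title.
    """
    chunks: List[Tuple[int, str, str]] = []
    level, title, body = 0, "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the chunk collected so far, if it holds anything.
            if title or any(s.strip() for s in body):
                chunks.append((level, title, "\n".join(body).strip()))
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    if title or any(s.strip() for s in body):
        chunks.append((level, title, "\n".join(body).strip()))
    return chunks


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
chunks = slice_by_heading(doc)
for level, title, body in chunks:
    print(level, title, "->", body)
```

Each chunk keeps its heading level, so a downstream indexer could, for example, attach the parent titles to a chunk as retrieval context.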

Inventors

LIU JIAWEI
ZHU YAGUANG
ZHENG YANJUN
QI XIANG
ZHANG PENG

Assignees

蚂蚁区块链科技(上海)有限公司

Dates

Publication Date: 20260512
Application Date: 20260106

Claims (10)

1. A data processing method suitable for retrieval-augmented generation, for converting text used for retrieval augmentation into Markdown-format text, the method comprising: acquiring a first text to be processed; extracting a text structure of the first text by using a structure extraction large model to obtain initial structure information, wherein the initial structure information comprises titles and hierarchy identifiers in Markdown format; performing hierarchical relationship alignment on the initial structure information through an alignment large model to obtain target structure information that corrects the initial structure information; and mapping the target structure information into the first text to obtain a first structured text in Markdown format, for text slicing in a predetermined manner.
2. The method of claim 1, wherein the structure extraction large model corresponds to a length window describing the maximum length of text that the structure extraction large model processes at one time; and, where the number of bytes of the first text is greater than the length window, extracting the text structure of the first text by using the structure extraction large model to obtain the initial structure information comprises: splitting the first text into a plurality of fragments according to the length window, wherein the number of bytes of each fragment does not exceed the length window; extracting, by using the structure extraction large model, the structure information of the sub-document corresponding to each fragment; and merging the structure information of the sub-documents to obtain the initial structure information of the first text.
3. The method of claim 1, wherein the hierarchical relationship alignment comprises aligning titles that satisfy a predetermined rule to the same hierarchy level, the predetermined rule comprising at least one of: having a semantic similarity greater than a predetermined similarity threshold, containing the same keyword, and containing the same sentence pattern.
4. The method of claim 1, wherein the structure extraction large model and the alignment large model are the same large language model.
5. The method of claim 1, wherein the target structure information comprises a plurality of structures, a single structure comprising a single title and a corresponding hierarchy identifier, and mapping the target structure information into the first text to obtain the first structured text in Markdown format comprises: performing sentence segmentation on the first text to obtain sentences; and matching each title in the plurality of structures, in order, against the sentences to obtain the target sentence matched by each title, and replacing each matched target sentence with the corresponding structure to obtain the first structured text.
6. The method of claim 5, wherein the plurality of structures includes a first structure corresponding to a first title, and matching each title in the plurality of structures, in order, against the sentences to obtain the target sentence matched by each title comprises: where the first structure is the initial structure in the sequence, matching the first title against the sentences in order, starting from the first sentence, until a matching first target sentence is obtained; and where the first structure is not the initial structure in the sequence, matching the first title against the sentences in order, starting from the sentence following the target sentence matched by the preceding structure, until a matching first target sentence is obtained.
7. The method of claim 5 or 6, wherein a single title is matched against the sentences by: calculating the similarity between the title string of the single title and each sentence string by at least one of edit distance ratio, longest common substring, and scoring by a predetermined model; and determining the sentence with the greatest similarity as the single target sentence matched by the single title.
8. The method of claim 7, wherein, where a single title matches a single sentence, the longest common substring of the single sentence and the single title is taken as the target sentence, and the characters following the longest common substring are taken as the next sentence for matching against the next title.
9. The method of claim 1, wherein the predetermined manner is structured slicing according to logical-structure slicing rules based on at least one of title, chapter, table, and list.
10. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-9.
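
The mapping step of claims 5 to 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `split_sentences`, `similarity`, `map_structure`, and the `threshold` parameter are hypothetical names introduced here, the sentence splitter is deliberately naive, and Python's `difflib.SequenceMatcher.ratio()` is used as a stand-in for the edit distance ratio named in claim 7.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence segmentation: split on newlines and after ., !, ? or CJK stops."""
    parts = re.split(r"(?<=[.!?。！？])\s+|\n+", text)
    return [p.strip() for p in parts if p and p.strip()]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; not a true edit distance ratio."""
    return SequenceMatcher(None, a, b).ratio()


def map_structure(text: str, structures: List[Tuple[str, int]],
                  threshold: float = 0.8) -> str:
    """Replace the sentence that best matches each title with a Markdown heading.

    `structures` is an ordered list of (title, level) pairs. The search for
    each title starts after the sentence matched by the previous title
    (claim 6). `threshold` is an assumption of this sketch: a title whose
    best score falls below it is skipped rather than forced onto a sentence.
    """
    sentences = split_sentences(text)
    cursor = 0  # index of the first sentence still eligible for matching
    for title, level in structures:
        best_idx, best_score = -1, 0.0
        for i in range(cursor, len(sentences)):
            score = similarity(title, sentences[i])
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx >= 0 and best_score >= threshold:
            sentences[best_idx] = "#" * level + " " + title
            cursor = best_idx + 1
    return "\n".join(sentences)


text = ("Overview\nThis tool parses files. It is fast.\n"
        "Install steps\nRun the installer.")
structures = [("Overview", 1), ("Install Steps", 2)]
structured = map_structure(text, structures)
print(structured)
```

Advancing the cursor after each match keeps the headings in document order and shrinks the search space for later titles. The sketch does not cover the substring handling of claim 8.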

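The fragment splitting of claim 2 can be sketched in the same spirit. The claim does not say where fragment boundaries fall or which encoding defines the byte count, so this sketch assumes UTF-8 and cuts only between characters, so that no multi-byte character is split. The function name `split_by_byte_window` is hypothetical.

```python
from typing import List


def split_by_byte_window(text: str, window: int) -> List[str]:
    """Split text into fragments whose UTF-8 size does not exceed `window` bytes.

    Boundaries fall between characters. Cutting at sentence or paragraph
    boundaries instead would be a natural refinement, but it is not shown here.
    """
    if window < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("window must be at least 4 bytes")
    fragments: List[str] = []
    current: List[str] = []
    size = 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if current and size + n > window:
            fragments.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        fragments.append("".join(current))
    return fragments


fragments = split_by_byte_window("结构化文本abc", window=7)
print(fragments)
```

Each fragment would then be passed to the structure extraction model separately, and the per-fragment structure information merged in fragment order.
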
Description

Data processing method and device suitable for retrieval enhancement generation

Technical Field

One or more embodiments of this specification relate to the field of computer technology, and more particularly to a data processing method and apparatus suitable for retrieval-augmented generation.

Background

With the rapid development of natural language processing (NLP) technology, generative language models (such as GPT and LLaMA) have shown strong capabilities in text generation, question answering, dialogue interaction, and other fields. However, these models have inherent limitations. For example, their knowledge depends entirely on their training data, so they cannot obtain the latest information in real time; the content they generate may be inaccurate, and they may even "hallucinate" (that is, generate content that appears reasonable but is not factual).

To address these issues, researchers have proposed the RAG (Retrieval-Augmented Generation) architecture. The core idea of RAG is to combine an external knowledge base with a generative model: a retrieval module obtains relevant texts or fragments from the external knowledge base, and a generation module then produces the final output, augmented by this additional information. This approach improves the accuracy and timeliness of the generated content and also enhances the model's performance on domain-specific tasks. For example, in vertical fields such as medicine, law, or scientific research, RAG can generate more trustworthy answers by referencing authoritative data.

The quality of the knowledge supplied is critical when building a RAG system. In practice, the original input files may come from a variety of sources, such as PDF (Portable Document Format) documents, scanned documents, pictures, Word files, or other unstructured formats.
However, the structural information of a text is critical to the performance of a RAG system, because it provides the system with semantic and logical context cues. The structural information of an optimized text (e.g., titles, paragraphs, chapters) can be used to segment the text into smaller semantic units (chunks), which facilitates index construction, and a good text structure can significantly improve the performance of the retrieval module. Such documents often contain complex typesetting and embedded images or tables, and using them directly can make information extraction difficult, lower retrieval efficiency, and even cause parsing errors. If various models (such as OCR models, ASR speech recognition models, and layout detection models) are used to transcribe the files, the text structure may be lost, incomplete, or wrong, owing to the limitations of the models or to the original file content being unstructured.

Disclosure of Invention

One or more embodiments of this specification describe a data processing method and apparatus suitable for retrieval-augmented generation, to solve one or more of the problems noted in the Background.
According to a first aspect, there is provided a data processing method suitable for retrieval-augmented generation, for converting text used for retrieval augmentation into Markdown-format text, the method comprising: obtaining a first text to be processed; extracting a text structure of the first text by using a structure extraction large model to obtain initial structure information, wherein the initial structure information comprises titles and hierarchy identifiers in Markdown format; performing hierarchical relationship alignment on the initial structure information by using an alignment large model to obtain target structure information that corrects the initial structure information; and mapping the target structure information into the first text to obtain a first structured text in Markdown format, for text slicing in a predetermined manner.

In one embodiment, the structure extraction large model corresponds to a length window describing the maximum length of text that the structure extraction large model processes at one time; where the number of bytes of the first text is greater than the length window, extracting the text structure of the first text by using the structure extraction large model to obtain the initial structure information comprises: splitting the first text into a plurality of fragments according to the length window, wherein the number of bytes of a single fragment does not exceed the length window; extracting, by using the structure extraction large model, the structure information of the sub-document corresponding to each fragment; and merging the structure information of the sub-documents to obtain the initial structure information of the first text.

In one embodiment, the hierarchical relationship alignment includes aligning titles that satisfy a