CN-122019788-A - Text processing method, device, computer equipment and storage medium


Abstract

The invention relates to the field of text processing and discloses a text processing method, a device, computer equipment and a storage medium, wherein the method comprises the steps of obtaining at least one target text vector from a vector database according to a query request, wherein the vector database comprises a plurality of text vectors, and the text vectors are obtained according to text slicing; acquiring a target text slice corresponding to the target text vector, and a preamble text slice and a postamble text slice adjacent to the target text slice; and constructing the restored document content according to the target text slice, the preamble text slice and the postamble text slice. The invention solves the problem of incomplete semantics caused by text slicing in the prior art.

Inventors

LI XIANQIANG
LIU ZHIXIA

Assignees

Zhejiang Zeekr Intelligent Technology Co., Ltd. (浙江极氪智能科技有限公司)
Zhejiang Geely Holding Group Co., Ltd. (浙江吉利控股集团有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-26

Claims (10)

1. A text processing method, the method comprising: obtaining at least one target text vector from a vector database according to a query request, wherein the vector database comprises a plurality of text vectors, and the text vectors are obtained from text slices; acquiring a target text slice corresponding to the target text vector, and a preamble text slice and a postamble text slice adjacent to the target text slice; and constructing restored document content according to the target text slice, the preamble text slice and the postamble text slice.
2. The method of claim 1, wherein prior to obtaining the plurality of text vectors from the vector database according to the query request, the method further comprises: acquiring original document content and a corresponding document identifier; dividing the original document content according to a preset segmentation rule to obtain a plurality of text slices; generating a sequence identifier associated with each text slice based on the segmentation order of the text slices, and constructing a corresponding text vector according to semantic features of each text slice; and storing each text slice, its sequence identifier and the document identifier in association in a slice relation table, and storing the text vector corresponding to each text slice, the sequence identifier and the document identifier in association in the vector database.
3. The method of claim 1, wherein the obtaining at least one target text vector from a vector database according to the query request comprises: parsing a retrieval semantic vector from the query request; calculating the similarity between each text vector in the vector database and the retrieval semantic vector; and acquiring at least one target text vector from the vector database based on the similarity.
4. The method of claim 2, wherein the acquiring the target text slice corresponding to the target text vector, and the preamble text slice and the postamble text slice adjacent to the target text slice, comprises: acquiring a target document identifier and a target sequence identifier associated with the target text vector; querying the corresponding target text slice from the slice relation table using the target document identifier and the target sequence identifier; querying, in the slice relation table, a preamble slice sequence and a postamble slice sequence corresponding to the target text slice; and extracting a first number of preamble text slices backward from the preamble slice sequence and a second number of postamble text slices forward from the postamble slice sequence.
5. The method according to claim 4, wherein the first number and/or the second number is determined based on at least one of: receiving a first number specified by a user for preamble slices and a second number specified for postamble slices; determining a fragment number requirement according to the semantic integrity of the target text slice, and calculating a first number for preamble slices and a second number for postamble slices based on the fragment number requirement; calculating a first number for preamble slices and a second number for postamble slices based on a document length of the original document content and a preset scaling parameter.
6. The method of claim 1, wherein the constructing restored document content according to the target text slice, the preamble text slice, and the postamble text slice comprises: de-duplicating the preamble text slice, the target text slice and the postamble text slice to obtain de-duplicated text slices; sorting the de-duplicated text slices according to the sequence identifiers to generate continuous slice content; and reconstructing a document structure based on the original segmentation positions of the continuous slice content to generate the restored document content.
7. The method of claim 6, wherein the reconstructing a document structure based on the original segmentation positions of the continuous slice content to generate the restored document content comprises: identifying the semantic consistency at the original segmentation positions in the continuous slice content; and adding structural elements to the continuous slice content based on the semantic consistency to generate the restored document content.
8. A text processing apparatus, the apparatus comprising: a first acquisition module configured to obtain at least one target text vector from a vector database according to a query request, wherein the vector database comprises a plurality of text vectors, and the text vectors are obtained from text slices; a second acquisition module configured to acquire a target text slice corresponding to the target text vector, and a preamble text slice and a postamble text slice adjacent to the target text slice; and a construction module configured to construct restored document content according to the target text slice, the preamble text slice and the postamble text slice.
9. A computer device, comprising: a memory and a processor in communication with each other, the memory having stored therein computer instructions which, upon execution, cause the processor to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
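
By way of non-limiting illustration only, the neighbour extraction of claim 4 and the de-duplication and ordering of claim 6 might be sketched in Python as follows. The slice relation table is assumed here to be a plain dictionary keyed by (document identifier, sequence identifier), and the names `expand` and `restore` are hypothetical; the claims do not prescribe any particular data structure, and the structural reconstruction of claim 7 is not attempted.

```python
def expand(slice_table, doc_id, target_seq, before, after):
    """Return (seq, text) pairs for the target slice plus up to `before`
    preamble slices and `after` postamble slices, in document order.
    Sequence identifiers are assumed to be consecutive integers from 0."""
    first = max(0, target_seq - before)
    last = target_seq + after
    return [
        (seq, slice_table[(doc_id, seq)])
        for seq in range(first, last + 1)
        if (doc_id, seq) in slice_table  # skip positions past the document end
    ]


def restore(slice_table, doc_id, target_seqs, before, after):
    """Merge the windows around several target slices of one document.
    Slices are de-duplicated by sequence identifier, sorted, and joined
    into one string per run of consecutive sequence identifiers."""
    kept = {}
    for target_seq in target_seqs:
        for seq, text in expand(slice_table, doc_id, target_seq, before, after):
            kept[seq] = text  # overlapping windows collapse onto the same key

    runs, current, previous = [], [], None
    for seq in sorted(kept):
        if previous is not None and seq != previous + 1:
            runs.append("".join(current))  # a gap starts a new continuous run
            current = []
        current.append(kept[seq])
        previous = seq
    if current:
        runs.append("".join(current))
    return runs
```

Here `before` and `after` play the role of the first number and the second number; in this sketch they are simply supplied by the caller, which corresponds to the user-specified option of claim 5.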

Description

Text processing method, device, computer equipment and storage medium

Technical Field

The present invention relates to the field of text processing, and in particular to a text processing method, apparatus, computer device, and storage medium.

Background

In large model application practice, retrieval-augmented generation (RAG) is an important means by which large models obtain dynamic information. Because the context window of a large model is limited, a long corpus usually has to be split into smaller semantic segments of a certain length, and a subset of those segments is then selected, according to their semantic relevance to the user query, for the large model to process. However, text slicing can cause serious semantic breaks: it destroys the continuity of the original text and degrades both the large model's understanding of the overall semantics and the quality of what it generates.

To address semantic breaks, the prior art mainly adopts two schemes: retaining overlapping content between paragraphs when the text is sliced, and splitting the text semantically by table-of-contents chapter. However, retaining overlapping content increases the length and semantic complexity of the segments, which not only affects the accuracy of vector retrieval but also increases the understanding difficulty and segment consumption of the large model. The chapter-based splitting method is of limited use for documents that lack a clear table-of-contents structure or whose chapters exceed the processing capacity of the vector model, and it still cannot avoid the problem of semantic segmentation.

Disclosure of Invention

In view of the above, the embodiments of the present invention provide a text processing method, apparatus, computer device, and storage medium, so as to solve the problem of incomplete semantics caused by text slicing in the prior art.
In a first aspect, an embodiment of the present invention provides a text processing method, where the method includes: obtaining at least one target text vector from a vector database according to a query request, wherein the vector database comprises a plurality of text vectors, and the text vectors are obtained from text slices; acquiring a target text slice corresponding to the target text vector, and a preamble text slice and a postamble text slice adjacent to the target text slice; and constructing restored document content according to the target text slice, the preamble text slice and the postamble text slice.
Further, before obtaining the plurality of text vectors from the vector database according to the query request, the method further comprises: acquiring original document content and a corresponding document identifier; dividing the original document content according to a preset segmentation rule to obtain a plurality of text slices; generating a sequence identifier associated with each text slice based on the segmentation order of the text slices, and constructing a corresponding text vector according to semantic features of each text slice; and storing each text slice, its sequence identifier and the document identifier in association in a slice relation table, and storing the text vector corresponding to each text slice, the sequence identifier and the document identifier in association in the vector database.
Further, the obtaining at least one target text vector from the vector database according to the query request includes: parsing a retrieval semantic vector from the query request; calculating the similarity between each text vector in the vector database and the retrieval semantic vector; and acquiring at least one target text vector from the vector database based on the similarity.
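
As a minimal sketch only, the indexing and similarity retrieval steps described above might look as follows in Python. Several things here are assumptions made for illustration and are not taken from the disclosure: the "preset segmentation rule" is a fixed number of words per slice, the text vector is a toy bag-of-words count (a real system would call an embedding model), the similarity measure is cosine similarity, and the slice relation table and vector database are an in-memory dictionary and list. All function names are hypothetical.

```python
import math
from collections import Counter


def split_text(text, words_per_slice):
    """Assumed segmentation rule: a fixed number of words per slice."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_slice])
        for i in range(0, len(words), words_per_slice)
    ]


def embed(text):
    """Toy stand-in for an embedding model: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def index_document(doc_id, text, words_per_slice, slice_table, vector_db):
    """Store each slice and its vector under (document id, sequence id)."""
    for seq, piece in enumerate(split_text(text, words_per_slice)):
        slice_table[(doc_id, seq)] = piece
        vector_db.append({"doc_id": doc_id, "seq": seq, "vector": embed(piece)})


def search(query, vector_db, top_k=1):
    """Return the top_k stored records most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        vector_db,
        key=lambda record: cosine(record["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]
```

Each record returned by `search` carries the document identifier and sequence identifier of its slice, which is what allows the target text slice and its neighbours to be looked up in the slice relation table afterwards.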
Further, the acquiring the target text slice corresponding to the target text vector, and the preamble text slice and the postamble text slice adjacent to the target text slice includes: acquiring a target document identifier and a target sequence identifier associated with the target text vector; querying the corresponding target text slice from the slice relation table using the target document identifier and the target sequence identifier; querying, in the slice relation table, a preamble slice sequence and a postamble slice sequence corresponding to the target text slice; and extracting a first number of preamble text slices backward from the preamble slice sequence and a second number of postamble text slices forward from the postamble slice sequence.
Further, the first number and/or the second number is determined based on at least one of: receiving a first number specified by a user for preamble slices and a second number specified for postamble slices; determining a fragment number requirement according to the semantic integrity of the target text slice, and calculating a fir