CN-116306527-B - Text processing method, model training method, device, equipment and storage medium

Abstract

The application provides a text processing method, a model training method, a device, equipment and a storage medium, and relates to the technical field of neural networks. The text processing model is trained on training sample texts to which separation marks have been added. Each training sample text is annotated with the label information and the position information of its separation marks, where the label information of a separation mark indicates whether the texts at the position of that separation mark need to be combined. Because the label information is generated from the real semantics of the text at the position of the separation mark in the training sample text, the label accuracy is higher, and the trained text processing model can therefore combine the target processing text accurately. In addition, the training sample texts can be obtained by concatenating a plurality of lines of text, so the trained text processing model is suitable for merging a plurality of lines of text, which improves the efficiency of multi-line text merging.

Inventors

Yang Daicong
Li Xiaoping
Gu Wenbin
Sun Yong
Liu Zhiqiang

Assignees

杭州恒生聚源信息技术有限公司
上海恒生聚源数据服务有限公司

Dates

Publication Date: 20260508
Application Date: 20221212

Claims (15)

1. A text processing method, comprising: reading the text of at least one cell in a file to be processed; adding a separation mark to the text of the at least one cell to obtain a target processing text; inputting the target processing text into a pre-trained text processing model, identifying whether the texts segmented by each separation mark in the target processing text need to be combined, and combining the target processing text according to the identification result to obtain at least one target text; wherein the text processing model is trained using training sample text with marking information, the marking information comprises label information of the separation marks added to the training sample text and the positions of the separation marks, the label information of a separation mark indicates whether the texts at the position of the separation mark need to be combined, and the label information is generated based on the real semantics of the text at the position of the separation mark in the training sample text; wherein the training step of the text processing model comprises: inputting the training sample text with marking information into an initial text processing model; predicting, by the initial text processing model, a prediction result for each separation mark according to the training sample text and the positions of the separation marks in the training sample text, the prediction result indicating the probability that the texts at the position of the separation mark do not need to be combined; calculating loss information of the initial text processing model according to the prediction result of each separation mark and the label information of each separation mark; and iteratively correcting network parameters of the initial text processing model to obtain the text processing model; and wherein identifying whether the texts segmented by each separation mark in the target processing text need to be combined comprises: identifying, by the text processing model, the text combining mode indicated by each separation mark in the target processing text, and determining whether the texts segmented by each separation mark need to be combined according to the text combining mode, wherein the text combining mode indicated by a separation mark is either that the texts need to be combined or that the texts do not need to be combined.
2. The method of claim 1, wherein adding a separation mark to the text of the at least one cell to obtain a target processing text comprises: adding a separation mark between the texts of each pair of adjacent cells to obtain the target processing text.
3. The method of claim 1, wherein adding a separation mark to the text of the at least one cell to obtain a target processing text comprises: adding separation marks between the texts of each pair of adjacent cells, and adding separation marks within the text of each cell, to obtain the target processing text.
4. The method of claim 3, wherein adding separation marks within the text of each cell comprises: inserting a separation mark at at least one random position of the text in each cell to obtain the target processing text.
5. The method of claim 3, wherein adding separation marks within the text of each cell comprises: performing word segmentation on the text in each cell to obtain a word segmentation result; determining at least one complete word in the text in the cell according to the word segmentation result; determining at least one target word from the at least one complete word; and adding a separation mark within each target word.
6. A method of training a text processing model, the method comprising: collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set, wherein the first sample training text set comprises a plurality of first training sample texts, each first training sample text is provided with marking information, the marking information comprises label information of the separation marks added to the first training sample text and the positions of the separation marks, and the label information of a separation mark indicates whether the texts at the position of the separation mark need to be combined; and training with the first sample training text set to obtain the text processing model, wherein the training step of the text processing model comprises: inputting each first training sample text with marking information into an initial text processing model; predicting, by the initial text processing model, a prediction result for each separation mark in each first training sample text according to the first training sample text and the position of each separation mark in it, the prediction result indicating the probability that the texts at the position of the separation mark do not need to be combined; calculating loss information of the initial text processing model according to the prediction result of each separation mark in each first training sample text and the label information of each separation mark; and iteratively correcting network parameters of the initial text processing model to obtain the text processing model.
7. The method of claim 6, wherein collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set comprises: extracting a plurality of first initial sample texts from at least one sample file with a preset format, wherein each first initial sample text comprises the text of at least one cell in the sample file; denoising each first initial sample text by deleting the non-text characters in it, to obtain a first preprocessed sample text corresponding to each first initial sample text; adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text; and obtaining the first sample training text set from the first training sample texts.
8. The method of claim 6, wherein training with the first sample training text set to obtain the text processing model comprises: acquiring a second sample training text set corresponding to a target field, wherein the marking information of each second training sample text in the second sample training text set is annotated by a user; and training with the first sample training text set and the second sample training text set to obtain the text processing model.
9. The method of claim 7, wherein extracting a plurality of first initial sample texts from at least one sample file with a preset format comprises: sequentially extracting whole columns of cell text, column by column, from a bordered (ruled-line) table in at least one sample file with a preset format, and concatenating the texts in order as a first initial sample text.
10. The method of claim 7, wherein denoising each first initial sample text and deleting the non-text characters in each first initial sample text to obtain a first preprocessed sample text corresponding to each first initial sample text comprises: performing full-width/half-width character conversion on the first initial sample text, and deleting the non-text characters in the first initial sample text to obtain the first preprocessed sample text corresponding to the first initial sample text, wherein the non-text characters comprise preset separators, spaces, hypertext markup language (HTML) tags and garbled Chinese characters.
11. The method of claim 7, wherein adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain the first training sample text comprises: if the character length of the first preprocessed sample text after the current separation mark is inserted meets a preset length, or the number of inserted separation marks meets a preset number, deleting text of the preset length from the first preprocessed sample text to obtain the first training sample text.
12. A text processing device, characterized by comprising a reading module, a marking module and a processing module; the reading module is used for reading the text of at least one cell in a file to be processed; the marking module is used for adding a separation mark to the text of the at least one cell to obtain a target processing text; the processing module is used for inputting the target processing text into a pre-trained text processing model, identifying whether the texts segmented by each separation mark in the target processing text need to be combined, and combining the target processing text according to the identification result to obtain at least one target text; wherein the text processing model is trained using training sample text with marking information, the marking information comprises label information of the separation marks added to the training sample text and the positions of the separation marks, the label information of a separation mark indicates whether the texts at the position of the separation mark need to be combined, and the label information is generated based on the real semantics of the text at the position of the separation mark in the training sample text; wherein the training step of the text processing model comprises: inputting the training sample text with marking information into an initial text processing model; predicting, by the initial text processing model, a prediction result for each separation mark according to the training sample text and the positions of the separation marks in the training sample text, the prediction result indicating the probability that the texts at the position of the separation mark do not need to be combined; calculating loss information of the initial text processing model according to the prediction result of each separation mark and the label information of each separation mark; and iteratively correcting network parameters of the initial text processing model to obtain the text processing model; and wherein identifying whether the texts segmented by each separation mark in the target processing text need to be combined comprises: identifying, by the text processing model, the text combining mode indicated by each separation mark in the target processing text, and determining whether the texts segmented by each separation mark need to be combined according to the text combining mode, wherein the text combining mode indicated by a separation mark is either that the texts need to be combined or that the texts do not need to be combined.
13. A text processing model training device, characterized by comprising an acquisition module and a training module; the acquisition module is used for collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set, wherein the first sample training text set comprises a plurality of first training sample texts, each first training sample text is provided with marking information, the marking information comprises label information of the separation marks added to the first training sample text and the positions of the separation marks, and the label information of a separation mark indicates whether the texts at the position of the separation mark need to be combined; the training module is used for training with the first sample training text set to obtain a text processing model, wherein the training step of the text processing model comprises: inputting each first training sample text with marking information into an initial text processing model; predicting, by the initial text processing model, a prediction result for each separation mark in each first training sample text according to the first training sample text and the position of each separation mark in it, the prediction result indicating the probability that the texts at the position of the separation mark do not need to be combined; calculating loss information of the initial text processing model according to the prediction result of each separation mark in each first training sample text and the label information of each separation mark; and iteratively correcting network parameters of the initial text processing model to obtain the text processing model.
14. An electronic device comprising a processor, a storage medium and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is in operation, and the processor executing the program instructions to perform the steps of the text processing method of any one of claims 1 to 5 or the steps of the text processing model training method of any one of claims 6 to 11.
15. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the text processing method according to any one of claims 1 to 5 or the steps of the text processing model training method according to any one of claims 6 to 11.
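
The training step recited in claims 6 and 13 can be illustrated with a minimal sketch. It assumes details the claims do not specify: a hypothetical separator token `[SEP]`, character-level tokens, a label encoding in which 1 means the texts on either side of a separation mark stay apart and 0 means they are combined, and binary cross-entropy as the "loss information". The probabilities are hand-written stand-ins for model output, not results from a trained model.

```python
import math

SEP = "[SEP]"  # hypothetical separator token; the claims do not name one


def build_sample(cells, keep_apart):
    """Concatenate cell texts with a separator between adjacent cells.

    keep_apart[i] is 1 if cells[i] and cells[i + 1] should stay separate
    and 0 if they should be combined (an assumed label encoding).
    Returns the tokens, the separator positions, and the labels.
    """
    if len(keep_apart) != len(cells) - 1:
        raise ValueError("need one label per pair of adjacent cells")
    tokens, positions = [], []
    for i, cell in enumerate(cells):
        tokens.extend(cell)  # character-level tokens (assumption)
        if i < len(cells) - 1:
            positions.append(len(tokens))  # index where the separator lands
            tokens.append(SEP)
    return tokens, positions, list(keep_apart)


def separator_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy, computed over separator positions only."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


tokens, positions, labels = build_sample(["net pro", "fit", "revenue"], [0, 1])
# Stand-in for model output: probability that each separator's texts stay apart.
probs = [0.1, 0.9]
loss = separator_loss(probs, labels)
```

In an actual training loop, the loss would be back-propagated to iteratively correct the network parameters; that part depends on the model architecture, which the claims leave open.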

Description

Text processing method, model training method, device, equipment and storage medium

Technical Field

The application relates to the technical field of neural networks, in particular to a text processing method, a model training method, a device, equipment and a storage medium.

Background

In text processing, complex table merging problems are often encountered. In particular, for text that spans pages and for text in borderless tables, the merging relation of the text cannot be determined simply from information such as the spacing between adjacent lines and the indentation.
In the prior art, whether two texts should be combined is generally treated as a binary classification task: a plurality of lines of text are compared pairwise in sequence, whether each pair of texts should be combined is judged separately, and the final combination result is obtained from these judgments. The text merging efficiency of this approach is therefore low.

Disclosure of Invention

The application aims to provide a text processing method, a model training method, a device, equipment and a storage medium in view of the defects in the prior art, so as to solve the problem of low text merging efficiency in the prior art.
In order to achieve the above purpose, the technical scheme adopted by the embodiments of the application is as follows:
In a first aspect, an embodiment of the present application provides a text processing method, including: reading the text of at least one cell in a file to be processed; adding a separation mark to the text of the at least one cell to obtain a target processing text; and inputting the target processing text into a pre-trained text processing model, identifying whether the texts segmented by each separation mark in the target processing text need to be combined, and combining the target processing text according to the identification result to obtain at least one target text. The text processing model is trained using training sample text with marking information, wherein the marking information comprises label information of the separation marks added to the training sample text and the positions of the separation marks, and the label information is generated based on the real semantics of the text at the position of the separation mark in the training sample text.
Optionally, adding a separation mark to the text of the at least one cell to obtain the target processing text includes: adding a separation mark between the texts of each pair of adjacent cells to obtain the target processing text.
Optionally, adding a separation mark to the text of the at least one cell to obtain the target processing text includes: adding separation marks between the texts of each pair of adjacent cells, and adding separation marks within the text of each cell, to obtain the target processing text.
Optionally, adding separation marks within the text of each cell includes: inserting a separation mark at at least one random position of the text in each cell to obtain the target processing text.
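
The first-aspect flow above (add separation marks between adjacent cells, have the model judge each mark, then combine) can be sketched as follows. The separator token `[SEP]`, the function names, and the 0.5 decision threshold are illustrative assumptions, and the per-separator probabilities are supplied by hand in place of a trained text processing model.

```python
SEP = "[SEP]"  # hypothetical separation mark; the source does not name a token


def add_separators(cells):
    """Join cell texts with a separation mark between each adjacent pair."""
    return SEP.join(cells)


def merge_cells(cells, keep_apart_probs, threshold=0.5):
    """Combine adjacent cells whose separation mark is judged as 'combine'.

    keep_apart_probs[i] is the assumed model output for the mark between
    cells[i] and cells[i + 1]: the probability that the two texts do not
    need to be combined. The threshold is an assumption, not from the source.
    """
    if not cells:
        return []
    if len(keep_apart_probs) != len(cells) - 1:
        raise ValueError("need one probability per separation mark")
    merged = [cells[0]]
    for cell, prob in zip(cells[1:], keep_apart_probs):
        if prob >= threshold:
            merged.append(cell)   # keep as a separate target text
        else:
            merged[-1] += cell    # combine with the previous target text
    return merged


cells = ["reve", "nue", "profit"]
target_text = add_separators(cells)
result = merge_cells(cells, [0.2, 0.95])
```

Because all separation marks in the target processing text are judged in one model call, the number of model invocations does not grow with the number of adjacent cell pairs, which is the efficiency gain the application claims over pairwise comparison.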
Optionally, adding separation marks within the text of each cell includes: performing word segmentation on the text in each cell to obtain a word segmentation result; determining at least one complete word in the text in the cell according to the word segmentation result; determining at least one target word from the at least one complete word; and adding a separation mark within each target word.
In a second aspect, an embodiment of the present application provides a text processing model training method, including: collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set, wherein the first sample training text set comprises a plurality of first training sample texts, each first training sample text is provided with marking information, the marking information comprises label information of the separation marks added to the first training sample text and the positions of the separation marks, and the label information is generated based on the real semantics of the text at the position of the separation mark in the first training sample text; and training with the first sample training text set to obtain a text processing model.
Optionally, collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set includes: extracting a plurality of first initial sample texts from at least one sample file with a preset format, wherein each first initial sample text comprises the text of at least one cell in the sample file; denoising each first initial sample text by deleting the non-text characters in it, to obtain a first preprocessed sample text corresponding to each first initial sample text; Adding a separation mark to the text of at least one cell