CN-121997897-A - Corpus processing method and device

Abstract

An embodiment of the application provides a corpus processing method and device in the technical field of artificial intelligence. The method comprises: storing the content of an original corpus according to a storage specification and entity identifiers to obtain one or more first storage entities, each corresponding to a first identifier; performing corpus preprocessing based on each first storage entity to obtain one or more second storage entities respectively corresponding to different second identifiers, and recording a first association relation between each of the second identifiers and the first identifier of the corresponding first storage entity, wherein each second storage entity indicates content of a pre-training corpus corresponding to the original corpus; and, when the stored content of a first corpus (the original corpus or the pre-training corpus) is changed, updating, according to the first association relation, the content of the original corpus or of the pre-training corpus that is associated with the changed content. The scheme of the application can improve corpus quality.

Inventors

YAO YING
ZHANG BO
YANG XIANGFENG
SUN GAOJIE
CHEN TAO
Deer Wild
ZHANG YONG
XUE JINGQIAN

Assignees

Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date: 2026-05-08
Application Date: 2024-11-05

Claims (20)

1. A corpus processing method, the method comprising: storing the content of an original corpus according to a storage specification and entity identifiers to obtain one or more first storage entities, wherein each first storage entity corresponds to a first identifier; performing corpus preprocessing based on each first storage entity to obtain one or more second storage entities respectively corresponding to different second identifiers, and recording a first association relation between each of the different second identifiers and the first identifier of the corresponding first storage entity, wherein each of the one or more second storage entities indicates content of a pre-training corpus corresponding to the original corpus; and, in the case that the stored content of a first corpus is changed, updating, according to the first association relation, the content of the original corpus or the content of the pre-training corpus that is associated with the changed content in the first corpus, wherein the first corpus comprises the original corpus or the pre-training corpus.
2. The method of claim 1, wherein the content of the original corpus comprises text content, and the storage specification comprises one or more editable text blocks; storing the content of the original corpus according to the storage specification and the entity identifiers to obtain one or more first storage entities, each first storage entity corresponding to a first identifier, comprises: dividing the content of the original corpus according to the storage specification, and extracting the divided content to obtain a plurality of storage entities; and marking each of the plurality of storage entities with a different entity identifier to obtain a plurality of first storage entities respectively corresponding to different first identifiers; and performing corpus preprocessing based on each first storage entity to obtain one or more second storage entities respectively corresponding to different second identifiers comprises: performing corpus extraction and corpus cleaning on each of the plurality of first storage entities to obtain a second storage entity corresponding to that first storage entity; and marking each second storage entity with a different second identifier to obtain a plurality of second storage entities respectively corresponding to the different second identifiers.
3. The method according to claim 2, wherein, in the case that the stored content of the first corpus is changed, updating, according to the first association relation, the content of the original corpus or the content of the pre-training corpus associated with the changed content in the first corpus comprises: in the case that the stored content of the first corpus is changed, determining, according to the editable text block constituting the changed content, the first identifier of a first updated storage entity whose content has changed, wherein the first corpus comprises the original corpus; performing corpus extraction and corpus cleaning on the first updated storage entity to obtain a second updated storage entity corresponding to the first identifier of the first updated storage entity; and updating, according to the first association relation, the second storage entity marked by the second identifier corresponding to the first identifier of the first updated storage entity to the second updated storage entity.
4. The method according to claim 2, wherein in the case of the stored content of the first corpus being changed, updating the content of the original corpus or the content of the pre-trained corpus associated with the changed content in the first corpus according to the first association relation, includes: under the condition that the content of the stored first corpus is changed, determining a second identification of a second updated storage entity of the content change according to an editable text block forming the changed content, wherein the first corpus comprises the pre-training corpus; and updating the editable text block in the second storage entity marked by the first identifier corresponding to the second identifier of the second updating storage entity into the editable text block in the second updating storage entity according to the first association relation.
5. The method according to any one of claims 1 to 4, wherein after the recording of the first association relation between each of the different second identifiers and the first identifier of the corresponding first storage entity, the method further comprises: outputting, according to the first association relation, the stored content of the original corpus and the content of the pre-training corpus corresponding to the original corpus to a visualization device for display; and/or obtaining, according to the first association relation, comparison information between the stored content of the original corpus and the content of the pre-training corpus corresponding to the original corpus, and outputting the comparison information to a visualization device for display.
6. The method of claim 5, wherein the comparison information indicates one or more of identical content, relatively deleted content, and relatively newly added content between the stored content of the original corpus and the content of the pre-training corpus corresponding to the original corpus.
7. The method according to any one of claims 1 to 6, wherein after the recording of the first association between each of the different second identities and the first identity of the corresponding first storage entity, the method further comprises: performing corpus labeling processing based on each second storage entity in the one or more second storage entities to obtain one or more fine-tuning corpora respectively corresponding to different third identifications; recording a second association relationship between each third identifier in the different third identifiers and a second identifier of a corresponding second storage entity; Under the condition that the stored content of the second corpus is changed, updating the content of the pre-training corpus or the content of the fine-tuning corpus which is associated with the changed content in the second corpus according to the second association relation, wherein the second corpus comprises any one of the pre-training corpus or the one or more fine-tuning corpuses.
8. The method according to claim 7, wherein in the case of the stored content of the second corpus being changed, updating the content of the pre-training corpus or the content of the fine-tuning corpus associated with the changed content in the second corpus according to the second association relationship, includes: Under the condition that the stored content of the second corpus is changed, obtaining updated content related to the changed content in the second corpus based on the changed content in the second corpus, a prompt and a fine tuning model obtained by pre-training, wherein the prompt is used for indicating the fine tuning model to output content conforming to the form of the pre-training corpus or content conforming to the form of the fine tuning corpus, and the fine tuning model is a model obtained by training by utilizing a sample corpus, an adjustment label of the sample corpus and a prompt corresponding to an adjustment target; and according to the second association relation, updating the content of the pre-training corpus or the content of the fine-tuning corpus associated with the changed content in the second corpus into the updated content associated with the changed content in the second corpus.
9. A corpus processing apparatus, the apparatus comprising: the corpus storage module is used for storing the content of the original corpus according to the storage specification and the entity identification to obtain one or more first storage entities, and each first storage entity corresponds to one first identification; The corpus association module is used for preprocessing the corpus based on each first storage entity to obtain one or more second storage entities respectively corresponding to different second identifications, and recording a first association relation between each second identification in the different second identifications and the first identification of the corresponding first storage entity, wherein each second storage entity in the one or more second storage entities is used for indicating the content of the pre-training corpus corresponding to the original corpus; the corpus synchronization module is used for updating the content of the original corpus or the content of the pre-trained corpus associated with the changed content in the first corpus according to the first association relation under the condition that the stored content of the first corpus is changed, wherein the first corpus comprises the original corpus or the pre-trained corpus.
10. The apparatus of claim 9, wherein the content of the original corpus comprises text content, and the storage specification comprises one or more editable text blocks; the corpus storage module is specifically configured to: Dividing the content of the original corpus according to the storage specification, and extracting the divided content to obtain a plurality of storage entities; Marking different entity identifications for each storage entity in the plurality of storage entities respectively to obtain a plurality of first storage entities respectively corresponding to different first identifications; the corpus association module is specifically configured to: Each first storage entity in the plurality of first storage entities is subjected to corpus extraction and corpus cleaning respectively to obtain a second storage entity corresponding to the first storage entity; And marking different second identifiers for each second storage entity respectively to obtain a plurality of second storage entities respectively corresponding to the different second identifiers.
11. The apparatus of claim 10, wherein the corpus synchronization module is specifically configured to: Under the condition that the content of the stored first corpus is changed, determining a first identification of a first updated storage entity of the content change according to an editable text block forming the changed content, wherein the first corpus comprises the original corpus; Extracting and cleaning the corpus of the first updating storage entity to obtain a second updating storage entity corresponding to the first identifier of the first updating storage entity; and updating the second storage entity marked by the second identifier corresponding to the first identifier of the first updating storage entity into the second updating storage entity corresponding to the first identifier of the first updating storage entity according to the first association relation.
12. The apparatus of claim 10, wherein the corpus synchronization module is specifically configured to: under the condition that the content of the stored first corpus is changed, determining a second identification of a second updated storage entity of the content change according to an editable text block forming the changed content, wherein the first corpus comprises the pre-training corpus; and updating the editable text block in the second storage entity marked by the first identifier corresponding to the second identifier of the second updating storage entity into the editable text block in the second updating storage entity according to the first association relation.
13. The apparatus according to any one of claims 9 to 12, further comprising a visual presentation module for: outputting the stored content of the original corpus and the content of the pre-training corpus corresponding to the original corpus to a visualization device for display according to the first association relation, and/or, According to the first association relation, obtaining comparison information between the stored content of the original corpus and the content of the pre-training corpus corresponding to the original corpus, and outputting the comparison information to a visualization device for display.
14. The apparatus of claim 13, wherein the comparison information is used to indicate one or more of identical content, relatively deleted content, and relatively newly added content between the content of the stored original corpus and the pre-training corpus corresponding to the original corpus.
15. The apparatus according to any one of claims 9 to 14, wherein the corpus association module is further configured to: After the first association relation between each second identifier in the different second identifiers and the first identifier of the corresponding first storage entity is recorded, corpus labeling processing is carried out on the basis of each second storage entity in the one or more second storage entities, so as to obtain one or more fine-tuning corpuses respectively corresponding to different third identifiers; the corpus synchronization module is further configured to: Under the condition that the stored content of the second corpus is changed, updating the content of the pre-training corpus or the content of the fine-tuning corpus which is associated with the changed content in the second corpus according to the second association relation, wherein the second corpus comprises any one of the pre-training corpus or the one or more fine-tuning corpuses.
16. The apparatus of claim 15, wherein the corpus synchronization module is specifically configured to: Under the condition that the stored content of the second corpus is changed, obtaining updated content related to the changed content in the second corpus based on the changed content in the second corpus, a prompt and a fine tuning model obtained by pre-training, wherein the prompt is used for indicating the fine tuning model to output content conforming to the form of the pre-training corpus or content conforming to the form of the fine tuning corpus, and the fine tuning model is a model obtained by training by utilizing a sample corpus, an adjustment label of the sample corpus and a prompt corresponding to an adjustment target; and according to the second association relation, updating the content of the pre-training corpus or the content of the fine-tuning corpus associated with the changed content in the second corpus into the updated content associated with the changed content in the second corpus.
17. A computing device, comprising a processor and a memory, wherein the processor is connected to the memory; the memory is configured to store one or more programs; and the one or more programs, when executed by the processor, cause the processor to implement the method of any one of claims 1 to 8.
18. A computer readable storage medium comprising a computer program, characterized in that the computer program, when run on a computing device, causes the computing device to perform the method of any of claims 1 to 8.
19. A chip comprising one or more interface circuits and one or more processors, the interface circuits to receive signals from a memory of a computing device and to send the signals to the processor, the signals comprising computer instructions stored in the memory, which when executed by the processor, cause the computing device to perform the method of any one of claims 1 to 8.
20. A computer program product comprising a computer program which, when executed by a computing device, causes the computing device to perform the method of any of claims 1 to 8.
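
The comparison information of claims 6 and 14 (identical, relatively deleted, and relatively newly added content) can be illustrated with a word-level diff. The following is a minimal sketch only, assuming Python's standard `difflib` module as the comparison engine; the function name `compare_blocks` and the word-level granularity are illustrative assumptions and are not specified by the application.

```python
import difflib


def compare_blocks(raw_text: str, pretrain_text: str) -> dict:
    """Classify word spans as same, deleted (raw only), or added (pre-training only).

    Hypothetical helper: the application does not prescribe a diff algorithm.
    """
    raw_words, pre_words = raw_text.split(), pretrain_text.split()
    result = {"same": [], "deleted": [], "added": []}
    matcher = difflib.SequenceMatcher(a=raw_words, b=pre_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result["same"].append(" ".join(raw_words[i1:i2]))
        if tag in ("delete", "replace"):
            # Present in the original corpus but absent from the pre-training corpus.
            result["deleted"].append(" ".join(raw_words[i1:i2]))
        if tag in ("insert", "replace"):
            # Present in the pre-training corpus but absent from the original corpus.
            result["added"].append(" ".join(pre_words[j1:j2]))
    return result


# Navigation residue in the raw block is reported as deleted after cleaning.
info = compare_blocks("Home | Login | The quick brown fox", "The quick brown fox")
print(info)
```

A visualization device could render the three lists with different highlighting; pairing each raw block with its cleaned counterpart would rely on the recorded first association relation.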

Description

Corpus processing method and device

Technical Field

The embodiments of the application relate to the technical field of artificial intelligence, and in particular to a corpus processing method and device.

Background

As technology and demand evolve, artificial intelligence is moving from "model-centric" to "data-centric" development, where the data mainly consists of corpora. High-quality corpora improve model accuracy, so model training places ever higher requirements on corpus quality and ever greater demands on corpus quantity.

In the related art, an original corpus is mainly obtained through data acquisition, the original corpus is cleaned to obtain a pre-training corpus, and the pre-training corpus is labeled to obtain a labeled corpus for training a model. For example, to obtain the pre-training corpus, online data resources can be processed through export, format conversion, log recording, copy acquisition, and the like to obtain an original corpus; the content of the original corpus is then extracted and uploaded to a data cleaning platform, which performs corpus preprocessing, namely corpus extraction and corpus cleaning, to obtain the pre-training corpus.

However, in practice, online data resources tend to change over time, so the original corpus changes with them, and the pre-training corpus may itself be corrected or modified. If the contents of the original corpus and the pre-training corpus then fall out of sync, the pre-training corpus is no longer traceable, which reduces the credibility and security of both corpora and leads to low corpus quality.

Disclosure of Invention

To solve the above technical problem, the application provides a corpus processing method and device.
In the corpus processing method, the association relation between the first identifier of the first storage entity, which indicates content of the original corpus, and the second identifier of the second storage entity, which indicates content of the pre-training corpus, enables bidirectional synchronization of content between the original corpus and the pre-training corpus. This improves the credibility and security of both corpora and thereby their quality.

According to a first aspect, an embodiment of the application provides a corpus processing method, comprising: storing the content of an original corpus according to a storage specification and entity identifiers to obtain one or more first storage entities; performing corpus preprocessing based on each first storage entity to obtain one or more second storage entities respectively corresponding to different second identifiers, and recording a first association relation between each of the different second identifiers and the first identifier of the corresponding first storage entity, wherein each of the one or more second storage entities indicates content of a pre-training corpus corresponding to the original corpus; and, in the case that the stored content of a first corpus is changed, updating, according to the first association relation, the content of the original corpus or the content of the pre-training corpus that is associated with the changed content in the first corpus, wherein the first corpus comprises the original corpus or the pre-training corpus.
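
The method of the first aspect can be illustrated with a small in-memory sketch. Everything named here is an assumption made for illustration: the class `CorpusStore`, splitting on blank lines as the storage specification, UUIDs as identifiers, and a regular-expression `clean` step standing in for corpus extraction and cleaning. The application itself does not prescribe any of these choices.

```python
import re
import uuid
from dataclasses import dataclass, field


@dataclass
class CorpusStore:
    """Hypothetical in-memory store; all names are illustrative."""
    raw: dict = field(default_factory=dict)              # first identifier -> raw text block
    pretrain: dict = field(default_factory=dict)         # second identifier -> cleaned block
    first_to_second: dict = field(default_factory=dict)  # first association relation
    second_to_first: dict = field(default_factory=dict)  # same relation, reverse lookup

    @staticmethod
    def clean(text: str) -> str:
        # Stand-in for corpus extraction and cleaning: drop tags, collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def ingest(self, document: str) -> list:
        """Split into editable text blocks, store each, preprocess, record the association."""
        first_ids = []
        for block in document.split("\n\n"):
            if not block.strip():
                continue
            first_id, second_id = uuid.uuid4().hex, uuid.uuid4().hex
            self.raw[first_id] = block
            self.pretrain[second_id] = self.clean(block)
            self.first_to_second[first_id] = second_id
            self.second_to_first[second_id] = first_id
            first_ids.append(first_id)
        return first_ids

    def update_raw(self, first_id: str, new_block: str) -> None:
        """Original corpus changed: re-clean and overwrite the associated pre-training entity."""
        self.raw[first_id] = new_block
        self.pretrain[self.first_to_second[first_id]] = self.clean(new_block)

    def update_pretrain(self, second_id: str, new_block: str) -> None:
        """Pre-training corpus changed: write the edited block back to the associated raw entity."""
        self.pretrain[second_id] = new_block
        self.raw[self.second_to_first[second_id]] = new_block


store = CorpusStore()
ids = store.ingest("<p>Hello   world</p>\n\nSecond   block")
store.update_raw(ids[0], "<p>Hello   corpus</p>")
print(store.pretrain[store.first_to_second[ids[0]]])
```

Keeping the relation in both directions is one simple way to make each update a single lookup; a persistent implementation might instead keep it in a database table keyed on both identifiers.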
In the embodiments of the application, the first storage entity corresponding to the original corpus and the second storage entity corresponding to the pre-training corpus are marked with an associated first identifier and second identifier respectively, which keeps the content of the original corpus associated with the content of the pre-training corpus. When the content of either corpus is changed and stored, the associated content of the other corpus can therefore be updated in step. This achieves bidirectional synchronization of content between the original corpus and the pre-training corpus, ensures that the two corpora are homologous, updatable, and traceable, and improves the credibility, security, and quality of the pre-training corpus.

According to the first aspect, the content of the original corpus comprises text content, and the storage specification comprises one or more editable text blocks; storing the content of the original corpus according to the storage specification and entity identifiers to obtain one or more first storage entities, each first storage entity corresponding to one first identifier, comprises: dividing the content of the original corpus ac