CN-122020177-A - Training sample generation method and device, electronic equipment and storage medium
Abstract
The application provides a training sample generation method and apparatus, an electronic device, and a storage medium. The method comprises: obtaining an initial question-answer data pair; inputting the initial question-answer data pair into a first preset model to obtain target question-answer data pairs with different expression perspectives; inputting a legal document into a second preset model to obtain a hierarchical data structure output after hierarchical structured extraction; inputting the legal document into a third preset model to obtain a plurality of knowledge items output after contextualized analysis of terms; and integrating the target question-answer data pairs, the hierarchical data structure, and the knowledge items to form a comprehensive training sample set for training a legal large model. Embodiments of the application thereby help improve the training precision of the legal large model.
Inventors
- LI FENG
- Zhang Lekang
- MAO JIAWEI
Assignees
- 武汉新致数字科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-09
Claims (10)
- 1. A training sample generation method, the method comprising: acquiring an initial question-answer data pair, wherein the initial question-answer data pair is generated based on a preset legal document; inputting the initial question-answer data pair into a first preset model to obtain target question-answer data pairs that are generated by the first preset model based on the initial question-answer data pair and have different expression perspectives, wherein the expression perspectives comprise at least two of a prosecution perspective, a defense perspective, and an adjudication perspective; inputting the legal document into a second preset model to obtain a hierarchical data structure output after the second preset model performs hierarchical structured extraction on the legal document, wherein the hierarchical data structure comprises at least a basic information layer, a fact-and-evidence logic chain layer, and a judgment logic layer; inputting the legal document into a third preset model to obtain a plurality of knowledge items output after the third preset model performs contextualized analysis of terms from the legal document, wherein each knowledge item comprises at least an applicable context of a term in the legal document, and an incorrect meaning and a correct meaning generated based on the applicable context; and integrating the target question-answer data pairs, the hierarchical data structure, and the knowledge items to form a comprehensive training sample set for training a legal large model.
- 2. The method according to claim 1, wherein inputting the initial question-answer data pair into the first preset model to obtain the target question-answer data pairs that are generated by the first preset model based on the initial question-answer data pair and have different expression perspectives comprises: calling a preset first script to execute the following steps: constructing a prompt template comprising a perspective control instruction and a legal constraint instruction, wherein the legal constraint instruction constrains the content generated by the first preset model to be logically consistent with the initial question-answer data pair; calling an interface of the first preset model, with the initial question-answer data pair and the prompt template as input to the first preset model; receiving first output content returned by the first preset model, wherein the first output content comprises question-answer pairs that restate the initial question-answer data pair from a plurality of different perspectives; and parsing the first output content and verifying its format to obtain the target question-answer data pairs that meet preset requirements.
- 3. The method according to claim 1, wherein inputting the legal document into the second preset model to obtain the hierarchical data structure output after the second preset model performs hierarchical structured extraction on the legal document comprises: calling a preset second script to execute the following steps: constructing a prompt template for hierarchical structured extraction, wherein the prompt template comprises prompt information for the basic information layer, prompt information for the fact and evidence layer, and prompt information for the judgment logic layer; calling an interface of the second preset model, with the legal document and the prompt template as input to the second preset model; receiving and parsing second output content returned by the second preset model; and performing missing-value filling and standardization on the second output content to obtain the hierarchical data structure.
- 4. The method according to claim 1, wherein inputting the legal document into the third preset model to obtain the plurality of knowledge items output after the third preset model performs contextualized analysis of terms from the legal document comprises: calling a preset third script to execute the following steps: constructing a prompt template for contextualized analysis of terms, wherein the prompt template comprises an instruction for term identification and context association, an extraction instruction for original-text interpretation and applicability analysis, and a generation instruction for the incorrect meaning and the correct meaning; calling an interface of the third preset model, with the legal document and the prompt template as input to the third preset model; receiving and parsing third output content returned by the third preset model based on the prompt template; and structurally organizing the third output content to obtain the knowledge items.
- 5. The method according to claim 1, wherein integrating the target question-answer data pairs, the hierarchical data structure, and the knowledge items comprises: converting the target question-answer data pairs, the hierarchical data structure, and the knowledge items each into a unified training data format, and mixing and sampling them to form the comprehensive training sample set.
- 6. The method according to claim 2, wherein the legal constraint instruction in the prompt template is further used to instruct the first preset model to return a self-check result on the consistency between the criminal charges and the sentences in the first output content, so that the first script filters the generated content based on the self-check result.
- 7. The method according to claim 6, wherein the first script automatically generates, for each initial question-answer data pair, a plurality of prompt templates comprising different combinations of expression perspective and procedural stage parameters, so as to drive the first preset model to generate multi-dimensional target question-answer data pairs.
- 8. A training sample generation apparatus, the apparatus comprising: an acquisition module, configured to acquire an initial question-answer data pair, wherein the initial question-answer data pair is generated based on a preset legal document; a first module, configured to input the initial question-answer data pair into a first preset model to obtain target question-answer data pairs that are generated by the first preset model based on the initial question-answer data pair and have different expression perspectives, wherein the expression perspectives comprise at least two of a prosecution perspective, a defense perspective, and an adjudication perspective; a second module, configured to input the legal document into a second preset model to obtain a hierarchical data structure output after the second preset model performs hierarchical structured extraction on the legal document, wherein the hierarchical data structure comprises at least a basic information layer, a fact-and-evidence logic chain layer, and a judgment logic layer; a third module, configured to input the legal document into a third preset model to obtain a plurality of knowledge items output after the third preset model performs contextualized analysis of terms from the legal document, wherein each knowledge item comprises at least an applicable context of a term in the legal document, and an incorrect meaning and a correct meaning generated based on the applicable context; and an integration module, configured to integrate the target question-answer data pairs, the hierarchical data structure, and the knowledge items to form a comprehensive training sample set for training a legal large model.
- 9. An electronic device comprising a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor, wherein, when the electronic device operates, the processor communicates with the storage medium via the bus and executes the machine-readable instructions to perform the steps of the training sample generation method according to any one of claims 1 to 7.
- 10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of the training sample generation method according to any one of claims 1 to 7.
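As a rough illustration of the prompt construction and filtering recited in claims 2, 6 and 7, the sketch below builds one prompt per combination of expression perspective and procedural stage, then keeps only model outputs that parse correctly and pass the self-check. It is not the applicant's implementation: the stage names, the JSON keys (`question`, `answer`, `self_check_consistent`), and the function names `build_prompts` and `parse_and_filter` are all assumptions made for the example, and the call to the first preset model is deliberately left out.

```python
import itertools
import json

# The three perspectives come from the claims; the stage names are assumed,
# since the source does not enumerate the procedural stages.
PERSPECTIVES = ["prosecution", "defense", "adjudication"]
STAGES = ["investigation", "trial", "appeal"]


def build_prompts(qa_pair, perspectives=PERSPECTIVES, stages=STAGES):
    """Return one prompt string per (perspective, stage) combination."""
    prompts = []
    for perspective, stage in itertools.product(perspectives, stages):
        prompts.append(
            f"Restate the question-answer pair below from the {perspective} "
            f"perspective at the {stage} stage.\n"
            "Constraint: stay logically consistent with the original pair.\n"
            "Return JSON with keys question, answer, self_check_consistent.\n"
            f"Q: {qa_pair['question']}\nA: {qa_pair['answer']}"
        )
    return prompts


def parse_and_filter(raw_outputs):
    """Keep outputs that are valid JSON objects, carry all required keys,
    and report a passing self-check."""
    required = {"question", "answer", "self_check_consistent"}
    kept = []
    for raw in raw_outputs:
        try:
            item = json.loads(raw)
        except json.JSONDecodeError:
            continue  # format verification failed
        if not isinstance(item, dict) or not required <= item.keys():
            continue  # wrong shape or missing fields
        if item["self_check_consistent"] is True:
            kept.append({"question": item["question"], "answer": item["answer"]})
    return kept
```

In this sketch each string returned by `build_prompts` would be sent to the model, and the raw replies passed to `parse_and_filter`. Trusting a model's own self-check flag is a design choice of the example; an independent consistency check could replace it.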
Description
Training sample generation method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a training sample generation method and apparatus, an electronic device, and a storage medium.
Background
As large language models are applied ever more widely in legal services, judicial assistance, intelligent consultation, and similar fields, the requirements on their professionalism, accuracy, and robustness have risen sharply. The core of training a high-performance large model for the legal field is obtaining training data that is high-quality, multi-dimensional, and rich in professional logic. At present, however, the data available for training such models is scarce and homogeneous, so training samples are not rich enough. A solution is therefore needed.
Disclosure of Invention
In view of the above, embodiments of the present application provide a training sample generation method and apparatus, an electronic device, and a storage medium, which aim to efficiently mine and construct multi-dimensional, high-quality training samples from legal documents through a systematic, automated data processing flow, so as to significantly improve the training precision of a legal large model.
In a first aspect, an embodiment of the present application provides a training sample generation method, the method comprising: acquiring an initial question-answer data pair, wherein the initial question-answer data pair is generated based on a preset legal document; inputting the initial question-answer data pair into a first preset model to obtain target question-answer data pairs that are generated by the first preset model based on the initial question-answer data pair and have different expression perspectives, wherein the expression perspectives comprise at least two of a prosecution perspective, a defense perspective, and an adjudication perspective; inputting the legal document into a second preset model to obtain a hierarchical data structure output after the second preset model performs hierarchical structured extraction on the legal document, wherein the hierarchical data structure comprises at least a basic information layer, a fact-and-evidence logic chain layer, and a judgment logic layer; inputting the legal document into a third preset model to obtain a plurality of knowledge items output after the third preset model performs contextualized analysis of terms from the legal document, wherein each knowledge item comprises at least an applicable context of a term in the legal document, and an incorrect meaning and a correct meaning generated based on the applicable context; and integrating the target question-answer data pairs, the hierarchical data structure, and the knowledge items to form a comprehensive training sample set for training a legal large model.
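The final integration step can be sketched as follows, assuming a simple instruction/output record as the unified training format. The source does not define that format, so the record fields, the input dictionary keys (`question`, `answer`, `term`, `context`, `correct`, `incorrect`), and the function names are hypothetical, and the seeded shuffle merely stands in for whatever mixing and sampling strategy is actually used.

```python
import json
import random


def qa_to_record(pair):
    """Convert a target question-answer pair into the assumed unified format."""
    return {"instruction": pair["question"], "output": pair["answer"], "source": "qa"}


def structure_to_record(doc_structure):
    """Serialize a hierarchical data structure as the expected extraction output."""
    return {
        "instruction": "Extract the basic information, the fact-and-evidence "
                       "chain, and the judgment logic from the document.",
        "output": json.dumps(doc_structure, ensure_ascii=False, sort_keys=True),
        "source": "structure",
    }


def knowledge_to_record(item):
    """Turn a knowledge item into a term-interpretation sample."""
    return {
        "instruction": f"In the context '{item['context']}', "
                       f"what does '{item['term']}' mean?",
        "output": f"Correct: {item['correct']} "
                  f"Incorrect reading to avoid: {item['incorrect']}",
        "source": "knowledge",
    }


def integrate(qa_pairs, structures, knowledge_items, k=None, seed=0):
    """Convert all three sources to one format, mix them, and optionally
    keep only the first k records after shuffling."""
    records = (
        [qa_to_record(p) for p in qa_pairs]
        + [structure_to_record(s) for s in structures]
        + [knowledge_to_record(i) for i in knowledge_items]
    )
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    rng.shuffle(records)
    return records if k is None else records[:k]
```

Keeping a `source` field on every record is an assumption of this sketch; it makes it easy to audit or rebalance the proportion of each sample type after mixing.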
In one possible embodiment, inputting the initial question-answer data pair into the first preset model to obtain the target question-answer data pairs that are generated by the first preset model based on the initial question-answer data pair and have different expression perspectives comprises: calling a preset first script to execute the following steps: constructing a prompt template comprising a perspective control instruction and a legal constraint instruction, wherein the legal constraint instruction constrains the content generated by the first preset model to be logically consistent with the initial question-answer data pair; calling an interface of the first preset model, with the initial question-answer data pair and the prompt template as input to the first preset model; receiving first output content returned by the first preset model, wherein the first output content comprises question-answer pairs that restate the initial question-answer data pair from a plurality of different perspectives; and parsing the first output content and verifying its format to obtain the target question-answer data pairs that meet preset requirements.
In one possible embodiment, inputting the legal document into the second preset model to obtain the hierarchical data structure output after the second preset model performs hierarchical structured extraction on the legal document comprises: calling a preset second script to execute the following steps: constructing a prompt template for hierarchical structured extraction, wherein the prompt template comprises prompt information for the basic information layer, prompt information for the fact and evidence layer, and prompt information for the judgment logic layer; calling an interface of the second preset model, with the legal document and the prompt template as input to the second preset model; receiving and analyzing second output