CN-121996761-A - Method, device, medium and electronic equipment for generating text question-answer pair data set

Abstract

The application discloses a method, a device, a medium and electronic equipment for generating a text question-answer pair data set, and relates to the field of natural language processing. The method comprises: performing hierarchical standardized analysis on a text to be processed to determine a chapter logic hierarchical structure; extracting candidate knowledge points based on the chapter logic hierarchical structure, and reversely aggregating semantic contexts with the candidate knowledge points as cores to generate structured quadruples; locating each structured quadruple in the text to be processed to obtain the text position corresponding to that structured quadruple; aggregating, based on the similarity between knowledge points in different structured quadruples, the structured quadruples that meet a similarity requirement together with their corresponding text positions, to generate an index table corresponding to the knowledge points; and generating, based on the index table, a question-answer pair data set corresponding to the text to be processed. By using the index table to associate knowledge points scattered across different hierarchical units, the method overcomes the defect that traditional methods are limited to the local context and cannot reflect the overall knowledge structure.
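
The aggregation step summarized above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `Quadruple` class, the `embed` and `cosine` helpers, the bag-of-words stand-in for a semantic vector, the greedy clustering order and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Quadruple:
    """Hypothetical container for (chapter, section, knowledge point, context)."""
    chapter: str
    section: str
    knowledge_point: str
    context: str
    span: tuple  # (start, end) character offsets in the source text


def embed(text):
    # Assumption: a bag of lowercase words stands in for a real semantic vector.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index_table(quads, threshold=0.5):
    """Greedy aggregation: a quadruple joins the first cluster whose
    representative knowledge point is similar enough, else starts a new one."""
    clusters = []  # list of (representative_vector, [Quadruple, ...])
    for q in quads:
        vec = embed(q.knowledge_point)
        for rep, members in clusters:
            if cosine(vec, rep) > threshold:
                members.append(q)
                break
        else:
            clusters.append((vec, [q]))
    # Key each index entry by the first knowledge point seen in its cluster.
    return {members[0].knowledge_point: members for _, members in clusters}


quads = [
    Quadruple("Chapter 1", "1.1", "gradient descent", "…", (0, 16)),
    Quadruple("Chapter 2", "2.1", "binary tree", "…", (40, 51)),
    Quadruple("Chapter 3", "3.2", "stochastic gradient descent", "…", (90, 117)),
]
index_table = build_index_table(quads)
for point, members in index_table.items():
    print(point, [(m.chapter, m.section, m.span) for m in members])
```

Each index entry keeps the chapter, section and text span of every member, so a question generated from one entry can draw on passages from different chapters while remaining traceable to the original text.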

Inventors

HOU XIA
LIU CHENYANG

Assignees

北京信息科技大学 (Beijing Information Science and Technology University)

Dates

Publication Date: 20260508
Application Date: 20260114

Claims (10)

1. A method for generating a text question-answer pair data set, comprising: performing hierarchical standardized analysis on a text to be processed to determine a chapter logic hierarchical structure; extracting candidate knowledge points based on the chapter logic hierarchical structure, and reversely aggregating semantic contexts with the candidate knowledge points as cores to generate structured quadruples, wherein each structured quadruple comprises a chapter, a section, a knowledge point and a context; locating each structured quadruple in the text to be processed to obtain a text position corresponding to the structured quadruple; aggregating, based on the similarity between knowledge points in different structured quadruples, the structured quadruples that meet a similarity requirement together with their corresponding text positions, to generate an index table corresponding to the knowledge points, wherein the index table is used to establish and record association relations between knowledge points that are semantically related and distributed across different hierarchical structure units in the text to be processed; and generating, based on the index table, a question-answer pair data set corresponding to the text to be processed.
2. The method of claim 1, wherein reversely aggregating semantic contexts with the candidate knowledge points as cores to generate structured quadruples comprises: extracting knowledge points in each chapter using a TF-IDF keyword extraction strategy; and generating the structured quadruples based on the knowledge points in each section, the section itself, and the context in which the knowledge points are located.
3. The method of claim 1, wherein locating each structured quadruple in the text to be processed to obtain a text position corresponding to the structured quadruple comprises: determining the start-stop positions of the knowledge points of the structured quadruple in the text to be processed through a three-level matching strategy; wherein the three-level matching strategy comprises: matching the knowledge points of the structured quadruple in the text to be processed to obtain the start-stop position of the structured quadruple in the text to be processed; in the case that the knowledge points of the structured quadruple are not successfully matched in the text to be processed, calculating the similarity between the knowledge points of the structured quadruple and each text segment in the text to be processed, and taking the position of a text segment whose similarity is greater than or equal to a first preset threshold as the start-stop position of the structured quadruple in the text to be processed; and in the case that the similarity is smaller than the first preset threshold, calling a large language model to determine the start-stop position of the structured quadruple in the text to be processed.
4. The method of claim 3, wherein calling a large language model to determine the start-stop position of the structured quadruple in the text to be processed comprises: dividing the text to be processed into a plurality of text blocks with overlapping content; and calling the large language model on each text block to perform semantic recognition, so as to determine the start-stop position of the structured quadruple in the text to be processed.
5. The method according to claim 1, wherein aggregating the structured quadruples that meet the similarity requirement and the corresponding text positions, based on the similarity between knowledge points in different structured quadruples, to generate the index table corresponding to the knowledge points comprises: generating a semantic vector for the knowledge point in each structured quadruple; and calculating the similarity between the semantic vectors, and aggregating any two structured quadruples whose semantic vectors have a similarity greater than a second preset threshold, together with their corresponding text positions, into the same cluster to generate the index table corresponding to the knowledge points.
6. The method according to claim 1, wherein generating the question-answer pair data set corresponding to the text to be processed based on the index table comprises: based on each structured quadruple in the index table and its located context, converting the knowledge point and context of the structured quadruple into an answer expression through a large language model; generating a corresponding question based on the answer expression, and constructing a question-answer pair; performing a consistency check on the question-answer pair, wherein the consistency check comprises judging whether the answer directly answers the question and judging whether the answer is supported by the context; and, in the case that the consistency check is passed, constructing the question-answer pair data set from the question-answer pairs that pass the consistency check, until the question-answer pair data set corresponding to the text to be processed is obtained.
7. The method of claim 6, wherein after the consistency check of the question-answer pair and prior to constructing the question-answer pair data set, the method further comprises: de-duplicating the questions in the question-answer pairs that pass the check.
8. A device for generating a text question-answer pair data set, comprising: an analysis unit configured to perform hierarchical standardized analysis on a text to be processed and determine a chapter logic hierarchical structure; an extraction unit configured to extract candidate knowledge points based on the chapter logic hierarchical structure, and reversely aggregate semantic contexts with the candidate knowledge points as cores to generate structured quadruples, wherein each structured quadruple comprises a chapter, a section, a knowledge point and a context; a positioning unit configured to locate each structured quadruple in the text to be processed to obtain a text position corresponding to the structured quadruple; an aggregation unit configured to aggregate, based on the similarity between knowledge points in different structured quadruples, the structured quadruples that meet a similarity requirement together with their corresponding text positions, to generate an index table corresponding to the knowledge points, wherein the index table is used to establish and record association relations between knowledge points that are semantically related and distributed across different hierarchical structure units in the text to be processed; and a generating unit configured to generate, based on the index table, a question-answer pair data set corresponding to the text to be processed.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method of any one of claims 1 to 7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1 to 7.
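
As an illustration of the three-level matching strategy of claims 3 and 4, the following Python sketch tries an exact match, then a fuzzy match, then a model-based fallback over overlapping blocks. It is a sketch under stated assumptions rather than the claimed implementation: the function names `locate` and `overlapping_blocks`, the use of `difflib.SequenceMatcher` as the similarity measure, the sliding-window segmentation, the block size, the overlap, and the 0.8 threshold are illustrative choices, and `llm_locate` is a hypothetical caller-supplied callable standing in for a large language model.

```python
from difflib import SequenceMatcher


def overlapping_blocks(text, size=200, overlap=50):
    """Yield (offset, block) pairs; consecutive blocks share `overlap` characters.
    Assumes size > overlap."""
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield start, text[start:start + size]


def locate(point, text, fuzzy_threshold=0.8, llm_locate=None):
    """Return (start, end, level) for a knowledge point, or None if not found."""
    # Level 1: exact string match.
    start = text.find(point)
    if start != -1:
        return start, start + len(point), "exact"

    # Level 2: best fuzzy match over sliding windows of the same length.
    width = len(point)
    best_ratio, best_start = 0.0, -1
    for i in range(max(len(text) - width + 1, 1)):
        ratio = SequenceMatcher(None, point, text[i:i + width]).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, i
    if best_ratio >= fuzzy_threshold:
        return best_start, best_start + width, "fuzzy"

    # Level 3: delegate to the (hypothetical) model-based locator, block by block.
    if llm_locate is not None:
        for offset, block in overlapping_blocks(text):
            hit = llm_locate(point, block)  # expected: (start, end) inside block, or None
            if hit is not None:
                return offset + hit[0], offset + hit[1], "llm"
    return None


text = "Chapter 1. Gradient descent minimises a loss function by iterative updates."
exact = locate("Gradient descent", text)
fuzzy = locate("Gradient decsent", text)  # transposed letters, no exact match
fallback = locate("zzzz qqqq", text, llm_locate=lambda point, block: (0, 7))
missing = locate("zzzz qqqq", text)
print(exact, fuzzy, fallback, missing)
```

The sliding-window comparison is quadratic and only suitable for short texts; a practical system would likely compare against pre-segmented sentences or use vector similarity instead.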

Description

Method, device, medium and electronic equipment for generating text question-answer pair data set

Technical Field

The application relates to the technical field of natural language processing, and in particular to a method, a device, a medium and electronic equipment for generating a text question-answer pair data set.

Background

With the rapid development of artificial intelligence technology in intelligent education, adaptive learning systems and knowledge services, automatically building high-quality question-answer data sets from massive text resources (especially textbooks and e-books) has become a key task for supporting intelligent question answering, online assessment and large-model fine-tuning.

At present, the related art on question-answer pair generation is often limited to a single paragraph or local text: the basic idea is to generate questions and answers from an isolated paragraph or sentence. Both the early generation methods based on sequence-to-sequence neural network models and the more recent methods based on pre-trained language models fall into this category. Although such methods can generate fluent, grammatical questions on particular data sets, they are fundamentally limited by their narrow contextual field of view. They often cannot capture knowledge links beyond the current paragraph, whereas the core concepts, principles and application scenarios in a textbook tend to be scattered across different chapters and sections. As a result, questions generated from a single paragraph are mostly "surface questions" that test facts and memorization; they lack in-depth examination and multi-angle understanding of the knowledge system, and therefore struggle to meet the requirements for knowledge association and reasoning ability in higher education and professional training.
Therefore, how to generate question-answer pairs with cross-level knowledge connections, clear logic and verifiable answers for books that have their own hierarchical knowledge structure, such as textbooks and e-books, has become a problem to be solved.

Disclosure of Invention

In view of the above, the application provides a method, a device, a medium and electronic equipment for generating a text question-answer pair data set, which aim to solve the problems that existing methods for generating text question-answer pairs are limited to single paragraphs or local contexts, have difficulty reflecting the inter-chapter knowledge connections inside the text, and produce answers that lack traceability to, and verifiability against, evidence in the original text.

In a first aspect, the present application provides a method for generating a text question-answer pair data set, comprising: performing hierarchical standardized analysis on a text to be processed to determine a chapter logic hierarchical structure; extracting candidate knowledge points based on the chapter logic hierarchical structure, and reversely aggregating semantic contexts with the candidate knowledge points as cores to generate structured quadruples, wherein each structured quadruple comprises a chapter, a section, a knowledge point and a context; locating each structured quadruple in the text to be processed to obtain a text position corresponding to the structured quadruple; aggregating, based on the similarity between knowledge points in different structured quadruples, the structured quadruples that meet a similarity requirement together with their corresponding text positions, to generate an index table corresponding to the knowledge points, wherein the index table is used to establish and record association relations between knowledge points that are semantically related and distributed across different hierarchical structure units in the text to be processed; and generating, based on the index table, a question-answer pair data set corresponding to the text to be processed.

Optionally, reversely aggregating semantic contexts with the candidate knowledge points as cores to generate structured quadruples comprises: extracting the knowledge points in each section using a TF-IDF keyword extraction strategy, and generating the structured quadruples based on the knowledge points in each section, the sections, and the contexts in which the knowledge points are located.

Optionally, locating each structured quadruple in the text to be processed to obtain a text position corresponding to the structured quadruple comprises: determining the start-stop position of the knowledge point of the structured quadruple in the text to be processed through a three-level matching strategy. The three-level matching strategy comprises: matching the knowledge points of the structured quadruple in the text to be processed to obtain the start-stop position of the structured quadruple in the text to be processed, calculating similarity between the knowledge points in the structured quadruple and each text