CN-122019729-A - Method, device and system for generating question-answer pairs based on portable document

Abstract

The application provides a method, a device and a system for generating question-answer pairs based on a portable document. The method comprises constructing a structured data set corresponding to the portable document, dividing the structured data set based on a title level of the portable document to obtain a first text block set, and generating, based on each text block in the first text block set, question-answer pairs corresponding to the text block, wherein each question-answer pair consists of a question and an answer corresponding to the question. With the method provided by the embodiment of the application, high-quality question-answer pairs can be generated based on the portable document.

Inventors

LIU XIN
WEI XIAOXUAN

Assignees

深圳市元征科技股份有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-03-17

Claims (10)

1. A method for generating question-answer pairs based on a portable document, comprising: constructing a structured data set corresponding to the portable document, wherein the structured data set comprises a title level of the portable document; dividing the structured data set based on the title level to obtain a first text block set; and generating, based on each text block in the first text block set, question-answer pairs corresponding to the text block, wherein each question-answer pair consists of a question and an answer corresponding to the question.
2. The method of claim 1, wherein the structured data set comprises non-textual information of the portable document.
3. The method of claim 1, wherein the dividing the structured data set based on the title level to obtain a first text block set comprises: dividing the structured data set with the title level as a boundary to obtain a basic text block set; for each text block in the basic text block set, calculating the semantic similarity of the texts in the text block; if the basic text block set does not include a text block meeting a first condition, determining the basic text block set as the first text block set; and if the basic text block set includes a text block meeting the first condition, re-dividing the structured data set to obtain the first text block set; wherein the first condition comprises that the semantic similarities corresponding to the text block include a semantic similarity lower than a preset threshold.
4. The method of claim 1, wherein the dividing the structured data set based on the title level to obtain a first text block set comprises: dividing the structured data set with the title level as a boundary to obtain a basic text block set; and merging a first text block and a second text block in the basic text block set to obtain the first text block set, wherein the first text block and the second text block are adjacent, and the first text block or the second text block is an empty text block.
5. The method of claim 1, wherein the generating, based on each text block in the first text block set, question-answer pairs corresponding to the text block comprises: for each text block in the first text block set, generating, based on the text block and metadata of the text block, question-answer pairs associated with the metadata of the text block, wherein the metadata of the text block includes information indicating a source of the text block and/or information indicating an attribute of the text block.
6. The method of claim 1, wherein the generating question-answer pairs corresponding to the text block comprises one or more of: if the text length of the text block is greater than or equal to a preset length, dividing the text block into a plurality of sub-text blocks and generating, for each sub-text block in the plurality of sub-text blocks, question-answer pairs corresponding to the sub-text block; and generating a first number of question-answer pairs corresponding to the text block, wherein the first number is related to the text length of the text block and/or is greater than or equal to a preset number.
7. The method of claim 1, wherein after the generating question-answer pairs corresponding to the text block, the method further comprises: performing a quantitative check and a qualitative check on the question-answer pairs.
8. The method of any of claims 1-7, wherein after the generating question-answer pairs corresponding to the text block, the method further comprises: determining, among the question-answer pairs corresponding to the text blocks, question-answer pairs with similar questions; calculating the semantic similarity of the answers in the question-answer pairs with similar questions; determining, based on the semantic similarity, question-answer pairs with similar answers from the question-answer pairs with similar questions; and eliminating one or more question-answer pairs from the question-answer pairs with similar answers.
9. An apparatus for generating question-answer pairs based on a portable document, the apparatus comprising: a construction module for constructing a structured data set corresponding to the portable document, wherein the structured data set comprises a title level of the portable document; a dividing module for dividing the structured data set based on the title level to obtain a first text block set; and a generating module for generating, based on each text block in the first text block set, question-answer pairs corresponding to the text block, wherein each question-answer pair consists of a question and an answer corresponding to the question.
10. A system for generating question-answer pairs based on a portable document, comprising: a memory for storing code; and a processor for executing the code stored in the memory to perform the method of any one of claims 1 to 8.
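
The de-duplication steps recited in claim 8 can be illustrated with a minimal sketch. It is not the applicant's implementation: the function names, the default thresholds, and the use of a character-level ratio from Python's `difflib` in place of a semantic similarity model are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float) -> bool:
    """Character-level ratio from difflib; an assumed stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def deduplicate(pairs, q_threshold=0.8, a_threshold=0.8):
    """Drop a pair when an already kept pair has a similar question and a similar answer.

    `pairs` is a list of (question, answer) tuples. Both thresholds are
    hypothetical defaults, not values taken from the application.
    """
    kept = []
    for question, answer in pairs:
        duplicate = any(
            similar(question, kq, q_threshold) and similar(answer, ka, a_threshold)
            for kq, ka in kept
        )
        if not duplicate:
            kept.append((question, answer))
    return kept
```

In this sketch the first pair of each group of near-duplicates is kept and later ones are eliminated; the claim itself leaves open which pairs are eliminated.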

Description

Method, device and system for generating question-answer pairs based on portable document

Technical Field

The application belongs to the field of data processing, and particularly relates to a method, a device and a system for generating question-answer pairs based on a portable document.

Background

Currently, many scenarios require question-answer pairs to be generated based on portable documents (e.g., portable document format (PDF) documents). How to generate high-quality question-answer pairs based on portable documents is a problem to be solved.

Disclosure of Invention

The embodiment of the application provides a method, a device and a system for generating question-answer pairs based on a portable document, which can generate high-quality question-answer pairs based on the portable document.

The first aspect of the embodiment of the application provides a method for generating question-answer pairs based on a portable document, which comprises constructing a structured data set corresponding to the portable document, dividing the structured data set based on a title level of the portable document to obtain a first text block set, and generating, based on each text block in the first text block set, question-answer pairs corresponding to the text block, wherein each question-answer pair consists of a question and an answer corresponding to the question.

In some implementations, the structured data set includes non-textual information of the portable document.
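
The division step of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application: the `Element` and `TextBlock` classes, their fields, and the convention that title level 1 is the top level are hypothetical, since the text above does not specify a schema for the structured data set.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One entry of the structured data set (hypothetical schema)."""
    kind: str        # "title" or "text"
    text: str
    level: int = 0   # title level; 1 is assumed to be the top level


@dataclass
class TextBlock:
    title_path: list[str] = field(default_factory=list)
    lines: list[str] = field(default_factory=list)


def split_by_title(elements: list[Element]) -> list[TextBlock]:
    """Start a new text block at every title and record its chain of parent titles."""
    blocks: list[TextBlock] = []
    path: list[str] = []
    current = TextBlock()
    for el in elements:
        if el.kind == "title":
            if current.title_path or current.lines:
                blocks.append(current)
            # Keep the ancestors above this level, then append this title.
            path = path[: el.level - 1] + [el.text]
            current = TextBlock(title_path=list(path))
        else:
            current.lines.append(el.text)
    if current.title_path or current.lines:
        blocks.append(current)
    return blocks
```

Each resulting block carries the chain of titles above it, which is one possible way to keep track of where a text block came from in the document.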
In some implementations, dividing the structured data set based on the title level to obtain a first text block set includes dividing the structured data set with the title level as a boundary to obtain a basic text block set, calculating, for each text block in the basic text block set, the semantic similarity of the texts in the text block, determining the basic text block set as the first text block set if the basic text block set does not include a text block meeting a first condition, and re-dividing the structured data set to obtain the first text block set if the basic text block set includes a text block meeting the first condition, wherein the first condition includes that the semantic similarities corresponding to the text block include a semantic similarity lower than a preset threshold.

In some implementations, dividing the structured data set based on the title level to obtain a first text block set includes dividing the structured data set with the title level as a boundary to obtain a basic text block set, and merging a first text block and a second text block in the basic text block set to obtain the first text block set, wherein the first text block and the second text block are adjacent, and the first text block or the second text block is an empty text block.

In some implementations, generating the question-answer pairs corresponding to the text blocks based on each text block in the first text block set includes generating, for each text block in the first text block set, question-answer pairs associated with metadata of the text block based on the text block and the metadata of the text block, wherein the metadata of the text block includes information indicating a source of the text block and/or information indicating an attribute of the text block.
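
The first-condition check and the merging of empty text blocks described above can be sketched as follows. The sketch rests on several assumptions that are not in the application: a bag-of-words cosine stands in for whatever semantic similarity model is actually used, similarity is compared between adjacent sentences of a block, empty blocks are merged into the block that follows them, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; an assumed stand-in for an embedding model."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def meets_first_condition(sentences: list[str], threshold: float) -> bool:
    """True if any pair of adjacent sentences scores below the preset threshold."""
    return any(
        cosine(s, t) < threshold for s, t in zip(sentences, sentences[1:])
    )


def merge_empty_blocks(blocks: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge each empty (title, body) block with an adjacent non-empty block."""
    merged: list[tuple[str, str]] = []
    pending_title = ""
    for title, body in blocks:
        if pending_title:
            title = f"{pending_title} / {title}"
            pending_title = ""
        if body.strip():
            merged.append((title, body))
        else:
            pending_title = title  # carry the title into the next block
    if pending_title:  # trailing empty block: attach it to the previous block
        if merged:
            prev_title, prev_body = merged.pop()
            merged.append((f"{prev_title} / {pending_title}", prev_body))
        else:
            merged.append((pending_title, ""))
    return merged
```

A block for which `meets_first_condition` returns true would be a candidate for re-division; how the re-division itself is carried out is not specified in this passage, so it is left out of the sketch.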
In some implementations, generating the question-answer pairs corresponding to the text blocks includes one or more of the following: if the text length of a text block is greater than or equal to a preset length, dividing the text block into a plurality of sub-text blocks and generating, for each of the plurality of sub-text blocks, question-answer pairs corresponding to the sub-text block; and generating a first number of question-answer pairs corresponding to the text block, wherein the first number is related to the text length of the text block and/or is greater than or equal to a preset number.

In some implementations, after the question-answer pairs corresponding to the text blocks are generated, the method further includes performing a quantitative check and a quality check on the question-answer pairs.

In some implementations, after the question-answer pairs corresponding to the text blocks are generated, the method further includes determining, among the question-answer pairs corresponding to the text blocks, question-answer pairs with similar questions, calculating the semantic similarity of the answers in the question-answer pairs with similar questions, determining, based on the semantic similarity, question-answer pairs with similar answers from the question-answer pairs with similar questions, and eliminating one or more question-answer pairs from the question-answer pairs with similar answers.

The second aspect of the embodiment of the application provides a device for generating question-answer pairs based on a portable document, which comprises a construction module, a dividing module a