CN-121996679-A - Large language model-oriented question and answer data generation method and device and computing equipment
Abstract
The invention discloses a question-answer data generation method for a large language model, which comprises the steps of obtaining a plurality of original technical documents in a plurality of document formats, converting each original technical document into a structured document in a unified format, extracting multi-level document structures and domain features from the structured document, cleaning the structured document according to the multi-level document structures and the domain features to obtain a structured cleaning document, constructing a domain context for the structured cleaning document based on a domain knowledge base, carrying out data enhancement on the structured cleaning document according to the domain context to obtain cleaning enhancement data, carrying out data expansion based on the cleaning enhancement data by using the large language model, and generating question-answer data. Based on this, a high-quality question-answer pair having field pertinence can be generated.
Inventors
- CHEN JIAN
- QIAO NAN
- Feng kaituo
Assignees
- 北京并行科技股份有限公司
- 北京北龙超级云计算有限责任公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260123
Claims (11)
- 1. A method of generating question-answer data for a large language model, executed in a computing device, comprising: acquiring a plurality of original technical documents in a plurality of document formats, and converting each original technical document into a structured document in a uniform format; extracting a multi-level document structure and field characteristics from the structured document, and cleaning the structured document according to the multi-level document structure and the field characteristics to obtain a structured cleaning document; building a domain context for the structured cleaning document based on a domain knowledge base, and carrying out data enhancement on the structured cleaning document according to the domain context to obtain cleaning enhancement data; and carrying out data expansion based on the cleaning enhancement data by utilizing a large language model, and generating question-answer data, wherein the question-answer data comprises a plurality of question-answer pairs, and the question-answer pairs comprise questions and corresponding answers.
- 2. The method of claim 1, wherein cleaning the structured document according to the multi-level document structure and the domain features comprises: generating a prompt word instruction according to the multi-level document structure and the field characteristics; and cleaning the structured document according to the prompt word instruction by using a large language model.
- 3. The method of claim 1 or 2, wherein extracting multi-level document structure and domain features from the structured document comprises: carrying out document structure identification on the structured document to obtain a multi-level document structure, wherein the multi-level document structure comprises a format layer, a structure layer and a content layer, the format layer comprises a header footer, a watermark and a page number, the structure layer comprises a catalog, a chapter, a abstract and a reference, and the content layer comprises a table, a formula and technical parameters; extracting domain features from the structured document, wherein the domain features comprise domain classification information, professional term labels, technical parameters and knowledge type marks.
- 4. A method as claimed in any one of claims 1 to 3, wherein using a large language model, data augmentation is performed based on the cleaning enhancement data and question-answer data is generated, comprising: extracting a plurality of words from the cleaning enhancement data by using a large language model, and generating corresponding synonymous or near-sense replacement words for each word, wherein the words comprise keywords and/or phrases; Generating a plurality of semantically similar text variants based on each word and the synonymous or near-synonymous alternative words corresponding to the words; A plurality of question-answer pairs is generated based on the plurality of semantically similar text variants.
- 5. The method of any one of claims 1-4, wherein the domain knowledge base comprises a domain classification system, a technical term system, a technical parameter system and a knowledge relationship graph, and the domain context comprises domain classification information, key concepts, technical parameters, related documents and domain constraints; Data enhancement of the structured cleaning document according to the domain context includes: And carrying out rewrite enhancement, expansion enhancement and multi-angle generation on the structured cleaning document according to the field context, wherein the rewrite enhancement comprises synonymous rewrite, visual angle conversion and detailed adjustment, the expansion enhancement comprises principle explanation, application scene, comparison analysis and parameter derivation, and the multi-angle generation comprises design angle generation, analysis angle generation, evaluation angle generation and application angle generation.
- 6. The method of any one of claims 1-5, further comprising: Carrying out quality evaluation on each question-answer pair according to a multi-dimensional evaluation index to obtain a corresponding quality evaluation result, wherein the multi-dimensional evaluation index comprises an accuracy evaluation index, an integrity evaluation index, a diversity evaluation index and an availability evaluation index; And updating the domain knowledge base according to the quality evaluation result, and fine-tuning the large language model.
- 7. The method of any of claims 1-6, further comprising: And generating a export file with a plurality of export formats based on the question and answer data to export the question and answer data, wherein the export files with the plurality of export formats comprise a CSV format, a OpenAI JSONL format and a JSON format.
- 8. The method of any one of claim 1 to 7, wherein, The plurality of document formats include DOCX format, PDF format and TXT format; the unified format is JSON format.
- 9. A question-answer data generation apparatus for deployment in a computing device, the apparatus comprising: The acquisition module is suitable for acquiring a plurality of original technical documents in a plurality of document formats and converting each original technical document into a structured document in a uniform format; The intelligent cleaning module is suitable for extracting a multi-level document structure and field characteristics from the structured document, and cleaning the structured document according to the multi-level document structure and the field characteristics to obtain a structured cleaning document; The data enhancement module is suitable for constructing a field context for the structured cleaning document based on a field knowledge base, and carrying out data enhancement on the structured cleaning document according to the field context to obtain cleaning enhancement data, wherein the field knowledge base comprises a field classification system, a professional term system, a technical parameter system and a knowledge relation map; And the question-answer generation module is suitable for carrying out data expansion based on the cleaning enhancement data by utilizing a large language model and generating question-answer data, wherein the question-answer data comprises a plurality of question-answer pairs, and the question-answer pairs comprise questions and corresponding answers.
- 10. A computing device, comprising: At least one processor, and A memory storing program instructions, wherein the program instructions are configured to be adapted to be processed by the at least one processor, the program instructions comprising instructions for processing the method of any of claims 1-8.
- 11. A computer program product comprising computer program instructions which, when executed by a processor, implement the method of any of claims 1-8.
Description
Large language model-oriented question and answer data generation method and device and computing equipment Technical Field The invention relates to the technical field of natural language processing, in particular to a question-answer data generation method, a question-answer data generation device and computing equipment for a large language model. Background In the development process in the technical field of profession, a large number of technical documents including academic research papers, engineering design documents, experimental reports, patent documents, and the like need to be processed. These documents are often characterized by heterogeneous multisource, complex structure, high noise data, strong field specialization, etc. At present, the technology for document cleaning and question-answer pair generation mainly has the following problems: 1) Limitations of the general cleaning method. The existing document cleaning tool mainly aims at a general document, simply removes format information such as header footers, watermarks and the like, cannot identify and process special content structures (such as professional data tables, parameter charts, formula deducing processes and the like) in the professional field, lacks semantic understanding for multi-disciplinary crossed content, and is easy to delete valuable technical information or preserve irrelevant content by mistake. 2) Data enhancement lacks domain adaptation. The general LLM data enhancement method cannot accurately understand the concept of the professional field, and the generated enhancement data may contain a factual error. 3) Question and answer pairs are of low quality. The generated questions lack field pertinence, core knowledge points are difficult to cover, the types of the questions are single, and the accuracy of answers is difficult to guarantee, particularly when quantitative parameters and technical details are involved. 4) Quality assessment and feedback mechanisms are lacking. The generated questions and answers cannot be automatically evaluated, the accuracy and the practicability in the professional field are improved, a check and error correction mechanism of a field expert knowledge base is lacked, and iterative optimization and continuous improvement are difficult to carry out. In view of this, a method for generating question-answer data for a large language model is needed to solve the problems in the above technical solutions. Disclosure of Invention Therefore, the invention provides a method and a device for generating question-answer data oriented to a large language model, so as to solve or at least alleviate the problems. According to one aspect of the invention, a large language model-oriented question and answer data generation method is provided and executed in a computing device, and comprises the steps of obtaining a plurality of original technical documents in a plurality of document formats, converting each original technical document into a structured document in a unified format, extracting a multi-level document structure and domain features from the structured document, cleaning the structured document according to the multi-level document structure and the domain features to obtain a structured cleaning document, constructing a domain context for the structured cleaning document based on a domain knowledge base, and carrying out data enhancement on the structured cleaning document according to the domain context to obtain cleaning enhancement data, and carrying out data expansion based on the cleaning enhancement data by utilizing a large language model, and generating question and answer data, wherein the question and answer data comprises a plurality of question and answer pairs. Optionally, in the large language model-oriented question-answer data generation method, the structured document is cleaned according to the multi-level document structure and the domain features, and the method comprises the steps of generating a prompt word instruction according to the multi-level document structure and the domain features, and cleaning the structured document according to the prompt word instruction by utilizing a large language model. Optionally, in the large language model-oriented question-answer data generation method, the method comprises the steps of carrying out document structure identification on the structured document to obtain a multi-level document structure, wherein the multi-level document structure comprises a format layer, a structure layer and a content layer, the format layer comprises a header footer, a watermark and a page number, the structure layer comprises a catalog, a chapter, a abstract and a reference, the content layer comprises a table, a formula and a technical parameter, and the field feature is extracted from the structured document, wherein the field feature comprises field classification information, a professional term label, a technical parameter and a knowledge