CN-122020183-A - Data set construction method for post-training of a vertical large language model for hydropower construction

Abstract

The invention discloses a data set construction method for post-training of a vertical large language model, and belongs to the technical field of vertical large language model training. The technical problem to be solved is that post-training data sets for vertical large language models in the field of hydropower construction suffer from severe long-tail data distribution and noise pollution. In the technical scheme, a non-parametric box-plot outlier detection algorithm is used to regulate the data distribution: the quartiles and the interquartile range of the sample frequencies of the data set are calculated and outliers are identified; quantitative data enhancement is applied to business-critical, sparse long-tail outliers; non-enhanceable outliers such as noise and acquisition errors are removed directly; and an optimized training data set is finally obtained.

Inventors

Ai Jianfei
WANG YONGWANG
LU ZHIMING
LI JINMIN
YANG XUE
SONG QING
GENG JIAMENG
SHENG GUOYU
WANG XING
LIU XIAONING
MA CHAO

Assignees

中国电建集团北京勘测设计研究院有限公司

Dates

Publication Date: 20260512
Application Date: 20260331

Claims (10)

1. A data set construction method for post-training of a vertical large language model for hydropower construction, characterized by comprising the following steps: S1, determining an original data set for post-training of the vertical large language model, and extracting the sample frequency of each class in the original data set to form a sample frequency set; S2, calculating relevant statistics of the sample frequency set based on a box-plot algorithm, and identifying and separating outliers in the sample frequency set according to the statistics; S3, classifying the outliers to obtain enhanceable outliers and non-enhanceable outliers; S4, performing data enhancement on the sample classes corresponding to the enhanceable outliers, and removing the sample classes corresponding to the non-enhanceable outliers; S5, obtaining the optimized training data set of the vertical large language model after the processing is completed.
2. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 1, wherein in step S2, the relevant statistics include: a first quartile, which is the 25th percentile of the sample frequency set, and a third quartile, which is the 75th percentile of the sample frequency set.
3. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 2, wherein in step S2, the specific process of identifying outliers according to the statistics is as follows: calculating the interquartile range from the first quartile and the third quartile by the formula $IQR = Q_3 - Q_1$, wherein $IQR$ is the interquartile range, $Q_1$ is the first quartile, and $Q_3$ is the third quartile; then judging any sample frequency that satisfies [inequality missing] or [inequality missing] to be an outlier.
4. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 1, wherein in step S3, the outliers are classified as follows: judging whether the sample class corresponding to the outlier is a business-critical long-tail scenario and whether new samples that are semantically reasonable and technically compliant can be generated for it; if both conditions are met, the outlier is judged to be enhanceable; otherwise, the outlier is judged to be non-enhanceable.
5. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 4, wherein in step S4, the requirement for data enhancement of the sample class corresponding to an enhanceable outlier is: increasing the sample frequency of the sample class to [value missing], so that the enhanced sample frequency $f'$ is obtained, the specific expression being: [formula missing].
6. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 5, wherein the enhanced sample frequency is calculated by the formula $f' = f + \Delta n$, wherein $f$ is the sample frequency before enhancement and $\Delta n$ is the number of added samples; when [condition missing], the minimum value of $\Delta n$ is [symbol missing], calculated by the formula [formula missing], wherein $\lceil \cdot \rceil$ is the round-up operation.
7. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 1, wherein in step S3, the sample class corresponding to a non-enhanceable outlier is a spurious class produced by data noise, acquisition errors, non-target-domain content, or log parsing errors.
8. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 7, wherein in step S4, the sample class corresponding to the non-enhanceable outlier is removed by discarding that sample class from the original data set D, the specific formula being: [formula missing], wherein [symbol missing] is the data set obtained after discarding sample class [symbol missing].
9. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 1, wherein the vertical large language model is a large language model for the hydropower construction business field, and the business-critical long-tail scenarios comprise sample scenarios related to rare faults, novel technology applications, or small-scale equipment in hydropower construction.
10. The data set construction method for post-training of a vertical large language model for hydropower construction according to claim 1, wherein the sample frequency is the number of valid samples under each sample class in the original data set, a valid sample being sample data that matches the training target of the vertical large language model.
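
Steps S1 to S5 of the claims can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation. The fence multiplier `k=1.5` is the conventional box-plot value and is assumed here because the inequalities of claim 3 are not legible in this text. The enhancement target (the first quartile) is likewise an assumption, since the formulas of claims 5 and 6 are missing. All function names, the class labels, the `critical_labels` set standing in for the judgment of claim 4, and the `generate` callback standing in for sample generation are hypothetical.

```python
from collections import Counter
from math import ceil
from statistics import quantiles


def sample_frequencies(dataset):
    """S1: count samples per class; dataset is a list of (label, text) pairs."""
    return Counter(label for label, _ in dataset)


def find_outliers(freqs, k=1.5):
    """S2: box-plot outlier detection on the class frequencies.

    k=1.5 is the conventional multiplier (an assumption, see lead-in).
    """
    q1, _, q3 = quantiles(freqs.values(), n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = {c for c, f in freqs.items() if f < lower or f > upper}
    return q1, q3, outliers


def min_added_samples(freq, target):
    """S4: smallest non-negative integer delta with freq + delta >= target."""
    return max(0, ceil(target - freq))


def build_dataset(dataset, critical_labels, generate):
    """S1 to S5; critical_labels and generate are hypothetical stand-ins."""
    freqs = sample_frequencies(dataset)
    q1, _, outliers = find_outliers(freqs)
    enhanceable = outliers & set(critical_labels)      # S3
    removable = outliers - enhanceable                 # S3
    result = [(c, t) for c, t in dataset if c not in removable]  # S4 removal
    for c in sorted(enhanceable):                      # S4 enhancement
        delta = min_added_samples(freqs[c], q1)        # assumed target: Q1
        result.extend((c, t) for t in generate(c, delta))
    return result                                      # S5


# Hypothetical class counts: seven ordinary classes, one rare but
# business-critical class, and one spurious class from log parsing errors.
counts = {
    "dam_grouting": 90, "tunnel_lining": 95, "concrete_pour": 100,
    "rebar": 105, "formwork": 110, "excavation": 115, "slope_support": 120,
    "rare_turbine_fault": 2, "log_parse_garbage": 4,
}
dataset = [(c, f"{c} sample {i}") for c, n in counts.items() for i in range(n)]
optimized = build_dataset(
    dataset,
    critical_labels={"rare_turbine_fault"},
    generate=lambda c, n: [f"{c} synthetic {i}" for i in range(n)],
)
new_freqs = sample_frequencies(optimized)
```

With these counts the quartiles are 90 and 110, so both small classes fall below the lower fence of 60. The business-critical class is raised to the assumed target frequency, and the spurious class is discarded. In practice the `generate` callback would have to produce samples that are semantically reasonable and technically compliant, as claim 4 requires; the placeholder strings here do not.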

Description

Data set construction method for post-training of a vertical large language model for hydropower construction

Technical Field

The invention belongs to the technical field of vertical large language model training, and particularly relates to a data set construction method for post-training of a vertical large language model for hydropower construction.

Background

With the rapid development of large language model technology, vertical large language models can be accurately adapted to the application requirements of professional business scenarios in various industries, and their research, development, and deployment in the field of hydropower construction business are continuously accelerating. Post-training is a key step for optimizing the professional suitability of a vertical large language model and for improving its reasoning and generation capability in industry-specific scenarios. The quality of the training data set directly determines the final application effect of the vertical large language model, and has become a core factor in the research and deployment of vertical large language models in the field of hydropower construction.

When post-training of a vertical large language model is carried out in the hydropower construction business field at home and abroad, the conventional practice in the industry is to use the scenario data actually collected by the business directly as the post-training data set. The data set undergoes no targeted distribution regulation or noise screening aimed at the core requirements of model training.
Related patent document: publication number CN120633797A, publication date 2025.09.12, discloses a training method for a vertical large model in the field of building regulations, which constructs a dynamic regulation knowledge base and a structured knowledge graph, builds a dynamic data supply system comprising a multi-type building database and an incremental knowledge pool, provides training data for the model, and generates a large model for the building vertical field.

The prior art represented by the above document has at least the following unsolved technical problems or defects:

(1) The problems of long-tail distribution and noise pollution in the training data set of a vertical large model are not solved. The evidence is that the document only builds a knowledge graph and an incremental data supply system, and contains no technical design for regulating the data set distribution or removing noise samples.

(2) No non-parametric detection method that is independent of a normal data distribution is provided. The evidence is that the document does not mention non-parametric techniques such as box plots and quartiles, and therefore cannot adapt to the irregular distribution that vertical-domain data exhibits in practice.

(3) The scheme is applicable only to the field of building regulations, and does not address the processing of samples from long-tail scenarios in hydropower construction, such as rare faults and novel technologies.

In view of this, the present invention is proposed.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a data set construction method for post-training of a vertical large language model for hydropower construction, which solves the problems of scarce long-tail samples, mixed-in noise data, and unbalanced data distribution in vertical large model data sets.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a data set construction method for post-training of a vertical large language model for hydropower construction comprises the following steps:

S1, determining an original data set for post-training of the vertical large language model, and extracting the sample frequency of each class in the original data set to form a sample frequency set;
S2, calculating relevant statistics of the sample frequency set based on a box-plot algorithm, and identifying and separating outliers in the sample frequency set according to the statistics;
S3, classifying the outliers to obtain enhanceable outliers and non-enhanceable outliers;
S4, performing data enhancement on the sample classes corresponding to the enhanceable outliers, and removing the sample classes corresponding to the non-enhanceable outliers;
S5, obtaining the optimized training data set of the vertical large language model after the processing is completed.

Further, in step S2, the relevant statistics include: a first quartile, which is the 25th percentile of the sample frequency set, and a third quartile, which is the 75th percentile of the sample frequency set.

Further, in step S2, the specific process of identifying outliers according to the statistics is: calculating the interquartile range between the first q