CN-122019703-A - Large model customized training method and system for industry application

Abstract

The invention relates to the technical field of large model training, and in particular to a customized large model training method and system for industry applications. A hierarchical knowledge graph is constructed so that the model systematically learns the internal organization and complex associations of industry knowledge. During training, a graph-based association prediction auxiliary task is introduced and integrated into the backbone semantic understanding task, while a dynamic curriculum learning strategy keeps learning progressive and convergence stable. Finally, the loss function is optimized by incorporating a knowledge association constraint, so that the model both understands the hierarchy of industry knowledge and reliably generates answers that conform to industry specifications, are logically rigorous, and are highly specialized. This addresses the technical problem that the prior art struggles to understand and follow the internal hierarchical structure and complex associations of domain knowledge, leaving model output lacking in professionalism, consistency, and logic.

Inventors

HAN DAPENG
XIA JUNJIE
ZHANG MINGXU
ZHANG HAONAN

Assignees

重庆高斯智算科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251229

Claims (8)

1. A customized large model training method for industry applications, characterized by comprising the following steps: acquiring and preprocessing an original corpus of a specific industry using a hybrid protection mechanism based on a trusted execution environment and differential privacy, and structuring the original corpus to construct an initial training corpus containing question-answer pairs; based on the initial training corpus, identifying and extracting industry knowledge units and the association relations between them, and constructing a hierarchical knowledge graph reflecting knowledge dependency and association structures; pre-training a base large language model based on the hierarchical knowledge graph, wherein the training process includes an auxiliary task in which the model learns to predict the association strength between any two knowledge units in the graph, and the association information is integrated into the backbone semantic understanding and generation tasks of the model; in the pre-training and subsequent supervised fine-tuning stages, adopting a dynamic curriculum learning strategy in which the difficulty and scope of the training samples fed to the model are dynamically adjusted from shallow to deep and from simple to complex according to the depth of the knowledge units in the hierarchical knowledge graph and the complexity of their associations; and iteratively optimizing the model through a loss function incorporating a knowledge association constraint, finally obtaining a customized large model that understands the internal hierarchical structure of industry knowledge and generates answers conforming to industry specifications accordingly.
2. The customized large model training method for industry applications according to claim 1, wherein acquiring and preprocessing the original corpus of a specific industry using the hybrid protection mechanism based on a trusted execution environment and differential privacy specifically comprises: in the data upload stage, the data provider performs vectorization and feature extraction on the original corpus inside a local trusted execution environment enclave; after an intermediate feature representation is generated, it is converted into ciphertext features using a distance-preserving encryption algorithm and uploaded to a cloud vector database, ensuring that the original plaintext data never leaves the local security boundary; and in the graph construction step of the model training stage, a differential privacy algorithm is applied to the entity relations extracted from the corpus, and calibrated Laplace noise is added to the forward relation weights of the hierarchical knowledge graph, so as to preserve the global structural utility of the graph as far as possible while protecting the privacy of individual data associations.
3. The customized large model training method for industry applications according to claim 2, wherein the step of identifying and extracting industry knowledge units and the association relations between them based on the initial training corpus, and constructing a hierarchical knowledge graph reflecting knowledge dependency and association structures, comprises: extracting core knowledge entities from the initial training corpus as basic knowledge units, using an industry term dictionary and named entity recognition; mining multiple kinds of semantic relations among the basic knowledge units based on contextual co-occurrence information in the corpus, dependency parsing, and predefined ontology rules, the relations including but not limited to parent-child relations, component relations, attribute association relations, and causal reasoning relations; and constructing a weighted hierarchical knowledge graph with the knowledge units as nodes and the semantic relations as directed edges, wherein the edge weights are initialized to association strength values calculated from statistics or rules.
4. The customized large model training method for industry applications according to claim 3, wherein the auxiliary task is implemented by an independent association prediction module that takes the context encoding vectors of two knowledge units as input, outputs a scalar prediction, and computes a loss against the ground-truth association strength that is predefined or dynamically updated in the graph, thereby guiding the model to implicitly learn the topological structure of industry knowledge.
5. The customized large model training method for industry applications according to claim 4, wherein, in the step of adopting the dynamic curriculum learning strategy in the pre-training and subsequent supervised fine-tuning stages, the training process is divided into a plurality of stages, each corresponding to one training subset; the training subsets are constructed from the hierarchical knowledge graph: in the early stages, knowledge units at shallow levels of the graph with simple association relations, together with their related corpora, are selected as the main training samples, and as training proceeds, knowledge units at deeper levels that have complex associations with other knowledge units, together with their related corpora, are gradually included, thereby forming a dynamically evolving training data stream.
6. The customized large model training method for industry applications according to claim 5, wherein the training subsets are constructed by: calculating a comprehensive complexity score for each knowledge unit from the hierarchical knowledge graph, the score being a function of the depth of the knowledge unit in the graph, the number of edges directly connected to it, and the average weight of those edges; dividing the knowledge units into a plurality of ordered difficulty levels according to the distribution of the comprehensive complexity scores of all knowledge units; and assigning a target difficulty level interval to each training stage, thereby determining the range of knowledge units covered by the training samples of that stage.
7. The customized large model training method for industry applications according to claim 6, wherein, in the step of adopting the dynamic curriculum learning strategy in the pre-training and subsequent supervised fine-tuning stages, if the model's mastery of a deep or complexly associated knowledge unit falls below a preset threshold, the method automatically backtracks and mixes in more training samples of prerequisite or associated foundational knowledge, thereby achieving adaptive difficulty adjustment and knowledge consolidation.
8. A customized large model training system for industry applications, for executing the customized large model training method for industry applications according to claim 7, comprising: a data security preprocessing module, for acquiring and preprocessing an original corpus through a hybrid protection mechanism based on a trusted execution environment and differential privacy to construct an initial training corpus; a hierarchical knowledge graph construction module, for identifying and extracting knowledge units and association relations from the initial training corpus, and for constructing and storing a hierarchical knowledge graph; a model training engine module, for loading a base large language model and executing pre-training and supervised fine-tuning that include the auxiliary task based on the hierarchical knowledge graph; a dynamic curriculum learning scheduling module, for dynamically planning the sample difficulty and scope of different training stages according to the hierarchical knowledge graph and delivering training data to the model training engine module; and an optimizer module, for calculating the loss function incorporating the knowledge association constraint and guiding the iterative update of model parameters.
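
The complexity scoring and stage assignment recited in claim 6 can be illustrated with a minimal Python sketch. The claim states only that the score is a function of node depth, the number of directly connected edges, and the average weight of those edges; the linear combination, the coefficients `a`, `b`, `c`, the equal-size bucketing, the function names, and the sample graph below are all assumptions made for illustration, not details taken from the patent.

```python
from collections import deque


def node_depths(edges, roots):
    """Breadth-first depth of every node reachable from the root nodes."""
    children = {}
    for src, dst, _weight in edges:
        children.setdefault(src, []).append(dst)
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def complexity_scores(edges, roots, a=1.0, b=0.5, c=1.0):
    """Assumed linear score: a*depth + b*degree + c*mean incident edge weight."""
    depth = node_depths(edges, roots)
    incident = {node: [] for node in depth}
    for src, dst, weight in edges:
        incident.setdefault(src, []).append(weight)
        incident.setdefault(dst, []).append(weight)
    scores = {}
    for node, weights in incident.items():
        degree = len(weights)
        mean_weight = sum(weights) / degree if degree else 0.0
        # Nodes unreachable from any root are treated as depth 0 (assumption).
        scores[node] = a * depth.get(node, 0) + b * degree + c * mean_weight
    return scores


def difficulty_levels(scores, num_levels):
    """Rank units by score and cut the ranking into ordered, equal-size levels."""
    ranked = sorted(scores, key=lambda node: (scores[node], node))
    size = -(-len(ranked) // num_levels)  # ceiling division
    return {node: index // size for index, node in enumerate(ranked)}


def stage_units(levels, low, high):
    """Knowledge units whose difficulty level lies in the stage's target interval."""
    return sorted(node for node, level in levels.items() if low <= level <= high)


# Hypothetical weighted graph: (parent unit, child unit, association strength).
edges = [
    ("valve", "gate valve", 0.9),
    ("valve", "ball valve", 0.8),
    ("gate valve", "stem", 0.6),
    ("stem", "packing", 0.4),
]
scores = complexity_scores(edges, roots=["valve"])
levels = difficulty_levels(scores, num_levels=3)
early_stage = stage_units(levels, 0, 0)   # ['ball valve', 'valve']
late_stage = stage_units(levels, 0, 2)    # all five units
```

In this sketch a later stage simply widens its level interval, so earlier material stays in the sample pool; the backtracking of claim 7 could be approximated by temporarily lowering `low` and `high` when a mastery metric drops below a threshold.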

Description

Large model customized training method and system for industry application

Technical Field

The invention relates to the technical field of large model training, and in particular to a customized large model training method and system for industry applications.

Background

In recent years, with the rapid development of large-scale pre-trained language models, their application potential in vertical industry fields has become increasingly prominent. An industry-customized large model can deeply understand the technical terms, business processes, and knowledge systems of a specific field, and can therefore provide more accurate and reliable analysis, decision support, and content generation services, making it one of the key technologies driving the digital transformation and intelligent upgrading of industry. How to use private industry data efficiently and securely to train a specialized model that masters deep industry knowledge and follows domain specifications and logic has therefore become a shared focus of industry and academia. However, the prior art has difficulty deeply understanding and following the inherent hierarchical structure and complex association relations of domain knowledge, so its output falls short in professionalism, consistency, and logic.

Disclosure of Invention

The invention aims to provide a customized large model training method and system for industry applications, which solve the technical problem that the prior art has difficulty deeply understanding and following the internal hierarchical structure and complex association relations of domain knowledge, so that its output falls short in professionalism, consistency, and logic.
To achieve the above object, the present invention provides a customized large model training method for industry applications, comprising the following steps: acquiring and preprocessing an original corpus of a specific industry using a hybrid protection mechanism based on a trusted execution environment and differential privacy, and structuring the original corpus to construct an initial training corpus containing question-answer pairs; based on the initial training corpus, identifying and extracting industry knowledge units and the association relations between them, and constructing a hierarchical knowledge graph reflecting knowledge dependency and association structures; pre-training a base large language model based on the hierarchical knowledge graph, wherein the training process includes an auxiliary task in which the model learns to predict the association strength between any two knowledge units in the graph, and the association information is integrated into the backbone semantic understanding and generation tasks of the model; in the pre-training and subsequent supervised fine-tuning stages, adopting a dynamic curriculum learning strategy in which the difficulty and scope of the training samples fed to the model are dynamically adjusted from shallow to deep and from simple to complex according to the depth of the knowledge units in the hierarchical knowledge graph and the complexity of their associations; and iteratively optimizing the model through a loss function incorporating a knowledge association constraint, finally obtaining a customized large model that understands the internal hierarchical structure of industry knowledge and generates answers conforming to industry specifications accordingly.
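
One plausible reading of "a loss function incorporating a knowledge association constraint" is a weighted sum of the main language modeling loss and the auxiliary association prediction loss. The following minimal Python sketch shows that reading only. The text above does not give the form of the loss, so the additive combination, the mean squared error for the auxiliary term, the weight `lam`, and all function names are assumptions.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target_index])


def association_loss(predicted, target):
    """Mean squared error between predicted and graph association strengths."""
    pairs = list(zip(predicted, target))
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


def total_loss(token_probs, token_targets, pred_assoc, true_assoc, lam=0.1):
    """Assumed combined objective: L = L_lm + lam * L_assoc."""
    lm = sum(
        cross_entropy(probs, target)
        for probs, target in zip(token_probs, token_targets)
    ) / len(token_targets)
    return lm + lam * association_loss(pred_assoc, true_assoc)


# Hypothetical values: one token position over a two-token vocabulary, and one
# knowledge-unit pair whose association strength in the graph is 0.6.
loss = total_loss(
    token_probs=[[0.5, 0.5]],
    token_targets=[0],
    pred_assoc=[0.8],
    true_assoc=[0.6],
    lam=0.5,
)
```

A larger `lam` pushes the shared representation harder toward reproducing the graph's edge weights, at the cost of the main generation objective; how that balance is set is not specified in the text.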
Acquiring and preprocessing the original corpus of a specific industry using the hybrid protection mechanism based on a trusted execution environment and differential privacy specifically comprises: in the data upload stage, the data provider performs vectorization and feature extraction on the original corpus inside a local trusted execution environment enclave; after an intermediate feature representation is generated, it is converted into ciphertext features using a distance-preserving encryption algorithm and uploaded to a cloud vector database, ensuring that the original plaintext data never leaves the local security boundary; and in the graph construction step of the model training stage, a differential privacy algorithm is applied to the entity relations extracted from the corpus, and calibrated Laplace noise is added to the forward relation weights of the hierarchical knowledge graph, so as to preserve the global structural utility of the graph as far as possible while protecting the privacy of individual data associations. Wherein, in the step of identifying and extracting industry knowledge units and the association relations between them based on the initial training corpus to construct a hierarchical knowledge graph reflecting knowledge depend