
CN-122021738-A - Method, device, program product and storage medium for constructing large model in ship industry


Abstract

A method, device, program product and storage medium for constructing a large model for the ship industry, relating to the technical field of artificial intelligence and natural language processing. The method comprises: obtaining multi-source heterogeneous raw data of the ship industry; parsing the raw data differentially according to its data format to obtain parsed data; performing multi-layer data governance on the parsed data to obtain governed data; proportioning the governed data based on a preset multi-dimensional data proportioning rule to obtain a ship industry training corpus; performing incremental pre-training on a preset base large model with the ship industry training corpus to obtain a target base model; constructing an instruction fine-tuning data set based on a preset ship industry downstream task library; and adjusting the target base model with the instruction fine-tuning data set to obtain the ship industry large model. Implementing the technical solution provided by the application can improve the accuracy of the ship industry large model on specific tasks.

Inventors

  • Hong Hongyan
  • Gui Aoran
  • PENG CHENYANG
  • WEI ZHIWEI
  • LI QIHAN
  • MIAO JING
  • Liu Ersen
  • ZENG XIAOGUANG
  • Lv Xiaohe
  • CHEN KAI
  • YANG YANQI
  • WU JIANKUN
  • CHEN BAIQUAN

Assignees

  • 北京中船咨询有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-11-25

Claims (10)

  1. A method for constructing a large model for the ship industry, characterized by comprising the following steps: acquiring multi-source heterogeneous raw data of the ship industry, and parsing the multi-source heterogeneous raw data differentially according to its data format to obtain parsed data; performing multi-layer data governance on the parsed data to obtain governed data, and proportioning the governed data based on a preset multi-dimensional data proportioning rule to obtain a ship industry training corpus; performing incremental pre-training on a preset base large model according to the ship industry training corpus to obtain a target base model; and constructing an instruction fine-tuning data set based on a preset ship industry downstream task library, and adjusting the target base model according to the instruction fine-tuning data set to obtain the ship industry large model.
  2. The method for constructing a large model for the ship industry according to claim 1, wherein performing multi-layer data governance on the parsed data to obtain governed data, and proportioning the governed data based on a preset multi-dimensional data proportioning rule to obtain a ship industry training corpus, comprises: performing basic cleaning on the parsed data to obtain cleaned data, and performing preset sensitive word filtering on the cleaned data to obtain desensitized data; calculating the similarity and quality scores of the desensitized data, and performing first screening on the desensitized data according to the similarity and the quality scores to obtain optimized data; calculating the proportion of each language in the optimized data, performing second screening on the optimized data according to each proportion to obtain screened data, and labeling the screened data according to preset data dimensions to obtain a labeled corpus; proportioning the labeled corpus based on the preset multi-dimensional data proportioning rule to obtain a proportioned corpus; and verifying the data proportion of each preset data dimension in the proportioned corpus, and adjusting the data proportion when its deviation exceeds a preset threshold, to obtain the ship industry training corpus.
  3. The method for constructing a large model for the ship industry according to claim 2, wherein calculating the similarity and quality scores of the desensitized data, and performing first screening on the desensitized data according to the similarity and the quality scores to obtain optimized data, comprises: calculating text fingerprints of the desensitized data, and determining at least one similarity group according to the text fingerprints, wherein each similarity group comprises at least two items of target desensitized data; acquiring the text length and the domain keyword proportion of the target desensitized data in each similarity group, and calculating the quality score of the target desensitized data in each similarity group according to the text length and the domain keyword proportion; and retaining the target desensitized data with the highest quality score in each similarity group to obtain the optimized data.
  4. The method for constructing a large model for the ship industry according to claim 1, wherein performing incremental pre-training on the preset base large model according to the ship industry training corpus to obtain a target base model comprises the following steps: selecting corpus data meeting preset quality standards from the ship industry training corpus, and truncating or padding each item of corpus data to a preset sequence length to obtain preprocessed corpus data; dividing the preprocessed corpus data into a plurality of data blocks, each data block corresponding to one language, injecting preset domain data of each language into the corresponding data block at a preset proportion, and randomly shuffling the indexes of the data blocks to obtain optimized training data; setting incremental pre-training parameters according to a preset hyperparameter configuration, and performing incremental pre-training on the preset base large model using the optimized training data and the incremental pre-training parameters; and, during incremental pre-training, calculating the perplexity of the base large model on an incremental validation set every preset number of steps, and stopping incremental pre-training when the perplexity has not decreased for a preset number of consecutive evaluations, to obtain the target base model.
  5. The method for constructing a large model for the ship industry according to claim 1, wherein constructing an instruction fine-tuning data set based on a preset ship industry downstream task library, and adjusting the target base model according to the instruction fine-tuning data set to obtain the ship industry large model, comprises: collecting raw task data related to ship industry downstream tasks from the preset ship industry downstream task library, and converting the raw task data into an instruction-input-output format to obtain formatted instruction data; performing quality screening and data augmentation on the formatted instruction data to obtain the instruction fine-tuning data set; dividing the instruction fine-tuning data set into a training set and a validation set at a preset proportion; performing instruction fine-tuning training on the target base model according to preset instruction fine-tuning parameters and the training set; and, during instruction fine-tuning training, evaluating the performance of the target base model on the validation set to obtain a performance index, and stopping the instruction fine-tuning training when the performance index reaches a preset standard, to obtain the ship industry large model.
  6. The method for constructing a large model for the ship industry according to claim 1, further comprising, after the ship industry large model is obtained: when a user task request is received, retrieving ship industry knowledge information related to the user task request from a preset ship industry knowledge base, combining the ship industry knowledge information with the user task request, and inputting the combination into the ship industry large model to obtain a task response based on the ship industry knowledge information; and, according to the content type of the task response, calling a corresponding business processing flow to process the ship industry downstream task, and generating a task result conforming to ship industry specifications.
  7. The method for constructing a large model for the ship industry according to claim 6, further comprising, after generating the task result conforming to ship industry specifications: constructing an evaluation data set based on the ship industry downstream task library; evaluating the performance of the ship industry large model using the evaluation data set to obtain a comprehensive evaluation result; and formulating an optimization scheme according to the comprehensive evaluation result, and optimizing the ship industry large model according to the optimization scheme until the comprehensive evaluation result reaches a preset standard.
  8. A device for constructing a ship industry large model, characterized in that it comprises one or more processors and a memory, the memory being coupled to the one or more processors and being used for storing computer program code comprising computer instructions, the one or more processors invoking the computer instructions to cause the device for constructing a ship industry large model to perform the method according to any of claims 1-7.
  9. A computer program product comprising instructions which, when run on a device for constructing a ship industry large model, cause the device to perform the method according to any of claims 1-7.
  10. A computer readable storage medium comprising instructions which, when run on a device for constructing a ship industry large model, cause the device to perform the method according to any of claims 1-7.
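
The first screening recited in claim 3 (text fingerprints, similarity groups, a quality score from text length and domain keyword proportion, and retention of the best item per group) could be sketched roughly as below. This is only an illustrative sketch under stated assumptions, not the application's implementation: the claims do not name a fingerprint algorithm, so SimHash is swapped in here, and the Hamming-distance threshold `max_distance`, the `DOMAIN_KEYWORDS` set, the length cap `max_len`, the equal weights and the whitespace-style tokenization are all hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical domain vocabulary; a real list would come from ship industry sources.
DOMAIN_KEYWORDS = {"hull", "ballast", "propeller", "keel", "classification"}


def tokens(text):
    """Lowercased word tokens (assumed tokenization, not from the source)."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """SimHash fingerprint: per-bit vote over the hashes of all tokens."""
    counts = [0] * bits
    for tok in tokens(text):
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def quality_score(text, max_len=2000, w_len=0.5, w_kw=0.5):
    """Weighted mix of capped text length and domain keyword proportion."""
    toks = tokens(text)
    if not toks:
        return 0.0
    length_part = min(len(text), max_len) / max_len
    keyword_part = sum(t in DOMAIN_KEYWORDS for t in toks) / len(toks)
    return w_len * length_part + w_kw * keyword_part


def deduplicate(docs, max_distance=3):
    """Group near-duplicate documents and keep the best-scoring one per group."""
    prints = [simhash(d) for d in docs]
    parent = list(range(len(docs)))  # union-find forest over document indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pairwise comparison: merge documents whose fingerprints are close enough.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if hamming(prints[i], prints[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(docs)):
        groups[find(i)].append(i)

    # A document with no near-duplicate forms a group of one and is kept as-is.
    kept = [max(members, key=lambda k: quality_score(docs[k]))
            for members in groups.values()]
    return [docs[k] for k in sorted(kept)]
```

The pairwise comparison is quadratic in the number of documents, which is acceptable for a sketch; a large corpus would more plausibly bucket fingerprints before comparing them.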

Description

Method, device, program product and storage medium for constructing large model in ship industry

Technical Field

The application relates to the technical field of artificial intelligence and natural language processing, and in particular to a method, device, program product and storage medium for constructing a large model for the ship industry.

Background

With the rapid development of global trade and the continuous growth of the marine economy, the ship industry, as an important carrier of international trade, faces unprecedented digital transformation challenges. The modern ship industry involves many complex links such as ship design, manufacture, operation and maintenance, and generates a large amount of professional data such as technical documents, operation manuals, regulations and standards, and fault records. Meanwhile, the rapid development of artificial intelligence technology, particularly large language models, provides a new technical path for the intelligent upgrading of various industries.

Currently, a variety of general-purpose large language models are applied to different vertical industry scenarios. These models are usually pre-trained on massive amounts of internet text data and have good language understanding and generation capabilities. In practice, researchers and businesses often use fine-tuning to adapt a general-purpose large model to the tasks of a particular industry in order to achieve better domain adaptability and task performance.

However, existing large models have significant limitations in ship industry applications. The ship industry has diverse data sources and complex formats, including technical drawings, inspection reports, operation logs, regulatory documents, equipment manuals and the like; these data differ in degree of structure, and their quality is uneven.
In the prior art, a uniform data processing approach is used to integrate this complex and varied industry data, so the quality and proportioning of the data fed to the model are unreasonable, and the accuracy of the model on specific ship industry tasks is insufficient.

Disclosure of Invention

The application provides a method, device, program product and storage medium for constructing a large model for the ship industry, which can improve the accuracy of the ship industry large model on specific tasks.

In a first aspect of the present application, a method for constructing a large model for the ship industry is provided, specifically comprising: acquiring multi-source heterogeneous raw data of the ship industry, and parsing the multi-source heterogeneous raw data differentially according to its data format to obtain parsed data; performing multi-layer data governance on the parsed data to obtain governed data, and proportioning the governed data based on a preset multi-dimensional data proportioning rule to obtain a ship industry training corpus; performing incremental pre-training on a preset base large model according to the ship industry training corpus to obtain a target base model; and constructing an instruction fine-tuning data set based on a preset ship industry downstream task library, and adjusting the target base model according to the instruction fine-tuning data set to obtain the ship industry large model.

By adopting the above technical solution, first, differential parsing of the multi-source heterogeneous raw data effectively handles the problem of parsing data in different formats, and multi-layer data governance together with processing based on the multi-dimensional data proportioning rule improves the data quality and the rationality of the proportions of the training corpus.
Second, incremental pre-training of the base large model on the high-quality ship industry training corpus gives the model a solid foundation of ship domain knowledge. Finally, instruction fine-tuning based on the preset downstream task library further enhances the performance of the model on specific ship industry tasks, thereby effectively improving the accuracy of the ship industry large model in practical applications.

Optionally, performing multi-layer data governance on the parsed data to obtain governed data, and proportioning the governed data based on a preset multi-dimensional data proportioning rule to obtain a ship industry training corpus, comprises: performing basic cleaning on the parsed data to obtain cleaned data, and performing preset sensitive word filtering on the cleaned data to obtain desensitized data; calculating the similarity and quality scores of the desensitized data, and performing first screening on the desensitized data according to the similarity and the quality scores to obtain optimized data; calculating the proportion of each langua