CN-120872408-B - Method and system for constructing a production-ready code-interpretation large model
Abstract
The invention provides a method and a system for constructing a production-ready code-interpretation large model. The method comprises: selecting a base model for training the code-interpretation large model; selecting a scoring tool that evaluates the final effect of the code-interpretation large model from a business perspective; constructing a code-interpretation knowledge graph based on GraphRAG; and constructing the code-interpretation large model by a post-training procedure of SFT fine-tuning. The method helps developers better grasp the logic of the source code from a technical perspective, while helping business staff understand the whole end-to-end logic from the dimension of the transaction, thereby providing the underlying capability for enabling scenarios such as a code-reading assistant AI Agent product and bidirectional synchronization between development-process platform assets and code assets.
Inventors
- WANG WEIWEI
- ZOU WEIJIE
- MIN JIYONG
Assignees
- 深圳市长亮科技股份有限公司 (Shenzhen Sunline Tech Co., Ltd.)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-07-16
Claims (5)
- 1. A method of constructing a production-ready code-interpretation large model, the method comprising: selecting a base model for training the code-interpretation large model; selecting a scoring tool that evaluates the final effect of the code-interpretation large model from a business perspective; constructing a code-interpretation knowledge graph based on GraphRAG, namely: selecting the parquet files that meet the requirements from the original directory output by GraphRAG into a curated GraphRAG output directory; batch-converting all parquet files in the curated GraphRAG output directory to csv files; preprocessing the data sources text_units_to_entities_ids.csv and text_units_to_relationship_ids.csv, which serve as Edge data, so that their data format is adapted to import by the nebula-importer tool; calling the nebula-importer tool from a Python program based on each yaml configuration file to import the csv data into NebulaGraph; and replacing all placeholders in the NebulaGraph import template files to convert them into configuration files interpretable by the import tool nebula-importer; and constructing the code-interpretation large model from the base model by a post-training procedure of SFT fine-tuning, which comprises: using the open-source large-model fine-tuning framework LlamaFactory together with an instruction-supervised fine-tuning dataset to perform SFT fine-tuning on the selected qwen2-72b-instruct base model, so that the base model learns the answering pattern in the instruction-supervised fine-tuning dataset, namely the interpretation style and answer format for code, yielding a first-edition code-interpretation large model; running code-interpretation question-answering experiments on the first-edition model with a test set to find badcases appearing in the question answering, i.e. answers that do not meet the
expectations of business experts, while obtaining the standard answer corresponding to each question; constructing a preference dataset from the questions, badcases and standard answers, and performing DPO fine-tuning on the first-edition code-interpretation large model with the open-source fine-tuning framework LlamaFactory and the preference dataset, so that the model learns user preferences, yielding the DPO code-interpretation large model; and solving the repeater problem during inference of the code-interpretation large model by penalizing repetition with DPO, which comprises: generalizing the examples of repeated output; since the generalized examples also repeat, putting the correct answer (chosen) and the repeated answer (rejected) of each repeated badcase into a DPO dataset to construct a DPO preference dataset, where for the generalized examples the standard answer, i.e. the non-repeated output, is set as chosen and the repeated output of the repeater is set as rejected; constructing counter-examples whose repetition counts are powers of 2; and obtaining a DPO-optimized code-interpretation large model by fine-tuning the model currently exhibiting the repeater problem with the LlamaFactory framework and the DPO preference dataset.
- 2. The method of claim 1, wherein the repeater problem during inference of the code-interpretation large model is also solved by means of beam search.
- 3. The method of claim 2, wherein the beam search is configured in the hyperparameters as: beam search use_beam_search=true, temperature coefficient temperature=0, nucleus-sampling threshold top_p=1, candidate-set size top_k=-1, number of candidate sequences best_of=5, and early termination early_stop=true.
- 4. The method according to any of claims 1-3, wherein the architecture of the code-interpretation large model comprises: input data comprising source code and XML configuration files, wherein the source code is java files implementing business logic, service functions, exception handling and data-interaction standards, and the XML configuration files, i.e. metadata models, are the XML files in the project that define transaction flows, service functions, table structures and error-code metadata types; output data comprising PlantUML-format text conforming to standard PlantUML flow-chart syntax, draw.io structured data conforming to the draw.io XML or JSON data format, and a manim animation script, i.e. a Python code file using the manim library; a preprocessing layer that standardizes the semi-structured text, extracts core events, actions and object relations, and performs keyword extraction and structural decomposition based on regular expressions (regex); a semantic-construction layer that calls an LLM to generate an interpretation result from the preprocessed structured text; a format-mapping layer that maps the text into the syntax structures required by PlantUML, draw.io and manim according to the output-format grammar; and an output-generation layer that finally serializes the text into the corresponding file formats, performing syntax splicing and file packaging according to the target format.
- 5. A system for constructing a production-ready code-interpretation large model, comprising a processor and a memory, the memory storing program code and transmitting the program code to the processor, the processor executing the method of any of claims 1-4 in accordance with instructions in the program code.
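Claim 1's step of "replacing all placeholders of the NebulaGraph import template files" and then driving nebula-importer from a Python program can be sketched as follows. This is a minimal sketch only: the template layout, the placeholder names ({{ADDRESS}}, {{SPACE}}, {{CSV_PATH}}, {{EDGE_NAME}}) and the `run_importer` helper are illustrative assumptions, not taken from the patent, and the real nebula-importer yaml schema should be checked against its documentation.

```python
import subprocess
from pathlib import Path

# Illustrative nebula-importer-style template with {{KEY}} placeholders.
# The structure is an assumption for the sketch, not the tool's exact schema.
TEMPLATE = """\
client:
  version: v3
  address: "{{ADDRESS}}"
  user: root
  password: nebula
manager:
  spaceName: {{SPACE}}
sources:
  - path: {{CSV_PATH}}
    csv:
      withHeader: true
    edges:
      - name: {{EDGE_NAME}}
"""

def render_config(template: str, replacements: dict) -> str:
    """Replace every {{KEY}} placeholder in the import template,
    mirroring the patent's placeholder-replacement step."""
    for key, value in replacements.items():
        template = template.replace("{{" + key + "}}", value)
    return template

def run_importer(config_path: Path) -> None:
    """Invoke the nebula-importer CLI on a rendered config
    (the tool must be installed and on PATH)."""
    subprocess.run(["nebula-importer", "--config", str(config_path)], check=True)

if __name__ == "__main__":
    cfg = render_config(TEMPLATE, {
        "ADDRESS": "127.0.0.1:9669",
        "SPACE": "code_kg",
        "CSV_PATH": "text_units_to_entities_ids.csv",
        "EDGE_NAME": "HAS_ENTITY",
    })
    Path("edge_import.yaml").write_text(cfg, encoding="utf-8")
    run_importer(Path("edge_import.yaml"))
```

In the patent's pipeline one such yaml file would be rendered per csv data source, with the Python program looping over them and calling the importer for each.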
Description
Method and system for constructing a production-ready code-interpretation large model

Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for constructing a code-interpretation large model usable in production.

Background
With the rapid development of artificial intelligence technology, the field of natural language processing has achieved remarkable results. Various large models trained on natural-language text are continually emerging, such as GPT and BERT, which excel in applications such as text generation, question-answering systems and machine translation, and have greatly promoted the development and adoption of natural-language processing technology. However, in the field of code intelligence, and especially in code interpretation, large models specific to code interpretation are relatively rare. Traditional large models focus on scenarios such as semantic recognition and intelligent question answering, and systematic support for code interpretation is lacking. Therefore, how to construct a large model for code interpretation has become a technical problem to be solved.

Disclosure of Invention
To solve the above problems, an object of the present invention is to provide a method for constructing a production-ready code-interpretation large model. The invention provides such a construction method, which comprises: selecting a base model for training the code-interpretation large model; selecting a scoring tool that evaluates the final effect of the model from a business perspective; constructing a code-interpretation knowledge graph based on GraphRAG; and constructing the code-interpretation large model from the base model by a post-training procedure of SFT fine-tuning.
Optionally, constructing the code-interpretation large model from the base model by the post-training procedure of SFT fine-tuning comprises: using the open-source large-model fine-tuning framework LlamaFactory together with an instruction-supervised fine-tuning dataset to perform SFT fine-tuning on the selected qwen2-72b-instruct base model, so that the base model learns the answering pattern in the instruction-supervised fine-tuning dataset, namely the interpretation style and answer format for code, yielding the first-edition code-interpretation large model; running code-interpretation question-answering experiments on the first-edition model with a test set to find badcases, i.e. answers that do not meet the expectations of business experts, while obtaining the standard answer corresponding to each question; constructing a preference dataset from the questions, badcases and standard answers; performing DPO fine-tuning on the first-edition model with the LlamaFactory framework so that it learns user preferences, yielding the post-DPO code-interpretation large model; and running the question-answering experiments on the post-DPO model with the test set again, iterating this evaluation and optimization until the final optimized DPO code-interpretation large model is obtained.
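Claim 1's repeater counter-measure, building DPO preference pairs whose rejected answers repeat a standard answer a power-of-2 number of times, can be sketched as below. The record fields (instruction/input/chosen/rejected) assume LlamaFactory's alpaca-style pairwise dataset layout and must be registered in its dataset_info.json; the question text, helper name and max_power default are illustrative assumptions.

```python
import json

def make_repeater_pairs(question: str, standard_answer: str,
                        max_power: int = 4) -> list:
    """Build DPO preference pairs that penalise the repeater failure mode.

    For each pair the chosen answer is the non-repeated standard answer and
    the rejected answer repeats it 2**k times (k = 1..max_power), following
    the patent's idea of counter-examples with power-of-2 repetition counts.
    """
    pairs = []
    for k in range(1, max_power + 1):
        pairs.append({
            "instruction": question,
            "input": "",
            "chosen": standard_answer,
            "rejected": standard_answer * (2 ** k),  # repeated output
        })
    return pairs

pairs = make_repeater_pairs(
    "Explain what this Java method does.",
    "It validates the input and returns a result. ",
)

# Serialise for LlamaFactory's DPO training stage (filename is illustrative).
with open("dpo_repeater.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)
```

In the patent's workflow these synthetic pairs would be merged with the badcase-derived preference data before the DPO fine-tuning run.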
Optionally, constructing the code-interpretation knowledge graph based on GraphRAG includes: picking the parquet files that meet the requirements from the original directory output by GraphRAG into a curated GraphRAG output directory; batch-converting all parquet files in the curated GraphRAG output directory to csv files; preprocessing the data sources text_units_to_entities_ids.csv and text_units_to_relationship_ids.csv so that their data format is adapted to import by the nebula-importer tool; calling the nebula-importer tool from a Python program based on each yaml configuration file to import the csv data into NebulaGraph; and replacing all placeholders in the NebulaGraph import template files to convert them into configuration files interpretable by the import tool nebula-importer. Optionally, the repeater problem during inference of the code-interpretation large model is solved by means of beam search: use_beam_search=true, temperature=0, top_p=1, top_k=-1, best_of=5 and early_stop=true are configured in the hyperparameters. Optionally, solving the repeater problem during inference by penalizing repetition with DPO comprises generalizing an example of repeated output; because the generalized example also repeats, the correct answer chosen and repeated answer