CN-121979506-A - Code completion model training method and device based on semantic attribution

Abstract

The application provides a semantic attribution-based code completion model training method and device. The method comprises: obtaining training code; identifying a plurality of first variables from the training code; deleting a target code segment from the training code to obtain an initial context; identifying second variables from the plurality of first variables according to the initial context, the second variables being first variables that have declaration statements in the initial context but are not referenced there; deleting the declaration statements of the second variables from the initial context to obtain a target context; and training an initial model on the target context and the target code segment to obtain a code completion model. The application can overcome the deterministic constraint, so that the predicted code segment output by the model conforms to real business logic and code prediction accuracy is improved.

Inventors

NING WEI
JIANG SIYUAN
LIU YANG
LI GE

Assignees

北京硅心科技有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-04-08

Claims (9)

1. A semantic attribution-based code completion model training method, the method comprising: acquiring training code, wherein the training code comprises a pre-code segment, a post-code segment, and a target code segment located between the pre-code segment and the post-code segment; identifying a plurality of first variables from the training code, the first variables being variables that have declaration statements in the training code and are referenced in the training code; deleting the target code segment from the training code to obtain an initial context, wherein the initial context comprises the pre-code segment and the post-code segment; identifying a second variable from the plurality of first variables according to the initial context, the second variable being a first variable that has a declaration statement in the initial context but is not referenced in the initial context; deleting the declaration statement of the second variable from the initial context to obtain a target context; and performing model training on an initial model according to the target context and the target code segment to obtain a code completion model.
2. The method of claim 1, wherein identifying a second variable from the plurality of first variables according to the initial context comprises: identifying a plurality of third variables from the initial context, the third variables being variables that have declaration statements in the initial context and are referenced in the initial context; and performing a set-difference calculation on the plurality of first variables and the plurality of third variables, and determining the second variable according to the result of the set-difference calculation.
3. The method of claim 2, wherein deleting the declaration statement of the second variable in the initial context to obtain a target context comprises: deleting the declaration statements of a first preset percentage of the second variables in the initial context; and deleting the declaration statements of a second preset percentage of the third variables in the initial context, wherein the first preset percentage is greater than the second preset percentage.
4. The method according to claim 1 or 2, wherein identifying a plurality of first variables from the training code comprises: identifying a plurality of identifiers from the training code; for any identifier, determining the identifier to be a candidate variable if the identifier satisfies none of a plurality of conditions, wherein the plurality of conditions comprise: the identifier is a predefined identifier, the identifier is a struct field name, and the identifier is a suffix identifier in a selector chain; and if the candidate variable has a declaration statement in the training code, determining the candidate variable to be a first variable.
5. The method according to claim 1 or 2, wherein performing model training on the initial model according to the target context and the target code segment to obtain a code completion model comprises: calling the initial model to output a predicted code segment according to the target context; calculating a loss value between the predicted code segment and the target code segment; if the loss value does not satisfy a preset convergence condition, adjusting model parameters of the initial model according to the loss value to obtain a new initial model, and repeating, for the new initial model, the step of calling the initial model to output a predicted code segment according to the target context; and if the loss value satisfies the preset convergence condition, obtaining the code completion model.
6. The method according to claim 1 or 2, wherein the training code is Go language code.
7. A semantic attribution-based code completion model training apparatus, the apparatus comprising: a training code acquisition module, configured to acquire training code, wherein the training code comprises a pre-code segment, a post-code segment, and a target code segment located between the pre-code segment and the post-code segment; a first variable identification module, configured to identify a plurality of first variables from the training code, wherein the first variables are variables that have declaration statements in the training code and are referenced in the training code; an initial context generating module, configured to delete the target code segment from the training code to obtain an initial context, wherein the initial context comprises the pre-code segment and the post-code segment; a second variable identification module, configured to identify a second variable from the plurality of first variables according to the initial context, wherein the second variable is a first variable that has a declaration statement in the initial context but is not referenced in the initial context; a target context generating module, configured to delete the declaration statement of the second variable from the initial context to obtain a target context; and a model training module, configured to perform model training on an initial model according to the target context and the target code segment to obtain a code completion model.
8. A computer device, comprising: a memory and a processor in communication with each other, the memory having stored therein computer instructions that, when executed by the processor, perform the semantic attribution-based code completion model training method of any one of claims 1-6.
9. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the semantic attribution-based code completion model training method of any one of claims 1-6.
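As an illustration of claims 2 and 3, the following Python sketch derives second variables by set difference and then selects which declaration statements to delete. It is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the `declared_in_initial` narrowing step, the rounding rule, and the seeded random sampling are all hypothetical, since the claims state only that a difference is computed and that two preset percentages are applied.

```python
import random


def second_variables(first_vars, third_vars, declared_in_initial):
    """Set difference of first and third variables (claim 2).

    Assumption: the difference is narrowed to names that still have a
    declaration statement in the initial context, so variables declared
    inside the deleted target segment are excluded.
    """
    return (set(first_vars) - set(third_vars)) & set(declared_in_initial)


def choose_declarations_to_delete(second_vars, third_vars,
                                  p_second, p_third, seed=0):
    """Pick the variables whose declaration statements are deleted (claim 3).

    Assumptions: percentages are fractions in [0, 1], counts are obtained
    with round(), and the choice is a seeded uniform random sample.
    """
    if not (0.0 <= p_third < p_second <= 1.0):
        raise ValueError("expected 0 <= p_third < p_second <= 1")
    rng = random.Random(seed)

    def pick(names, fraction):
        ordered = sorted(names)  # sort first so the sample is reproducible
        return set(rng.sample(ordered, round(len(ordered) * fraction)))

    return pick(second_vars, p_second) | pick(third_vars, p_third)


# Hypothetical variable sets for one training sample.
first = {"a", "b", "c", "d", "e"}      # declared and referenced in the full code
third = {"c", "d"}                     # declared and referenced in the initial context
declared_in_initial = {"a", "b", "c", "d"}  # "e" is declared inside the target segment

second = second_variables(first, third, declared_in_initial)
to_delete = choose_declarations_to_delete(second, third, p_second=1.0, p_third=0.5)
```

With these inputs, `second` contains `a` and `b`, and `to_delete` contains both of them plus one of the third variables. Keeping the first percentage larger than the second means that unreferenced declarations are removed more aggressively than referenced ones, as claim 3 requires.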

Description

Code completion model training method and device based on semantic attribution

Technical Field

The application relates to the technical field of large models, and in particular to a semantic attribution-based code completion model training method and device.

Background

Go is a statically compiled language whose specification mandates that a declared variable must be used; otherwise a compilation error is triggered. Although this design improves the quality of compiled code, it poses a challenge for code completion models.

Existing code completion models commonly employ a "prefix-middle-suffix" training paradigm: a complete piece of code is divided into prefix, middle, and suffix parts, and the model is trained to predict the middle from the prefix and the suffix. Because all of the training data conforms to the language specification, variable declarations are strongly correlated with variable uses, and the model learns the deterministic pattern of "declared, therefore used".

In a practical inference scenario, however, the model can see only the prefix that contains the variable declaration, and Go's compilation constraint acts as a strong hint leak: the model knows that the middle must contain a reference to the variable in order to compile. This reduces the entropy of the conditional distribution and makes the outputs tend toward homogeneity. Under this deterministic constraint the model favors high-frequency but low-precision usage patterns over code that conforms to the real business logic, which lowers prediction accuracy.

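The three-part split described above, and the hint leak it creates, can be illustrated with a small Python sketch. The Go snippet, the line-based `split_fim` helper, and the whole-word `mentions` counter are assumptions made for illustration; the application does not specify how the split is performed.

```python
import re

# Hypothetical Go source used as a training sample.
GO_SOURCE = """package main

import "fmt"

func main() {
    prices := []int{3, 5, 8}
    total := 0
    for _, p := range prices {
        total += p
    }
    fmt.Println(total)
    fmt.Println("done")
}
"""


def split_fim(code, start_line, end_line):
    """Split code into (prefix, middle, suffix) by 0-based line numbers.

    Lines in [start_line, end_line) form the middle that the model predicts.
    """
    lines = code.splitlines(keepends=True)
    prefix = "".join(lines[:start_line])
    middle = "".join(lines[start_line:end_line])
    suffix = "".join(lines[end_line:])
    return prefix, middle, suffix


def mentions(name, text):
    """Count whole-word occurrences of name in text."""
    return len(re.findall(rf"\b{re.escape(name)}\b", text))


# The loop and the first Println become the middle.
prefix, middle, suffix = split_fim(GO_SOURCE, 7, 11)

# `total` is declared in the prefix and used only in the middle, so the
# visible context alone reveals that the middle must reference it.
leaked = mentions("total", prefix) == 1 and mentions("total", suffix) == 0
```

Here `leaked` is true: the only occurrence of `total` outside the middle is its declaration, which is exactly the situation the Background describes.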
Disclosure of Invention

In view of the above, the application provides a semantic attribution-based code completion model training method and device, to solve the problem in the related art that, because of the compilation constraint of the Go language, a code completion model is subject to deterministic hint leakage during inference and its code prediction accuracy is reduced.

An embodiment of a first aspect of the present application provides a semantic attribution-based code completion model training method, which includes:

acquiring training code, wherein the training code comprises a pre-code segment, a post-code segment, and a target code segment located between the pre-code segment and the post-code segment;

identifying a plurality of first variables from the training code, the first variables being variables that have declaration statements in the training code and are referenced in the training code;

deleting the target code segment from the training code to obtain an initial context, wherein the initial context comprises the pre-code segment and the post-code segment;

identifying a second variable from the plurality of first variables according to the initial context, the second variable being a first variable that has a declaration statement in the initial context but is not referenced in the initial context;

deleting the declaration statement of the second variable from the initial context to obtain a target context; and

performing model training on an initial model according to the target context and the target code segment to obtain a code completion model.
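The steps of the first aspect, up to the construction of the target context, can be sketched in Python as follows. This is a simplified illustration rather than the method as implemented by the applicant: all helper names are hypothetical, declarations are recognized by a regular expression that only handles single-name `x := ...` and `var x ...` statements, and string literals and comments are not skipped. A production version would presumably rely on a Go parser instead.

```python
import re

# Assumption: only single-name `var x ...` and `x := ...` declarations count.
DECL_RE = re.compile(r"^\s*(?:var\s+(\w+)\b|(\w+)\s*:=)")


def declared_name(line):
    """Return the variable declared on this line, or None."""
    m = DECL_RE.match(line)
    return (m.group(1) or m.group(2)) if m else None


def count_refs(name, lines):
    """Count whole-word uses of name outside its own declaration line."""
    pat = re.compile(rf"\b{re.escape(name)}\b")
    return sum(len(pat.findall(l)) for l in lines if declared_name(l) != name)


def referenced_declared(lines):
    """Variables with a declaration statement and at least one reference."""
    names = {declared_name(l) for l in lines} - {None}
    return {n for n in names if count_refs(n, lines) > 0}


def build_target_context(pre, target, post):
    """Return (target_context_lines, second_variables) for one sample."""
    first = referenced_declared(pre + target + post)  # first variables
    initial = pre + post                              # target segment deleted
    declared_in_initial = {declared_name(l) for l in initial} - {None}
    # Second variables: declared in the initial context, never referenced there.
    second = {n for n in first
              if n in declared_in_initial and count_refs(n, initial) == 0}
    # Drop the declaration statements of the second variables.
    context = [l for l in initial if declared_name(l) not in second]
    return context, second


# Hypothetical Go sample, already split into the three segments.
pre = ["func main() {", "\tprices := []int{3, 5, 8}", "\ttotal := 0", "\tlimit := 10"]
target = ["\tfor _, p := range prices {", "\t\ttotal += p", "\t}", "\tfmt.Println(total)"]
post = ["\tfmt.Println(limit)", "}"]

context, second = build_target_context(pre, target, post)
```

In this example `prices` and `total` are second variables, so their declarations are removed, while `limit` stays because the post-code segment still references it. The pair (`context`, `target`) would then serve as one training sample.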
According to the embodiment of the application, the second variables are identified from the plurality of first variables according to the initial context, the declaration statements of the second variables are deleted from the initial context to obtain the target context, and the target context is used for model training. This ensures that the target context used for training contains no variable that has a declaration statement but is not referenced, which overcomes the deterministic constraint. It thereby addresses the problem that the model, knowing that the predicted code segment must contain variable references in order to compile, produces a conditional distribution with reduced entropy and homogeneous outputs with low prediction accuracy, so that the code completed by the model conforms to the real business logic.

In an embodiment of the present application, identifying a second variable from the plurality of first variables according to the initial context includes: identifying a plurality of third variables from the initial context, the third variables being variables that have declaration statements in the initial context and are referenced in the initial context; and performing a set-difference calculation on the plurality of first variables and the plurality of third variables, and determining the second variable according to the result of the set-difference calculation.

In an embodiment of the present application, deleting the declaration statement of the second variable in the initial context to obtain the target context includes: deleting the declaration statements of a first preset percentage of the second variables in the initial context; deleting declaration statements of the third variable with a se