CN-122018918-A - High-density completion input construction and completion method for warehouse-level long code

CN122018918A

Abstract

The invention discloses a high-density completion input construction and completion method for warehouse-level long code. The method first obtains the current code segment and the background long-code context near the position to be completed, performs structural analysis on the context, and partitions it into candidate code units, with the unit containing the position to be completed serving as the query code unit. Multi-channel relevance retrieval is then performed over the candidate code units and the results are screened to obtain a candidate subset; the subset undergoes function-level or class-level importance evaluation and reordering, followed by adaptive truncation according to the attenuation relation of importance gains between adjacent candidates. The retained candidates are further segmented into a sequence of fine-grained semantic units, which are grouped into functional blocks; the functional blocks are evaluated for importance, selected or clipped under a budget constraint, and combined with the current code segment as the completion input. This improves information density and task relevance, reduces inference cost and response latency, and improves the stability and usability of warehouse-level long-code completion.
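The segmentation step described in the abstract (partitioning the background context into candidate code units at declaration boundaries) can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `split_candidate_units` and the `def `/`class ` regex are assumptions standing in for the regex- or grammar-rule matching the claims describe.

```python
import re

# Illustrative sketch: split a background source file into candidate code
# units, where each unit spans from one top-level declaration start to the
# next. The pattern and names are assumptions for illustration only.
DECL_RE = re.compile(r"^(def |class )", re.MULTILINE)

def split_candidate_units(source: str) -> list[str]:
    """Return candidate code units bounded by declaration start positions."""
    starts = [m.start() for m in DECL_RE.finditer(source)]
    if not starts:
        return [source]
    units = []
    for i, s in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(source)
        units.append(source[s:end])
    return units

sample = "def a():\n    return 1\n\nclass B:\n    pass\n\ndef c():\n    return 3\n"
units = split_candidate_units(sample)
print(len(units))  # 3 candidate units: a, B, c
```

A production system would, per the claims, use a syntax analysis tool (abstract syntax tree node boundaries) rather than a regex when precise declaration boundaries are needed.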

Inventors

  • WANG XINGQI
  • BAI ZHILI

Assignees

  • Hangzhou Dianzi University (杭州电子科技大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-28

Claims (10)

  1. A high-density completion input construction method for warehouse-level long code, characterized by: obtaining the current code segment corresponding to the position to be completed and the corresponding background long-code context; using the current code segment as query information, evaluating the relevance between candidate code units and the query code unit through different indexes, and de-duplicating and merging the evaluation results of the different relevance indexes to obtain a candidate subset; characterizing the completion contribution of each candidate code unit to the current code segment with a large-language-model scoring signal and reordering accordingly, incorporating candidate code units one by one in the reordered sequence, computing the marginal contribution gain, and stopping further incorporation when the marginal contribution gain meets a preset attenuation termination condition, thereby obtaining a truncated retained candidate set; under the constraint of the context length budget, mapping the retained candidate set into fine-grained semantic units and selecting or clipping them to generate a compressed context; and splicing the compressed context with the current code segment to form the high-density completion input.
  2. The method for constructing high-density completion input for warehouse-level long code as set forth in claim 1, wherein said background long-code context is derived from other locations in the same code file, from other files in the same warehouse that have a dependency relationship with the current file, or is pre-aggregated by a warehouse-level indexing mechanism.
  3. The method for constructing high-density completion input for warehouse-level long code as recited in claim 1, wherein regular-expression matching or grammar-rule matching is adopted to identify the declaration start positions of functions or classes in the long-code context, and the code between one declaration start position and the next is divided into a candidate code unit.
  4. The method for constructing high-density completion input for warehouse-level long code as recited in claim 1, wherein the abstract syntax structure of the background long code is constructed using a syntax analysis tool, and candidate code unit segmentation is performed based on declaration node boundaries.
  5. The method for constructing high-density completion input for warehouse-level long code as recited in claim 1, wherein the relevance indexes comprise term matching, semantic representation similarity, and code structure or symbol dependency.
  6. The method for constructing high-density completion input for warehouse-level long code as recited in claim 1, wherein the large-language-model scoring signal is AMI(c, q): AMI(c, q) = PPL(q) - PPL(q|c), where c denotes a candidate code unit, q denotes the query code unit, PPL(q) denotes the perplexity of the language model on the query code unit q without introducing the candidate code unit c, and PPL(q|c) denotes the perplexity of the language model on the query code unit q when the candidate code unit c is introduced as conditional information.
  7. The method for constructing high-density completion input for warehouse-level long code as recited in claim 6, wherein the marginal contribution gain is: r_i = AMI(c_{i+1}, q) / AMI(c_i, q), where r_i represents the marginal contribution gain at the i-th candidate code unit c_i, and AMI(c_i, q) and AMI(c_{i+1}, q) represent the large-language-model scoring signals of the adjacent candidate code units c_i and c_{i+1}, respectively.
  8. The method for constructing high-density completion input for warehouse-level long code as recited in claim 1, wherein the fine-grained semantic units are code lines or grammar fragments obtained by partitioning based on the abstract syntax structure; taking contrastive perplexity as the fine-grained importance measure, functional-block boundaries are detected by analyzing the score change of adjacent fine-grained semantic units under the condition of the language model, and a boundary position is determined when the score change between adjacent fine-grained semantic units exceeds a preset threshold, so as to divide a candidate code unit into a plurality of functional blocks; contrastive perplexity scores between each functional block and the query information are then computed respectively and taken as the importance scores of the functional blocks; under the constraint of the target context length budget, functional blocks with higher importance scores are preferentially retained, and low-scoring functional blocks are clipped or discarded.
  9. A high-density completion method for warehouse-level long code, characterized in that, for a position to be completed in the warehouse-level long code, the method according to any one of claims 1-8 is used to generate a high-density completion input, which is then fed into a code large language model to obtain the completion result for the position to be completed.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-8.
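Claims 6 and 7 can be illustrated with a small numeric sketch: AMI(c, q) = PPL(q) - PPL(q|c) measures how much a candidate reduces the model's perplexity on the query unit, and candidates are incorporated in reranked order until the gain ratio r_i = AMI(c_{i+1}, q) / AMI(c_i, q) falls below a threshold. The perplexity values and the 0.5 threshold below are invented placeholders for illustration; a real system would obtain the PPL values from a language model.

```python
# Illustrative sketch of claims 6-7: AMI scoring and attenuation-based
# truncation. All numeric values and the 0.5 threshold are invented
# placeholders; a real system queries a language model for perplexities.
def ami(ppl_q: float, ppl_q_given_c: float) -> float:
    # AMI(c, q) = PPL(q) - PPL(q|c): perplexity drop when c is conditioned on
    return ppl_q - ppl_q_given_c

def truncate_by_attenuation(scores: list[float], tau: float = 0.5) -> list[float]:
    """Keep candidates (already sorted by AMI, descending) until the
    marginal gain r_i = AMI(c_{i+1}, q) / AMI(c_i, q) drops below tau."""
    kept = scores[:1]
    for i in range(len(scores) - 1):
        r_i = scores[i + 1] / scores[i]
        if r_i < tau:       # preset attenuation termination condition
            break
        kept.append(scores[i + 1])
    return kept

ppl_q = 12.0                      # perplexity of the query unit alone
ppl_with = [4.0, 6.0, 9.5, 11.0]  # PPL(q|c) for each reranked candidate
scores = [ami(ppl_q, p) for p in ppl_with]  # [8.0, 6.0, 2.5, 1.0]
print(truncate_by_attenuation(scores))      # [8.0, 6.0]
```

Here the third candidate's gain ratio (2.5 / 6.0 ≈ 0.42) falls below the threshold, so only the first two candidates enter the retained set, matching the claim's early-stopping behavior.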

Description

High-density completion input construction and completion method for warehouse-level long code

Technical Field

The invention belongs to the field of computer technology, relates to artificial intelligence and software engineering, and in particular relates to a high-density completion input construction and completion method for warehouse-level long code.

Background

As the application of large code models in integrated development environments continues to deepen, warehouse-level code completion has gradually become an important technique for improving development efficiency. In a warehouse-level completion scenario, the model needs to refer not only to the local code segment near the position to be completed, but also to background code information from the same code warehouse, in order to understand the semantic environment, type constraints, calling relationships, and cross-file dependencies of the completion point, so that a semantically consistent and compilable completion result is generated. Compared with traditional single-file completion, warehouse-level completion depends more strongly on the input context and on broader background information, so the completion input length tends to increase significantly. However, there is usually an upper limit on the context length that existing large language models can handle in a single pass, and when the background code is large, directly feeding the complete background code into the model can hardly meet practical deployment requirements.
On the one hand, the inference process of the completion model generally performs operations such as attention computation over the input sequence; as the input length grows, GPU memory occupation and computation cost rise markedly and inference latency increases, which degrades the real-time responsiveness, concurrency capacity, and user experience of interactive code completion. On the other hand, even when some models support longer context windows, overly long input reduces the model's utilization of key information, and phenomena such as insufficient attention to key information in the middle of a long sequence occur, making the completion result unstable. Furthermore, with the popularity of commercial deployment and API calls, longer input also means higher call costs and greater computational expense, so simply "stacking context" is not a sustainable engineering practice. Therefore, reducing the completion input length, improving input information density, and controlling inference cost without sacrificing completion quality are key problems that a warehouse-level code completion system must solve. For these problems, the prior art generally truncates the background code directly, retaining only a partial context close to the completion position to construct the completion input; however, this approach does not fully consider the cross-function and cross-file dependencies that commonly exist in a code warehouse, and definition, call, or constraint information strongly related to the completion point is easily truncated and lost, affecting the correctness and stability of the completion result.
To alleviate the information loss caused by direct truncation, some schemes attempt to prune the background code through heuristic rules or static statistical features, for example by identifying redundant statements, repeated structures, or low-value fragments and deleting or merging them. This can reduce the input scale to a certain extent, but because it relies mainly on predefined rules or static features, it is difficult to dynamically perceive which information is truly useful for the current completion, and under different projects, different code styles, and different completion scenarios, critical context is easily deleted by mistake or irrelevant fragments are retained, making the completion effect unstable. In addition, research has also evaluated the importance of candidate code segments using large-language-model scoring signals and performed ranking and selection accordingly. Although this exploits the model's internal understanding of the completion task to a certain extent, the model can still be misled by template code, naming styles, or syntactic structural similarity, misjudging noise fragments weakly related or even unrelated to the completion task as high-value context, which affects the stability of the completion effect; moreover, when the candidate set is large, repeatedly executing model scoring incurs high computational overhead and inference latency, making it difficult to meet the efficiency and cost requirements of interactive code completion. Therefore, the prior ar