CN-122021650-A - Vector characterization method of code block and electronic equipment
Abstract
The application provides a vector characterization method for code blocks and an electronic device, and relates to the technical field of data processing. The method comprises: obtaining target source code and constructing an abstract syntax tree of the target source code; identifying a plurality of atomic semantic units in the target source code based on the abstract syntax tree; identifying context environment information of each atomic semantic unit based on a pre-constructed global dependency reference graph and the abstract syntax tree, wherein the context environment information comprises host context information, data context information and dependency context information; and generating an enhanced virtual code block for each atomic semantic unit according to its context environment information and storing the enhanced virtual code block, wherein the enhanced virtual code block comprises structural semantic information of the atomic semantic unit and the context environment information of the atomic semantic unit. The method thereby achieves high-fidelity vectorization of code semantics and addresses the context loss caused by code fragmentation in the prior art.
Inventors
- HE PENG
- LIANG JUN
- LIU HONG
- GAO BIN
Assignees
- 成都新希望金融信息有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260213
Claims (10)
- 1. A method of vector characterization of a code block, the method comprising: acquiring a target source code and constructing an abstract syntax tree of the target source code; identifying a plurality of atomic semantic units in the target source code based on the abstract syntax tree of the target source code; identifying context environment information of each atomic semantic unit based on a pre-constructed global dependency reference graph and the abstract syntax tree, wherein the context environment information comprises host context information, data context information and dependency context information, and the global dependency reference graph is used for identifying cross-file function call relations, parent class inheritance, interface implementation and global variable references of the atomic semantic units; and generating an enhanced virtual code block corresponding to each atomic semantic unit according to the context environment information of that atomic semantic unit, and storing the enhanced virtual code block, wherein the enhanced virtual code block comprises the original code corresponding to the atomic semantic unit, structural semantic information of the atomic semantic unit and the context environment information of the atomic semantic unit.
- 2. The method of claim 1, wherein constructing the abstract syntax tree of the target source code comprises: performing full static analysis on the target source code with a preset syntax analysis tool to obtain an analysis result; and constructing the abstract syntax tree of the target source code according to the analysis result.
- 3. The method of claim 1, wherein identifying the plurality of atomic semantic units in the target source code based on the abstract syntax tree of the target source code comprises: identifying a plurality of specific semantic nodes contained in the abstract syntax tree, and taking each specific semantic node as an atomic semantic unit, wherein the specific semantic nodes comprise at least a method definition node and a class definition node.
- 4. The method of claim 1, wherein identifying the context environment information of each atomic semantic unit based on the pre-constructed global dependency reference graph and the abstract syntax tree comprises: determining the class definition node to which the atomic semantic unit belongs according to the hierarchical relation of the abstract syntax tree, and acquiring the host context information of the atomic semantic unit based on the class definition node, wherein the host context information comprises the name of the class to which the atomic semantic unit belongs, inherited parent classes, implemented interfaces and class comments; determining the abstract syntax subtree corresponding to the atomic semantic unit in the abstract syntax tree, and identifying the data context information of the atomic semantic unit from the abstract syntax subtree, wherein the data context information comprises referenced member variables, global constants and enumeration values; and identifying the dependency context information of the atomic semantic unit from the global dependency reference graph, wherein the dependency context information comprises global constants, imported external packages, tool class references and cross-file calling relations.
- 5. The method of claim 4, wherein determining the class definition node to which the atomic semantic unit belongs according to the hierarchical relation of the abstract syntax tree comprises: traversing upwards along the parent-child hierarchy of the abstract syntax tree to determine the class definition node to which the atomic semantic unit belongs, wherein the upward traversal comprises checking parent node types in sequence along the scope chain in the abstract syntax tree until a parent node type matches a class definition or method definition node type; and wherein identifying the dependency context information of the atomic semantic unit from the global dependency reference graph comprises: retrieving the external reference relations associated with the atomic semantic unit from the global dependency reference graph, and extracting the dependency context information of the atomic semantic unit from the external reference relations.
- 6. The method of claim 1, wherein generating the enhanced virtual code block corresponding to each atomic semantic unit according to the context environment information of each atomic semantic unit comprises: performing relevance filtering on the context environment information of the atomic semantic unit to obtain pruned context information of the atomic semantic unit; performing private logic inlining on the context environment information of the atomic semantic unit to obtain code to be inlined into the atomic semantic unit; and generating the enhanced virtual code block corresponding to the atomic semantic unit according to the pruned context information and the code to be inlined.
- 7. The method of claim 6, wherein performing relevance filtering on the context environment information of the atomic semantic unit to obtain the pruned context information of the atomic semantic unit comprises: performing static data flow analysis to identify which member variables in the context environment information are actually referenced, and eliminating member variables that are not actually referenced, so as to obtain the pruned context information of the atomic semantic unit; and wherein performing private logic inlining on the context environment information of the atomic semantic unit to obtain the code to be inlined into the atomic semantic unit comprises: determining a function call chain of the atomic semantic unit according to the dependency context information of the atomic semantic unit, and, if a sub-function called in the function call chain is detected to be a non-public method whose code complexity index is lower than a preset threshold, extracting the source code of the sub-function and marking it as code to be inlined.
- 8. The method of claim 6, wherein generating the enhanced virtual code block corresponding to the atomic semantic unit according to the pruned context information and the code to be inlined comprises: based on a preset semantic injection template, splicing the pruned context information and the code to be inlined onto the head of the source code of the atomic semantic unit, or at the position in the atomic semantic unit where the code to be inlined is called, to generate the enhanced virtual code block corresponding to the atomic semantic unit.
- 9. The method of claim 1, wherein storing the enhanced virtual code block comprises: inputting the enhanced virtual code block into a pre-trained code embedding model, generating a high-dimensional feature vector corresponding to the atomic semantic unit, and storing the high-dimensional feature vector.
- 10. An electronic device comprising a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium in communication over the bus when the electronic device is in operation, the processor executing the machine-readable instructions to perform the steps of the method of any one of claims 1-9.
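The relevance filtering, private logic inlining and template splicing recited in claims 6 to 8 can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the claimed implementation: it uses Python's standard `ast` module on Python source, replaces the static data flow analysis with a simple check for `self.<name>` references, uses an AST node count as a stand-in for the code complexity index, treats a leading underscore as "non-public", and writes the context as comment lines at the head of the method. The names `MAX_NODES`, `self_attrs` and `build_block` are hypothetical.

```python
import ast

SOURCE = '''\
class Order:
    tax = 0.2
    currency = "CNY"

    def total(self, net):
        return self._with_tax(net)

    def _with_tax(self, net):
        return net * (1 + self.tax)
'''

MAX_NODES = 30  # hypothetical complexity threshold, measured in AST nodes


def self_attrs(func):
    """Return every name X that appears as `self.X` inside func."""
    return {
        n.attr
        for n in ast.walk(func)
        if isinstance(n, ast.Attribute)
        and isinstance(n.value, ast.Name)
        and n.value.id == "self"
    }


def build_block(source, cls, method_name):
    """Assemble an enhanced block for one method of class node `cls`."""
    methods = {n.name: n for n in cls.body if isinstance(n, ast.FunctionDef)}
    fields = {
        t.id: ast.get_source_segment(source, stmt)
        for stmt in cls.body if isinstance(stmt, ast.Assign)
        for t in stmt.targets if isinstance(t, ast.Name)
    }
    target = methods[method_name]

    # Private logic inlining (assumption): helpers that are non-public
    # (leading underscore) and small enough (node count below MAX_NODES).
    inlined = [
        methods[a]
        for a in sorted(self_attrs(target))
        if a in methods and a.startswith("_")
        and sum(1 for _ in ast.walk(methods[a])) < MAX_NODES
    ]

    # Relevance filtering (simplified): keep only class-level fields that
    # the method or its inlined helpers actually reference.
    used = set().union(self_attrs(target), *(self_attrs(h) for h in inlined))
    kept = [fields[name] for name in sorted(used) if name in fields]

    # Template splicing: context goes to the head of the method's code.
    lines = [f"# host class: {cls.name}"]
    lines += [f"# field: {f}" for f in kept]
    for h in inlined:
        lines.append("# inlined helper:")
        lines += ["#   " + line for line in ast.unparse(h).splitlines()]
    lines.append(ast.unparse(target))
    return "\n".join(lines)


order_cls = ast.parse(SOURCE).body[0]
block = build_block(SOURCE, order_cls, "total")
print(block)
```

In this sketch the unused `currency` field is pruned, while `tax` survives because the inlined helper `_with_tax` reads it. `ast.unparse` requires Python 3.9 or later.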
Description
Vector characterization method of code block and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular to a vector characterization method for code blocks and an electronic device.
Background
In recent years, Source Code Representation Learning (SCRL) based on deep learning has developed rapidly and has become one of the core technologies driving intelligent software development. This type of technique aims to translate program code into a low-dimensional dense vector representation (i.e., a "code embedding") such that semantically similar code segments lie closer together in vector space, which supports efficient modeling of downstream tasks.
In the related art, source code is generally treated as plain text and is divided into a plurality of code segments by physical slicing with a fixed length or a sliding window (e.g., truncated every 512 tokens), and each code segment is input into an embedding model to generate its vector representation. However, forced truncation and sliding-window mechanisms break the structure of the source code (e.g., by splitting loop bodies or conditional statements apart) and produce illegal or incomplete code segments, so the generated vector representations lack specificity.
Disclosure of Invention
The application aims to provide a vector characterization method for code blocks and an electronic device that address the above defects in the prior art.
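The structural damage caused by fixed-length truncation, as described in the Background, can be reproduced with a small sketch. The following is only an illustration under assumptions: whitespace-separated words stand in for model tokens, the window of 6 is far smaller than the 512 mentioned above so that the effect is visible on a short function, and `fixed_window_chunks` and `is_parseable` are hypothetical helper names.

```python
import ast
import re

SOURCE = """\
def total(prices):
    result = 0
    for p in prices:
        if p > 0:
            result += p
    return result
"""


def fixed_window_chunks(text, window):
    """Cut text before every `window`-th whitespace-separated token."""
    starts = [m.start() for m in re.finditer(r"\S+", text)]
    cuts = [0] + starts[window::window] + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def is_parseable(fragment):
    """Report whether the fragment is syntactically valid Python on its own."""
    try:
        ast.parse(fragment)
        return True
    except SyntaxError:
        return False


chunks = fixed_window_chunks(SOURCE, window=6)
for chunk in chunks:
    print(repr(chunk), "->", "ok" if is_parseable(chunk) else "broken")
```

The complete function parses, but each fragment produced by the window cuts through the `for` loop or the `if` statement and no longer parses on its own, which is the kind of incomplete segment the Background refers to.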
In order to achieve the above purpose, the technical scheme adopted by the embodiments of the application is as follows.
In a first aspect, an embodiment of the present application provides a method for vector characterization of a code block, the method comprising: acquiring a target source code and constructing an abstract syntax tree of the target source code; identifying a plurality of atomic semantic units in the target source code based on the abstract syntax tree of the target source code; identifying context environment information of each atomic semantic unit based on a pre-constructed global dependency reference graph and the abstract syntax tree, wherein the context environment information comprises host context information, data context information and dependency context information, and the global dependency reference graph is used for identifying cross-file function call relations, parent class inheritance, interface implementation and global variable references of the atomic semantic units; and generating an enhanced virtual code block corresponding to each atomic semantic unit according to the context environment information of that atomic semantic unit, and storing the enhanced virtual code block, wherein the enhanced virtual code block comprises the original code corresponding to the atomic semantic unit, structural semantic information of the atomic semantic unit and the context environment information of the atomic semantic unit.
Optionally, constructing the abstract syntax tree of the target source code includes: performing full static analysis on the target source code with a preset syntax analysis tool to obtain an analysis result; and constructing the abstract syntax tree of the target source code according to the analysis result.
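As a rough illustration of the first steps of the method (building the abstract syntax tree, picking out class and method definition nodes as atomic semantic units, and attaching host context), the following sketch uses Python's standard `ast` module as the syntax analysis tool. This is an assumption made for illustration; the application does not name a specific parser or target language. The sketch covers only host context (class name, base classes, class docstring) and omits the global dependency reference graph. The names `atomic_units`, `host_context` and `enhanced_block` are hypothetical.

```python
import ast

SOURCE = '''\
class Account(Base):
    """A customer account."""
    rate = 0.05

    def interest(self, balance):
        return balance * self.rate

def helper(x):
    return x + 1
'''


def atomic_units(node, owner=None):
    """Yield (definition_node, enclosing_class_or_None) in source order."""
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.ClassDef):
            yield child, owner
            yield from atomic_units(child, child)
        elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield child, owner
            yield from atomic_units(child, owner)
        else:
            yield from atomic_units(child, owner)


def host_context(owner):
    """Collect class name, base classes and docstring of the host class."""
    if owner is None:
        return {}
    return {
        "class": owner.name,
        "bases": ", ".join(ast.unparse(b) for b in owner.bases),
        "doc": ast.get_docstring(owner),
    }


def enhanced_block(source, node, owner):
    """Prefix the unit's original code with its host context as comments."""
    header = [f"# {key}: {value}" for key, value in host_context(owner).items()]
    code = ast.get_source_segment(source, node)
    return "\n".join(header + [code])


tree = ast.parse(SOURCE)
units = list(atomic_units(tree))
blocks = {node.name: enhanced_block(SOURCE, node, owner) for node, owner in units}
print(blocks["interest"])
```

Here the enclosing class is tracked while descending the tree, which is a simplification; an implementation closer to the upward traversal described in the application would store parent links and walk from each unit towards the root. `ast.unparse` requires Python 3.9 or later.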
Optionally, identifying the plurality of atomic semantic units in the target source code based on the abstract syntax tree of the target source code includes: identifying a plurality of specific semantic nodes contained in the abstract syntax tree, and taking each specific semantic node as an atomic semantic unit, wherein the specific semantic nodes comprise at least a method definition node and a class definition node.
Optionally, identifying the context environment information of each atomic semantic unit based on the pre-constructed global dependency reference graph and the abstract syntax tree includes: determining the class definition node to which the atomic semantic unit belongs according to the hierarchical relation of the abstract syntax tree, and acquiring the host context information of the atomic semantic unit based on the class definition node, wherein the host context information comprises the name of the class to which the atomic semantic unit belongs, inherited parent classes, implemented interfaces and class comments; determining the abstract syntax subtree corresponding to the atomic semantic unit in the abstract syntax tree, and identifying the data context information of the atomic semantic unit from the abstract syntax subtree, wherein the data context information comprises referenced member variables, global constants and enumeration values; and identifying the dependency context information of the atomic semantic unit from the global dependency reference graph, wherein the dependency contex