CN-121996252-A - ANTLR-based code block identification and dependency tree construction method and device

CN121996252ACN 121996252 ACN121996252 ACN 121996252ACN-121996252-A

Abstract

The application discloses an ANTLR-based code block identification and dependency tree construction method and device. The method comprises the steps of defining an ANTLR grammar rule file for a target programming language, automatically generating a lexical analyzer and a grammar analyzer by utilizing an ANTLR tool, converting a source code into a grammar analysis tree, traversing the grammar analysis tree by adopting a monitor mode, introducing a stack data structure to dynamically track a code block context, constructing code block nodes by rewriting an entry method and an exit method related to the code blocks, establishing parent-child hierarchical relations among the nodes, constructing a dependency tree based on the hierarchical relations, wherein each node comprises type, name, position and parent-child reference information, and finally rendering and outputting a tree structure. According to the application, the parser is automatically generated through formalized grammar rules, so that manual word segmentation and hard coding logic of the traditional method are avoided, and the accuracy, efficiency and maintainability of code block identification and dependency construction are remarkably improved.

Inventors

Peng Zuyu
LIU PING

Assignees

北京海量数据技术股份有限公司

Dates

Publication Date: 20260508
Application Date: 20260128

Claims (10)

1. An ANTLR-based code block identification and dependency tree construction method, the method comprising: S1, defining an ANTLR grammar rule file for a target programming language, wherein the ANTLR grammar rule file comprises lexical rules of a description language basic vocabulary unit and grammar rules of a description language grammar structure; S2, based on the ANTLR grammar rule file, automatically generating corresponding lexical analyzers, grammar analyzers and basic monitor classes by using an ANTLR tool; S3, inputting source codes of the target programming language into the lexical analyzer to be converted into lexical symbol streams, and analyzing the lexical symbol streams into syntax analysis trees by the syntax analyzer according to the syntax rules; S4, traversing the syntax analysis tree by adopting a monitor mode, introducing a stack data structure in the traversing process to dynamically track the context of the current code block, and constructing code block nodes and establishing parent-child hierarchical relations among the code block nodes by rewriting an entry method and an exit method related to the code block in the basic monitor class; s5, constructing and rendering an output dependency tree based on the established code block hierarchical relationship, wherein each code block node in the dependency tree at least comprises a code block type, a name, source code position information, a parent node reference and a child node list.
2. The method of claim 1, wherein the target programming language in step S1 comprises SQL, java, python; The ANTLR grammar rule file adopts a g4 file format supporting LL (x) parsing algorithm to handle left recursion and grammar ambiguity.
3. The method of claim 1, wherein step S3 further comprises constructing a usable parse tree using an ANTLR error recovery mechanism in the event of a syntax error in the source code.
4. The method according to claim 1, wherein step S4 specifically comprises: In the entry method of the rewritten basic monitor class, a corresponding code block node is created, the association between the code block node and a stack top father node is established, and the node is pressed into a stack; in the exiting method of the rewritten basic listener class, the stack-stripping operation is executed to complete the construction of the current code block node.
5. The method according to claim 1, wherein step S4 specifically comprises: (1) When entering a code block, creating a code block node in a corresponding entering method, pushing the code block node into a stack, and recording the hierarchical relationship between the code block node and a parent code block represented by the next node in the stack; (2) When traversing the sentences in the code block, taking the trestle top node as the current code block, and recording the discovered sub-code block as a sub-node of the current code block node; (3) When exiting the code block, the code block node is popped from the stack in the corresponding exit method.
6. The method according to claim 1, wherein the root node of the dependency tree represents the entire source code script in step S5, its child nodes correspond to top-level code blocks, and the inner nested code blocks are recursively constructed as lower child nodes of the corresponding parent nodes; each code block node in the dependency tree also includes symbol table information, data stream information, and code metric indicators.
7. The method of claim 1, wherein the rendering output in step S5 comprises serializing the dependency tree into JSON, XML format, or graphically exposing the tree structure.
8. The method of claim 1, further comprising employing an incremental parsing strategy when the parsed object is a large-scale source code file and storing only critical metadata information in the code block nodes, the source code details being read from the original input as needed to reduce memory usage.
9. An ANTLR-based code block identification and dependency tree construction apparatus, the apparatus comprising: The grammar rule definition module is used for defining an ANTLR grammar rule file for the target programming language, wherein the ANTLR grammar rule file comprises lexical rules of a description language basic vocabulary unit and grammar rules of a description language grammar structure; the code generation module is used for automatically generating corresponding lexical analyzers, grammar analyzers and basic monitor classes by utilizing an ANTLR tool based on the ANTLR grammar rule file; The grammar analysis module is used for inputting the source code of the target programming language into the lexical analyzer to be converted into a lexical symbol stream, and then the lexical symbol stream is analyzed into a grammar analysis tree by the grammar analyzer according to the grammar rule; The traversal processing module is used for traversing the syntax analysis tree by adopting a monitor mode, introducing a stack data structure in the traversal process to dynamically track the context of the current code block, identifying each code block and establishing a parent-child hierarchical relationship among nodes of each code block by rewriting an entry method and an exit method related to the code block in the basic monitor class; And the dependency tree construction module is used for constructing and rendering an output dependency tree based on the established code block hierarchical relationship, and each code block node in the dependency tree at least comprises a code block type, a name, source code position information, a parent node reference and a child node list.
10. An electronic device is characterized by comprising a memory and a processor; A memory for storing a computer program; a processor for executing the computer program to implement the steps of the ANTLR based code block identification and dependency tree construction method according to any one of claims 1 to 8.

Description

ANTLR-based code block identification and dependency tree construction method and device Technical Field The application relates to the technical fields of computer software engineering, program analysis and processing, in particular to a ANTLR (Another Tool for Language Recognition) -based code block identification and dependency tree construction method, a ANTLR (Another Tool for Language Recognition) -based code block identification and dependency tree construction device, a computer-readable storage medium and electronic equipment, which are mainly applied to scenes such as source code analysis, code structure analysis, dependency relation extraction, software architecture visualization and the like. Background Code block identification and dependency tree construction are two fundamental key tasks in the field of program analysis and understanding. Code block identification aims at decomposing source code into logically independent semantic units (e.g., functions, classes, control structures, etc.), and dependency tree construction is used to reveal calls, inclusions, and their dependencies between these units. The accuracy and efficiency of these two tasks directly affect the effectiveness of upper layer applications such as code understanding, static analysis, reconstruction, debugging, and system architecture recovery. 1. Limitations of conventional code block identification methods Early code analysis tools commonly employ custom lexical analysis based on native programming language implementation, relying on heuristic rules such as keyword matching, bracket counting, and indentation analysis to accomplish code block identification. This method has the following inherent drawbacks: the accuracy is insufficient, the grammar complexity of the modern programming language is high, and all grammar scenes are difficult to process correctly only by matching keywords with brackets. For example, nested sub-query structures in the SQL language often result in level boundary misinterpretations. The maintainability is poor, grammar rules of different programming languages have obvious differences, word segmentation logic needs to be independently realized and maintained for each language, and when language standards are updated, corresponding rules also need to be synchronously modified, so that the maintenance cost is high. Performance bottleneck-in order to process complex grammar scenes, word segmentation logic usually contains a large number of conditional branch judgments, and when large-scale code files are analyzed, the execution efficiency is obviously reduced. 2. Limitations of traditional dependency tree construction methods Conventional approaches typically employ hard-coded logic (e.g., deep nested if-else or switch-case statements) to infer dependencies between code blocks, such as identifying stored procedure CALLs by pattern-matching CALL statements. The method has the following problems: the flexibility is poor, hard coding logic is sensitive to code style changes, and the diversity of grammar expression is difficult to adapt. And the expandability is poor, and the expandability is limited because the new language support or grammar structure needs to modify the core processing logic. The reliability is low, the boundary condition of complex conditional branches is easy to miss, and misjudgment is easy to generate under the scenes of dynamic SQL or conditional compiling and the like. 3. Other technical challenges In addition to the above methodological drawbacks, conventional implementations face a dual challenge of performance and accuracy: Performance bottleneck-when resolving ten thousand lines of code files, memory consumption may exceed 6GB, resulting in system response delay and even crash. The accuracy is poor, the traditional method is difficult to effectively process the context-dependent grammar, and the structure widely exists in SQL and other languages and is easy to cause analysis errors. The ANTLR technique provides a possibility for solving the above-mentioned problems by automatically generating a lexical analyzer and a syntax analyzer and constructing a complete syntax analysis tree. However, how to efficiently use ANTLR to realize accurate code block identification and construct a tree structure capable of accurately reflecting the dependency relationship of code blocks is still a hot technical problem to be solved currently. Disclosure of Invention In order to cope with the technical requirements and overcome a series of defects existing in the prior art, the application provides a novel code block identification and dependency tree construction method based on ANTLR. In modern software engineering, the analysis and understanding of program code is a fundamental and critical element. Conventional code analysis methods typically employ manual writing of parsers or regular expression-based text processing, which are not only inefficient to develop, but also difficult to accurat