CN-121996247-A - Multi-language program structure unified modeling method and system based on semantic preservation

Abstract

The invention discloses a multi-language program structure unified modeling method and system based on semantic preservation, and relates to the technical field of program analysis and network security. The method comprises: receiving source code in multiple programming languages; generating the corresponding native abstract syntax trees (ASTs) using front-end parsers configured with a declarative mapping framework (DSMF); converting the native ASTs into the initial nodes and edges of a unified structured semantic graph (SSG) according to predefined mapping rules; receiving type information of the multiple programming languages; mapping the type information into unified five-tuple semantic vectors using a multi-language unified type system (MLUTS); and annotating the semantic vectors as attributes on the corresponding nodes of the SSG. The method constructs a unified model containing control flow, data flow and program semantics, supports static vulnerability detection and deep taint analysis in cross-language environments, and improves the coverage and accuracy of analysis.
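The front-end step summarized above (native AST nodes converted into initial SSG nodes and edges by declarative rules) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the rule table, the native AST node type names (`MethodDeclaration`, `FnItem`, and so on), the `contains` edge label and the `Unmapped` fallback are all assumptions made for the example. Only the three SSG node categories come from the claims.

```python
from dataclasses import dataclass, field

# Hypothetical static rules: (language, native AST node type) -> SSG node type.
# The patent describes rules stored as YAML or JSON; a dict stands in for a loaded rule file.
RULES = {
    ("java", "MethodDeclaration"): "StructureDefinition",
    ("rust", "FnItem"): "StructureDefinition",
    ("java", "Assignment"): "OperationExecution",
    ("rust", "LetStatement"): "OperationExecution",
    ("java", "ThrowStatement"): "SemanticAnchor",
}


@dataclass
class AstNode:
    kind: str                      # native AST node type (hypothetical names)
    name: str = ""                 # identifier, if any
    children: list = field(default_factory=list)


@dataclass
class Ssg:
    nodes: list = field(default_factory=list)  # (node_id, ssg_type, attributes)
    edges: list = field(default_factory=list)  # (parent_id, child_id, label)


def to_ssg(lang, root):
    """Walk a native AST and emit initial SSG nodes and edges using the static rules."""
    graph = Ssg()

    def visit(node, parent_id):
        # Assumption: node types without a rule are marked "Unmapped" rather than dropped.
        ssg_type = RULES.get((lang, node.kind), "Unmapped")
        node_id = len(graph.nodes)
        # Attribute binding: carry the identifier over to the generated SSG node.
        graph.nodes.append((node_id, ssg_type, {"lang": lang, "identifier": node.name}))
        if parent_id is not None:
            graph.edges.append((parent_id, node_id, "contains"))
        for child in node.children:
            visit(child, node_id)

    visit(root, None)
    return graph


java_ast = AstNode("MethodDeclaration", "parse",
                   [AstNode("Assignment", "x"), AstNode("ThrowStatement")])
rust_ast = AstNode("FnItem", "parse", [AstNode("LetStatement", "x")])

java_ssg = to_ssg("java", java_ast)
rust_ssg = to_ssg("rust", rust_ast)
```

In this sketch the Java method and the Rust function both become `StructureDefinition` nodes, which is the kind of cross-language alignment the unified graph is meant to provide.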

Inventors

  • YUAN MINGKUN
  • SU YUPENG
  • LU YINING
  • WEI CHENYU
  • WEI GUOWEN
  • WANG HUIBO
  • ZHU LINLIN
  • WANG QI
  • ZHANG YINGZHOU
  • YUAN XUDONG
  • SUN HAORAN
  • Sun Shidai

Assignees

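  • Hangzhou Anheng Information Technology Co., Ltd. (杭州安恒信息技术股份有限公司)
  • Nanjing University of Posts and Telecommunications (南京邮电大学)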

Dates

Publication Date
2026-05-08
Application Date
2026-01-20

Claims (10)

  1. A multi-language program structure unified modeling method based on semantic preservation, characterized by comprising the following steps:
     receiving source code in multiple programming languages, generating the corresponding native abstract syntax tree (AST) using a front-end parser configured with a declarative mapping framework (DSMF), and converting the native AST into initial nodes and edges of a unified structured semantic graph (SSG) according to predefined mapping rules;
     receiving type information of the multiple programming languages, mapping the type information into unified five-tuple semantic vectors using a multi-language unified type system (MLUTS), and annotating the semantic vectors as attributes on the corresponding nodes of the SSG, thereby achieving semantic alignment of cross-language types, wherein the five-tuple semantic vector explicitly describes the basic type, ownership semantics, lifetime constraints, memory layout features and concurrency state of the data;
     identifying foreign function interface (FFI) call sites and definition sites in the source program, calculating memory layout fingerprints of data structures according to the application binary interface (ABI) specification of the target runtime environment, constructing FFI descriptors, and generating bridging edges in the SSG that connect the subgraphs of different languages; and
     constructing control flow edges, data flow edges and call edges on the basis of the initial nodes and bridging edges of the SSG, to form a complete program structure model comprising a physical layer, a logical layer and a semantic layer.
  2. The multi-language program structure unified modeling method based on semantic preservation according to claim 1, wherein the multiple programming languages include Java, Rust, C/C++ and Go.
  3. The multi-language program structure unified modeling method based on semantic preservation according to claim 2, wherein generating the corresponding native abstract syntax tree (AST) using a front-end parser configured with a declarative mapping framework, and converting the native AST into initial nodes and edges of a unified structured semantic graph (SSG) according to predefined mapping rules, specifically comprises:
     rule configuration loading: the system loads a predefined declarative mapping rule base, wherein the rule base is in YAML or JSON format and defines the static mapping between the AST node types of the different programming languages and the standard node types of the SSG;
     hierarchical mapping: a mapping engine traverses the input native AST and performs hierarchical mapping according to the rule base, wherein for general program structures the static rules are applied directly to perform one-to-one or one-to-many mapping, and for language-specific complex structures the corresponding dynamic processing logic is triggered;
     dynamic semantic adaptation: for the complex structures, an embeddable script plug-in is called, wherein the script plug-in contains logic for generating or transforming SSG subgraphs, so that high-level language semantics are preserved and accurately expressed in the SSG; and
     attribute binding and output: while node type mapping is performed, key attributes of the source node, including identifiers and type annotations, are extracted according to the mapping rules and bound to the generated SSG nodes, and an intermediate representation formed by the initial SSG nodes and edges is finally output.
  4. The multi-language program structure unified modeling method based on semantic preservation according to claim 3, wherein the unified structured semantic graph (SSG) is a multi-attribute directed graph whose node types at least comprise:
     structure definition nodes, representing the static hierarchical structure of modules, classes, functions and interfaces;
     operation execution nodes, representing concrete instruction behaviors such as declaration, assignment, call and arithmetic operation; and
     semantic anchor nodes, representing control flow interruption points such as exception throwing, resource release and asynchronous suspension.
  5. The multi-language program structure unified modeling method based on semantic preservation according to claim 4, wherein the five-tuple semantic vector T is defined as $T = \langle \tau_{base}, \mu_{own}, \lambda_{life}, \delta_{mem}, \xi_{sync} \rangle$, wherein:
     $\tau_{base}$ represents the basic data structure type, used to unify scalar and compound types across different languages;
     $\mu_{own}$ represents the ownership semantic weight, used to describe the memory management ownership of a variable, whose values at least comprise Owned (exclusive), Borrowed, Shared, and Foreign (externally managed);
     $\lambda_{life}$ represents the lifetime constraint set, which describes the live range of a variable through interval logic or scope identifiers;
     $\delta_{mem}$ represents the memory layout fingerprint, which describes the byte alignment, padding pattern and field offsets of the data structure in memory; and
     $\xi_{sync}$ represents the concurrency primitive state, used to mark synchronization lock, atomic operation or channel state.
  6. The multi-language program structure unified modeling method based on semantic preservation according to claim 5, wherein identifying FFI call sites and definition sites in the source program, calculating memory layout fingerprints of data structures according to the ABI specification of the target runtime environment, constructing FFI descriptors, and generating bridging edges in the SSG that connect the subgraphs of different languages, specifically comprises:
     extracting the parameter types and return value type of the cross-language call;
     simulating the in-memory layout of each type according to the ABI specification of the target platform, and generating a memory layout fingerprint;
     comparing the memory layout fingerprints of the caller and the callee, and marking a memory layout risk attribute on the bridging edge if the fingerprints are inconsistent; and
     inserting adaptation nodes into the SSG according to the calling convention, to simulate the data flow of pushing parameters onto the stack and passing them in registers.
  7. The multi-language program structure unified modeling method based on semantic preservation according to claim 6, wherein constructing control flow edges, data flow edges and call edges on the basis of the initial nodes and bridging edges of the SSG, to form a complete program structure model comprising a physical layer, a logical layer and a semantic layer, specifically comprises:
     control flow edge construction: establishing directed edges between operation execution nodes and semantic anchor nodes according to the sequence, branch, loop and exception handling structures in the program, wherein the directed edges represent the logical order and path transfer of program execution;
     data flow edge construction: based on the definition-use relations of variables, and combining the ownership semantic weight and lifetime constraint set attributes of the five-tuple semantic vector, establishing data dependency edges between definition nodes and use nodes, to support cross-function and cross-semantic data flow tracking; and
     call edge construction: establishing call edges between call nodes and the entry nodes of the called functions according to the function call relations, supporting a uniform representation of same-language calls and cross-language FFI calls, and propagating the semantic information of parameters and return values.
  8. A multi-language program structure unified modeling system based on semantic preservation, for implementing the multi-language program structure unified modeling method based on semantic preservation according to any one of claims 1 to 7, comprising:
     a multi-language front-end adaptation module, configured to receive source code in multiple programming languages, generate the corresponding native abstract syntax tree (AST) using a front-end parser configured with a declarative mapping framework (DSMF), and convert the native AST into initial nodes and edges of a unified structured semantic graph (SSG) according to predefined mapping rules;
     a unified type engine module, configured to receive type information of the multiple programming languages, map the type information into unified five-tuple semantic vectors using the multi-language unified type system (MLUTS), and annotate the semantic vectors as attributes on the corresponding nodes of the SSG, so as to achieve semantic alignment of cross-language types, wherein the five-tuple semantic vector explicitly describes the basic type, ownership semantics, lifetime constraints, memory layout features and concurrency state of the data;
     a cross-language linking module, configured to identify FFI call sites and definition sites in the source program, calculate memory layout fingerprints of data structures according to the ABI specification of the target runtime environment, construct FFI descriptors, and generate bridging edges in the SSG that connect the subgraphs of different languages; and
     a graph construction core module, configured to construct control flow edges, data flow edges and call edges on the basis of the initial nodes and bridging edges of the SSG, to form a complete program structure model comprising a physical layer, a logical layer and a semantic layer.
  9. The multi-language program structure unified modeling system based on semantic preservation according to claim 8, wherein the unified type engine module has a built-in MLUTS algorithm comprising a semantic dimension reducer configured to:
     map garbage-collected objects of Java/Go to the "externally managed" or "shared" ownership state;
     map Rust/C++ smart pointers to the "exclusive" or "shared" ownership state while preserving their destructor semantics; and
     thereby ensure that languages with different memory management models have comparable semantic representations in the SSG.
  10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when loading and executing the computer program, performs the multi-language program structure unified modeling method based on semantic preservation according to any one of claims 1 to 7.
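The memory layout fingerprint comparison of claim 6 can be illustrated with a small sketch. It is an assumption-laden example rather than the claimed algorithm: the type sizes and alignments are those of a typical LP64 target, the layout rule is the common "align each field to its natural alignment, then pad the tail" scheme, and the function names, the SHA-256-based fingerprint format and the `layout_risk` attribute name are hypothetical.

```python
import hashlib

# Assumed (size, alignment) in bytes for a typical LP64 target.
# A real tool would take these from the ABI specification of the target platform.
PRIMS = {"u8": (1, 1), "i32": (4, 4), "i64": (8, 8), "ptr": (8, 8)}


def layout(fields, packed=False):
    """Return (offsets, total_size, alignment) for an ordered list of (name, type)."""
    offset, max_align, offsets = 0, 1, []
    for name, ty in fields:
        size, align = PRIMS[ty]
        if packed:
            align = 1                                    # packed structs drop padding
        offset = (offset + align - 1) // align * align   # round up to the field alignment
        offsets.append((name, offset, size))
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total, max_align


def fingerprint(fields, packed=False):
    """Hash size, alignment and field offsets; field names are ignored on purpose,
    since the two sides of an FFI call may name the same field differently."""
    offsets, total, align = layout(fields, packed)
    text = f"{total}:{align}:" + ",".join(f"{o}+{s}" for _, o, s in offsets)
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def bridge_edge(caller_fp, callee_fp):
    """Build a bridging-edge record and flag a layout risk when fingerprints differ."""
    return {"kind": "ffi_bridge", "layout_risk": caller_fp != callee_fp}


fields = [("tag", "u8"), ("len", "i32"), ("data", "ptr")]
natural = layout(fields)               # callee side: natural alignment
packed = layout(fields, packed=True)   # caller side: packed declaration of the same struct
edge = bridge_edge(fingerprint(fields), fingerprint(fields, packed=True))
```

With these assumed sizes, the naturally aligned layout places the fields at offsets 0, 4 and 8 for a total of 16 bytes, while the packed layout places them at 0, 1 and 5 for a total of 13 bytes, so the bridging edge is marked as a layout risk.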

Description

Multi-language program structure unified modeling method and system based on semantic preservation

Technical Field

The invention relates to the technical field of program analysis and network security, and in particular to a multi-language program structure unified modeling method and system based on semantic preservation.

Background

With the rapid development of the software industry, modern large-scale software systems are increasingly complex, and multi-language hybrid programming has become a mainstream development paradigm. For example, Java or Go is used to build high-concurrency business logic, while underlying high-performance modules written in C/C++ or Rust are called through JNI or cgo. Although such heterogeneous architectures improve development efficiency and system performance, they also pose serious challenges for software quality assurance and security testing. Existing program structure modeling and static analysis methods have the following main shortcomings:

1. The "language island" effect. Traditional static analysis tools (e.g., FindBugs, Checkstyle) are typically designed for a single language and cannot follow call relationships across languages. When the analysis reaches a language boundary (such as Java calling a C function), it is often forced to stop, so the call graph is broken and an analysis blind spot forms.

2. Semantic loss in the intermediate representation (IR). To support multiple languages, some tools (e.g., LLVM-based analyzers) convert source code into a low-level IR. However, low-level IRs such as LLVM IR mainly serve compilation optimization, and their level of abstraction is too low, so semantics in the source code (e.g., the ownership rules of Rust, the generic constraints of Java, the channel mechanism of Go) are stripped away during conversion. Without these program semantics, a security analysis tool can hardly detect complex vulnerabilities such as memory leaks, dangling pointers or concurrency deadlocks accurately.

3. Missing FFI boundary modeling. Existing unified modeling methods tend to ignore low-level details at the foreign function interface (FFI). Different languages differ significantly in memory layout, byte alignment and parameter-passing conventions. If the model cannot describe these differences accurately, buffer overflows or data truncation vulnerabilities caused by ABI incompatibility cannot be detected.

Therefore, the invention provides a multi-language program structure unified modeling method and system based on semantic preservation.

Disclosure of Invention

The invention aims to provide a multi-language program structure unified modeling method and system based on semantic preservation which, by constructing a unified Structured Semantic Graph (SSG) and a multi-language unified type system (MLUTS), achieves deep unified modeling of heterogeneous code in Java, Rust, C/C++, Go and other languages, and addresses the problems of semantic loss and broken analysis at language boundaries in cross-language analysis.
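As a rough illustration of how MLUTS might align types from languages with different memory management models, the sketch below represents the five-tuple semantic vector as a small record and fills its ownership field from a lookup table. The table entries, the `unify` helper, the string placeholders for the lifetime, layout and concurrency fields, and the choice to default unknown wrappers to Foreign are all assumptions for the example. Only the four ownership states and the general Java/Go versus Rust/C++ mapping come from the claims.

```python
from dataclasses import dataclass
from enum import Enum


class Ownership(Enum):
    OWNED = "Owned"        # exclusive ownership
    BORROWED = "Borrowed"
    SHARED = "Shared"
    FOREIGN = "Foreign"    # externally managed


@dataclass(frozen=True)
class SemanticVector:
    base: str        # basic data structure type
    own: Ownership   # ownership semantics
    life: str        # lifetime constraint (placeholder string in this sketch)
    mem: str         # memory layout fingerprint (placeholder string in this sketch)
    sync: str        # concurrency primitive state (placeholder string in this sketch)


# Hypothetical ownership rules: (language, type wrapper) -> ownership state.
OWNERSHIP_RULES = {
    ("rust", "Box"): Ownership.OWNED,
    ("rust", "&"): Ownership.BORROWED,
    ("rust", "Arc"): Ownership.SHARED,
    ("cpp", "unique_ptr"): Ownership.OWNED,
    ("cpp", "shared_ptr"): Ownership.SHARED,
    ("java", "Object"): Ownership.SHARED,   # garbage-collected reference
    ("go", "pointer"): Ownership.SHARED,    # garbage-collected pointer
}


def unify(lang, wrapper, base, life="scope", mem="", sync="none"):
    """Map a language-specific type to a five-tuple semantic vector.

    Assumption: wrappers without a rule (for example a raw C pointer) are
    treated as externally managed.
    """
    own = OWNERSHIP_RULES.get((lang, wrapper), Ownership.FOREIGN)
    return SemanticVector(base, own, life, mem, sync)


rust_buf = unify("rust", "Arc", "Buffer")
cpp_buf = unify("cpp", "shared_ptr", "Buffer")
c_buf = unify("c", "raw_ptr", "Buffer")
```

Here a Rust `Arc` and a C++ `shared_ptr` end up with the same ownership state, which is what makes the two types comparable once they are attached to SSG nodes.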
According to a first aspect of the invention, to achieve the above object, the invention provides a multi-language program structure unified modeling method based on semantic preservation, comprising the following steps:

receiving source code in multiple programming languages, generating the corresponding native abstract syntax tree (AST) using a front-end parser configured with a declarative mapping framework (DSMF), and converting the native AST into initial nodes and edges of a unified structured semantic graph (SSG) according to predefined mapping rules;

receiving type information of the multiple programming languages, mapping the type information into unified five-tuple semantic vectors using a multi-language unified type system (MLUTS), and annotating the semantic vectors as attributes on the corresponding nodes of the SSG, thereby achieving semantic alignment of cross-language types, wherein the five-tuple semantic vector explicitly describes the basic type, ownership semantics, lifetime constraints, memory layout features and concurrency state of the data;

identifying foreign function interface (FFI) call sites and definition sites in the source program, calculating memory layout fingerprints of data structures according to the application binary interface (ABI) specification of the target runtime environment, constructing FFI descriptors, and generating bridging edges in the SSG that connect the subgraphs of different languages; and

constructing control flow edges, data flow edges and call edges on the basis of the initial nodes and bridging edges of the SSG, to form a complete program structure model comprising a physical layer, a logical layer and a semantic layer.

Further, th