CN-122019394-A - Program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion


Abstract

The invention belongs to the technical field of software defect detection and vulnerability detection and provides a program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion. The method uses a large language model to extract structural elements from function-level source code and to generate logic reasoning paths covering data flow chains, control dependence and key variable semantics. A hierarchical bottleneck fusion mechanism then quantitatively extracts the key logic features related to vulnerabilities and aligns and aggregates information across the two streams, which improves the detection of complex logic defects. The invention addresses several problems of existing approaches: deep learning models that focus only on the lexical structure of code and lack deep logical reasoning capability; traditional structure graphs that are costly to construct and complex to engineer; structural information that is prone to redundancy; and results that are difficult to reproduce stably within a limited time.
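
The claims state that the LLM-generated logic description is output as JSON, checked against the identifiers of the source code, and linearized before encoding. The following is a minimal sketch of that preparation step, not the patented implementation. The field names, the fallback value, the bracketed linearization format, and the short keyword list are all assumptions made for illustration.

```python
import copy
import json
import re

# Hypothetical field names and fallback value: the patent names the three
# dimensions and a "preset" fallback, but not their exact encoding.
FIELDS = ("data_flow", "control_dependence", "key_semantics")
FALLBACK = {field: [] for field in FIELDS}

IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Small, non-exhaustive C keyword list so that keywords alone never count
# as evidence that a description refers to the source code.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char",
              "void", "const", "struct", "sizeof", "break", "continue"}


def identifiers(text):
    """Return the identifier-like tokens in a piece of text, minus keywords."""
    return set(IDENT_RE.findall(text)) - C_KEYWORDS


def strings_in(value):
    """Collect every string value inside a nested JSON-like object."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        return [s for v in value.values() for s in strings_in(v)]
    if isinstance(value, list):
        return [s for v in value for s in strings_in(v)]
    return []


def verify(source_code, raw_json):
    """Parse the LLM output; fall back if it cannot be tied to the source."""
    try:
        structure = json.loads(raw_json)
    except json.JSONDecodeError:
        return copy.deepcopy(FALLBACK)
    if not isinstance(structure, dict):
        return copy.deepcopy(FALLBACK)
    described = identifiers(" ".join(strings_in(structure)))
    # Unverifiable: no identifiers at all, or none shared with the source.
    if not described or described.isdisjoint(identifiers(source_code)):
        return copy.deepcopy(FALLBACK)
    return structure


def linearize(structure):
    """Flatten the structured description into one text sequence."""
    parts = []
    for field in FIELDS:
        items = structure.get(field, [])
        if isinstance(items, str):
            items = [items]
        elif not isinstance(items, list):
            items = []
        parts.append(f"[{field}] " + (" ; ".join(str(i) for i in items) or "none"))
    return " ".join(parts)
```

Plain-English words in a description can coincide with source identifiers, so a production filter would likely need a stricter notion of "resolvable identifier" than this regular expression.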

Inventors

BAO TIE
HUANG YUYAN
PENG TAO
WANG SHANG
LIU LU

Assignees

Jilin University (吉林大学)

Dates

Publication Date: 20260512
Application Date: 20260331

Claims (6)

1. A program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion, characterized by comprising the following steps: (1) performing data processing with a double-flow input construction method consisting of a source code sequence and a symbolized logic description; (2) taking the source code sequence as a first input stream and the symbolized logic description as a second input stream to form a sample-level double-stream input pair, and encoding the source code stream and the structure stream separately with a pre-trained language model to obtain two context representation sequences; (3) performing feature interaction with a hierarchical bottleneck fusion module, in which a bottleneck query sequence is initialized from the hidden states of the source code sequence, an information bottleneck is realized by layer-by-layer truncation and compression, and information is alternately aggregated from the source code stream and the structure supplement stream to obtain a fused global representation vector; (4) inputting the fused global representation vector into a linear classifier for classification to obtain a prediction of whether the code is defective.
2. The program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion according to claim 1, wherein performing data processing with the double-flow input construction method consisting of a source code sequence and a symbolized logic description comprises: source code preprocessing and serialization, namely applying standard preprocessing to the original C function code segment and converting the code into a token ID sequence with the tokenizer of the pre-trained model; and structured knowledge extraction and filtering, namely using a large language model as a knowledge extractor to generate a structured logic description for each code segment and output it in JSON format.
3. The program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion according to claim 2, wherein the structured knowledge extraction and filtering specifically comprises: multi-dimensional structure extraction, namely instructing the LLM to extract information along three key dimensions: data flow chains, control dependence and key semantics; and extracting an identifier set from the structural description generated by the LLM and matching it lexically against the source code, wherein if the structural description contains no resolvable identifier, or its identifier set does not intersect the source code identifier set, the structural description is judged unverifiable and the structural field of the sample is uniformly rolled back to a preset fallback value.
4. The program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion according to claim 2, wherein taking the source code sequence as a first input stream and the symbolized logic description as a second input stream to form a sample-level double-stream input pair, and encoding the source code stream and the structure stream separately with a pre-trained language model to obtain two context representation sequences, specifically comprises: using the pre-trained language model as a general-purpose encoder for both the source code stream and the structure stream, wherein: the code encoding stream takes the preprocessed source code sequence as input and, using the context modeling capability of the pre-trained language model, captures the syntactic and semantic dependencies inside the code through a multi-layer self-attention mechanism, yielding a source code sequence representation that contains complete context information; and the structure encoding stream takes the linearized structure description sequence as input and yields a global representation of the structure.
5. The program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion according to claim 4, wherein performing feature interaction with the hierarchical bottleneck fusion module, in which a bottleneck query sequence is initialized from the hidden states of the source code sequence, an information bottleneck is realized by layer-by-layer truncation and compression, and information is alternately aggregated from the source code stream and the structure supplement stream to obtain a fused global representation vector, specifically comprises: the hierarchical bottleneck fusion module initializes a group of bottleneck query vectors from the sequence representation of the source code stream and, in each interaction layer, truncates and compresses the query sequence, so that the query sequence aggregates key cross-stream information layer by layer and suppresses redundant noise, each interaction layer comprising two types of operations: first, the bottleneck query vectors serve as the Query and perform cross-attention over the source code stream and the structure supplement stream respectively, so that, through adaptive allocation of attention weights, the bottleneck vectors actively retrieve and aggregate from the two modalities the feature fragments most relevant to the defect detection task; second, the two original sequences serve as the Query and the bottleneck queries serve as memory, the sequences are updated through cross-attention, and the two representations are aligned with the bottleneck as intermediary; after multiple interaction layers, a fusion vector formed by concatenating three parts is output: the first-token representation of the bottleneck queries, the first-token representation of the updated source code sequence, and the first-token representation of the updated structure supplement sequence.
6. The program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion according to claim 1, further comprising introducing a multi-task joint loss function, comprising: based on the output fusion vector, outputting the defect prediction probability through a fully connected classification layer, and computing the difference between the predicted value and the true label with a cross-entropy loss function that weights positive-class samples, which directly guides the model to learn the defect classification boundary; and an auxiliary structural alignment loss, realized by a nonlinear projection module consisting of a multi-layer perceptron.
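
The two-step interaction in claim 5 can be illustrated with a small, dependency-free sketch. It is an approximation under stated assumptions, not the patented module: attention is single-head with no learned projections, the two streams are attended one after the other, truncation simply keeps the first `width` bottleneck vectors, and the function names and the `widths` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Scaled dot-product cross-attention with a residual update.

    Assumption: no learned Q/K/V projections, single head.
    """
    scale = math.sqrt(len(memory[0]))
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        weights = softmax(scores)
        context = [sum(w * m[j] for w, m in zip(weights, memory))
                   for j in range(len(q))]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def bottleneck_fusion(code_seq, struct_seq, widths):
    """Fuse two token sequences through a shrinking bottleneck.

    widths: bottleneck length at each layer, e.g. (4, 2).
    Returns the concatenation of three first-token vectors.
    """
    # Bottleneck queries start as a copy of the source code hidden states.
    bottleneck = [row[:] for row in code_seq]
    for width in widths:
        # Truncate to realise the information bottleneck at this layer.
        bottleneck = bottleneck[:width]
        # Operation 1: the bottleneck queries gather from both streams.
        bottleneck = cross_attend(bottleneck, code_seq)
        bottleneck = cross_attend(bottleneck, struct_seq)
        # Operation 2: each stream is updated using the bottleneck as memory.
        code_seq = cross_attend(code_seq, bottleneck)
        struct_seq = cross_attend(struct_seq, bottleneck)
    return bottleneck[0] + code_seq[0] + struct_seq[0]
```

With hidden size d, the returned vector has length 3d, matching the three-part concatenation that the claim feeds to the classifier.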

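Claim 6 names a positive-class-weighted cross-entropy and an auxiliary alignment loss but gives no formulas. The sketch below shows one common form of the weighted binary cross-entropy and a simple weighted sum for the joint loss. The `pos_weight` and `lam` parameters and the additive combination are assumptions, and the alignment loss is taken as an already computed number because the source does not specify it.

```python
import math


def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with the positive class scaled by pos_weight."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def joint_loss(classification_loss, alignment_loss, lam):
    """Hypothetical combination: classification loss plus lam times alignment loss."""
    return classification_loss + lam * alignment_loss
```

A `pos_weight` above 1 penalizes missed defects more than false alarms, which is a usual way to compensate when defective samples are the minority class.
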
Description

Program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion

Technical Field

The invention belongs to the technical field of software defect detection and vulnerability detection, and particularly relates to a program defect detection method based on semantic structure double-flow cooperation and noise immunity fusion.

Background

Software defect detection and vulnerability detection are important applications in software engineering and network security; they aim to predict the defect type of a code fragment from the content of the program code. Software defects or vulnerabilities can cause not only huge economic losses but also catastrophic safety accidents, so automatically and accurately identifying potentially defective modules early in the software development life cycle, that is, software defect detection, is a key problem in software engineering. Deep-learning-based defect detection on source code is an important and efficient technical route. Source code naturally has a sequential structure, so research often encodes the code sequence with a pre-trained language model and predicts defects with a classifier. Source code also has structure: it can be represented as an abstract syntax tree, a control flow or data flow graph, or a code property graph, over which a graph neural network propagates information between nodes to capture long-distance dependencies.
However, studies have pointed out that when only code sequences are used as input, the model still struggles to stably understand statement semantics involving complex logic, pointer operations and multiple operators, and to fully capture the execution order and structural information of code. Graph-structure-based methods enhance structural awareness, but the complex graph construction process often brings large computation and time overhead, and graph neural networks are prone to over-smoothing during deep propagation, which makes node representations converge and makes fine-grained deep semantics hard to capture, thereby degrading defect detection. For this reason, research has begun to introduce auxiliary information beyond the source code to compensate for the pre-trained model's insufficient modeling of execution semantics and structural information. As large language models have become better at code understanding, researchers have begun to explore external semantic information generated by large language models to enhance defect detection, for example generating comments or summaries and fusing them with source code features to improve the model's ability to capture vulnerability patterns. Some studies also design dedicated fusion mechanisms that let natural language explanations participate in attention computation and form semantics complementary to the code representation.
Meanwhile, related research has also observed that when a large language model is used directly for zero-shot vulnerability judgment, its output may be unstable and tends to judge most samples as vulnerable. In addition, the free-text output of a large language model may contain redundant content or content that is hard to normalize, which causes problems such as noise amplification and distribution inconsistency in the downstream fusion model. Therefore, how to exploit the capability of large language models stably, at sustainable engineering cost and with reproducible evaluation, has become key to further research and engineering deployment in this direction. Because of the excellent performance of large language models (LLMs) in code understanding, scholars have begun to introduce semantic information generated by large language models into vulnerability detection tasks, but research on the deep fusion of code logic and LLM-generated content is still at an early stage. Wen et al. proposed a "structured natural language comment tree", which attempts to fuse comment semantics with the original code information by attaching comments generated by a large model to the abstract syntax tree nodes of the code; however, this scheme fails to exploit substantive logical structure information and lacks noise filtering and feature compression mechanisms. In view of the prior art, the shortcomings of current research include: The engineering cost of completing vulnerability discrimination directly with a large language model is high and the output stability is insufficient; in recent years, more and more methods guide the large language model to directly give a vulnerability classification conclusion in th