CN-122020675-A - Method and system for automatically generating vulnerability static analysis evaluation data for MPI programs


Abstract

The invention discloses a method and system for automatically generating vulnerability static analysis evaluation data for MPI programs, and belongs to the technical field of software testing. To address the lack of high-quality evaluation data sets for existing MPI static analysis tools, the invention adopts a program mutation method driven by an MPI vulnerability pattern knowledge base. It constructs a formal description system for MPI vulnerability patterns, designs semantics-preserving vulnerability injection operators, and verifies vulnerability reachability with a lightweight symbolic execution technique tailored to MPI programs, thereby automatically generating and labeling MPI vulnerability samples. The evaluation data set generated by the method offers comprehensive coverage of vulnerability types, accurate vulnerability labels, high sample diversity, and strong traceability, and supports the testing, evaluation, and improvement of MPI static analysis tools.

Inventors

CHEN QIANGPU
PAN ZULIE
LI YUWEI
ZHAO JUN
HU MIAO
LI YANG
WANG RUIPENG
CHEN ZIYU

Assignees

National University of Defense Technology of the Chinese People's Liberation Army (中国人民解放军国防科技大学)

Dates

Publication Date: 20260512
Application Date: 20260413

Claims (10)

1. A method for automatically generating vulnerability static analysis evaluation data for an MPI program, characterized by comprising the following steps:
S1, constructing an MPI source code library and a vulnerability pattern knowledge base at an input layer;
S2, calling a code preprocessing module to perform lexical and syntactic analysis on the MPI source code in the MPI source code library and to construct an abstract syntax tree (AST), identifying the API calls in the program corresponding to the MPI source code so as to extract parameter information of communication operations, and further performing control flow analysis and data flow analysis, the results serving as program analysis results;
S3, calling a vulnerability pattern matching module to load the formal description of a target vulnerability pattern from the vulnerability pattern knowledge base, matching the program analysis results against the formal description of the target vulnerability pattern with an injection point detector so as to determine code positions that satisfy the vulnerability preconditions, and performing constraint checking;
S4, calling a program mutation module to select a mutation operator according to the vulnerability type of the target vulnerability pattern, applying the mutation operator to perform a code mutation operation at the injection point corresponding to the code position, and checking, by a semantic verifier, the syntactic correctness and MPI semantic consistency of the mutated program code;
S5, calling a reachability verification module to perform path analysis on the mutated program code, based on a lightweight symbolic execution technique combined with an MPI semantic model and a constraint solver, to obtain a feasible path from the program entry to the vulnerability point, so as to verify the reachability of the injected vulnerability and determine its triggering conditions under specific execution conditions;
S6, calling a data set management module to store the verified vulnerability samples into a data set, generate vulnerability labeling information, evaluate quality indexes of the data set, and optimize the data set based on a diversity selection strategy;
S7, outputting the data set containing the labeled vulnerabilities at an output layer.
2. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 1, wherein in S1, the vulnerability pattern knowledge base constructed at the input layer comprises four classes of vulnerabilities, namely communication vulnerabilities, synchronization vulnerabilities, resource vulnerabilities, and type vulnerabilities, and each class is stored in a formal description format, wherein the formal description comprises a vulnerability pattern identifier, a vulnerability pattern name, the category to which the vulnerability belongs, vulnerability preconditions, a vulnerability triggering pattern, vulnerability location information, and the impact produced by the vulnerability.
3. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 2, wherein in S2:
the lexical and syntactic analysis comprises: using a Clang-based front-end tool to perform lexical analysis on the MPI source code, converting the character stream of the MPI source code into a token sequence, and performing syntactic analysis to organize the token sequence into a syntactic structure according to C/C++ grammar rules;
constructing the abstract syntax tree (AST) comprises: constructing the AST based on the results of the lexical and syntactic analysis, wherein each node corresponds to one syntactic component of the program and attribute fields are added to the nodes;
identifying the API calls in the program corresponding to the MPI source code comprises: extracting parameter information of communication operations, the parameter information comprising a communication type, a communication mode, a source/target process, a message tag, a data type, a communicator, a buffer address, and a buffer size;
the control flow analysis comprises: constructing a program control flow graph (CFG) representing possible execution paths, wherein each node corresponds to a basic block, a basic block represents a sequence of statements executed consecutively, and edges represent control transfer relations among basic blocks; for the program corresponding to the MPI source code, potential inter-process synchronization edges are added to the CFG to represent inter-process dependencies that MPI communication operations may introduce;
the data flow analysis comprises: computing definition-use chains and liveness information of variables on the CFG, so as to quantify the impact of vulnerability injection on program data dependencies.
4. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 3, wherein in S3, determining by matching the code positions that satisfy the vulnerability preconditions comprises: converting the formally described preconditions into executable matching rules by a rule-based pattern matching method; a matching rule has the form R = (S, C, A), where S is a selector for locating candidate code positions, C is a checker for verifying whether a candidate code position satisfies the preconditions, and A is an action for recording the matching result.
5. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 4, wherein in S3, the constraint checking comprises a syntactic constraint check, a semantic constraint check, and a reachability constraint check.
6. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 5, wherein in S4, performing the code mutation operation comprises:
selecting an applicable mutation operator from a mutation operator library according to the vulnerability type of the target vulnerability pattern and the characteristics of the injection point, wherein the selection strategy takes into account the applicability conditions, expected effect, and historical success rate of the mutation operator;
adopting a constrained random strategy to randomly select parameter values subject to the constraints of the vulnerability pattern;
performing the code modification at the level of the abstract syntax tree (AST) by adding, deleting, and/or modifying AST nodes;
compiling the mutated program code to check for syntax errors and, if compilation fails, re-executing the code modification, re-selecting the mutation operator, or re-configuring the parameters;
performing an MPI semantic consistency check on program code that passes the syntactic correctness check, the check covering the consistency of collective operations and the validity of communication parameters.
7. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 6, wherein in S4, the mutation operators comprise communication mutation operators, synchronization mutation operators, resource mutation operators, and type mutation operators.
8. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 7, wherein in S5, verifying the reachability of the injected vulnerability comprises:
creating an initial symbolic execution state for the mutated program code, including symbolized program inputs and a process state array;
constructing a multi-process symbolic execution state space according to the number of processes of the program, wherein each process maintains an independent program counter and local symbolic state;
selectively exploring different interleaved execution orders of the processes by adopting a partial order reduction strategy and, for each interleaving, symbolically executing the statement sequence of each process;
when an MPI call is executed, performing semantic analysis using the MPI semantic model and, for communication operations, updating the state of the inter-process message queues;
when execution reaches the injection point, collecting the path constraints from the program entry to the injection point and submitting them to the constraint solver to determine whether input values satisfying all the constraints exist; if so, the vulnerability is reachable, otherwise it is unreachable;
for unreachable vulnerabilities, marking the corresponding mutated program code as an invalid sample.
9. The method for automatically generating vulnerability static analysis evaluation data for an MPI program according to claim 8, wherein in S5: under the partial order reduction strategy, if two operations are independent, only one of their execution orders is explored; the independent operations comprise operations on different communicators, disjoint point-to-point communications, and purely local computation; and the symbolic execution state space is reduced by identifying independent operations and pruning equivalent interleavings.
10. A system for automatically generating vulnerability static analysis evaluation data for an MPI program, the system comprising:
an input module configured to construct an MPI source code library and a vulnerability pattern knowledge base;
a code preprocessing module configured to perform lexical and syntactic analysis on the MPI source code in the MPI source code library and to construct an abstract syntax tree (AST), identifying the API calls in the program corresponding to the MPI source code so as to extract parameter information of communication operations, and further to perform control flow analysis and data flow analysis, the results serving as program analysis results;
a vulnerability pattern matching module configured to load the formal description of a target vulnerability pattern from the vulnerability pattern knowledge base, match the program analysis results against the formal description of the target vulnerability pattern with an injection point detector so as to determine code positions that satisfy the vulnerability preconditions, and perform constraint checking;
a program mutation module configured to select a mutation operator according to the vulnerability type of the target vulnerability pattern, apply the mutation operator to perform a code mutation operation at the injection point corresponding to the code position, and check, by a semantic verifier, the syntactic correctness and MPI semantic consistency of the mutated program code;
a reachability verification module configured to perform path analysis on the mutated program code, based on a lightweight symbolic execution technique combined with an MPI semantic model and a constraint solver, to obtain a feasible path from the program entry to the vulnerability point, so as to verify the reachability of the injected vulnerability and determine its triggering conditions under specific execution conditions;
a data set management module configured to store the verified vulnerability samples into a data set, generate vulnerability labeling information, evaluate quality indexes of the data set, and optimize the data set based on a diversity selection strategy;
an output module configured to output the data set containing the labeled vulnerabilities.
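
As an illustration of the matching rule in claim 4 (a triple of selector S, checker C, and action A), the following minimal Python sketch applies one rule to a list of MPI call records of the kind claim 3 extracts from the AST. The `MpiCall` record, its field names, and the example precondition (a receive of `MPI_INT` with a literal element count) are assumptions made for illustration only; they are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MpiCall:
    """Hypothetical record for one MPI call found in the AST."""
    name: str      # API name, e.g. "MPI_Recv"
    line: int      # source line of the call
    peer: str      # source/target process expression
    tag: str       # message tag expression
    datatype: str  # MPI datatype, e.g. "MPI_INT"
    count: str     # element count expression


@dataclass
class MatchRule:
    """Rule R = (S, C, A): selector, checker, action."""
    selector: Callable[[MpiCall], bool]  # S: locate candidate positions
    checker: Callable[[MpiCall], bool]   # C: verify the precondition
    action: Callable[[MpiCall], None]    # A: record the match

    def apply(self, calls: List[MpiCall]) -> None:
        for call in calls:
            if self.selector(call) and self.checker(call):
                self.action(call)


def find_injection_points(calls: List[MpiCall]) -> List[int]:
    """Return the source lines that satisfy the example precondition."""
    matches: List[MpiCall] = []
    rule = MatchRule(
        selector=lambda c: c.name == "MPI_Recv",
        # Assumed precondition: integer payload with a literal count.
        checker=lambda c: c.datatype == "MPI_INT" and c.count.isdigit(),
        action=matches.append,
    )
    rule.apply(calls)
    return [c.line for c in matches]


calls = [
    MpiCall("MPI_Send", 12, "1", "0", "MPI_INT", "4"),
    MpiCall("MPI_Recv", 20, "0", "0", "MPI_INT", "4"),
    MpiCall("MPI_Recv", 31, "0", "1", "MPI_DOUBLE", "n"),
]
print(find_injection_points(calls))  # [20]
```

Keeping S, C, and A as separate callables means a new vulnerability pattern only needs a new triple; the traversal in `apply` stays unchanged.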

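The partial order reduction of claim 9 can be pictured with a small Python sketch: two interleavings are treated as equivalent when they differ only in the order of independent operations, so one representative per equivalence class is enough. The `Op` record and the independence test below (different processes, and either a purely local operation or different communicators) are simplifying assumptions; disjoint point-to-point communication is not modeled. For clarity the sketch enumerates all interleavings and then groups them, whereas a practical explorer would prune during the search instead.

```python
from itertools import combinations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    """Hypothetical operation: rank, position in its process, communicator."""
    rank: int
    index: int
    comm: Optional[str]  # None marks purely local computation


def independent(a: Op, b: Op) -> bool:
    """Assumed independence: different ranks and no shared communicator."""
    if a.rank == b.rank:
        return False  # program order within one process must be kept
    if a.comm is None or b.comm is None:
        return True   # local computation commutes with other processes
    return a.comm != b.comm


def interleavings(p, q):
    """Yield every merge of p and q that keeps each process's own order."""
    n, m = len(p), len(q)
    for positions in combinations(range(n + m), n):
        chosen = set(positions)
        pi, qi = iter(p), iter(q)
        yield tuple(next(pi) if k in chosen else next(qi)
                    for k in range(n + m))


def canonical(trace):
    """Smallest reordering of trace that keeps all dependent pairs in order."""
    remaining, out = list(trace), []
    while remaining:
        # An op is ready if it is independent of everything still before it.
        ready = [op for i, op in enumerate(remaining)
                 if all(independent(prev, op) for prev in remaining[:i])]
        nxt = min(ready)
        remaining.remove(nxt)
        out.append(nxt)
    return tuple(out)


def reduced_interleavings(p, q):
    """One representative per class of equivalent interleavings."""
    return {canonical(t) for t in interleavings(p, q)}


# Rank 0: communicate on "A", then compute locally.
P0 = [Op(0, 0, "A"), Op(0, 1, None)]
# Rank 1: communicate on "A", then on "B".
P1 = [Op(1, 0, "A"), Op(1, 1, "B")]

print(len(list(interleavings(P0, P1))))     # 6
print(len(reduced_interleavings(P0, P1)))   # 2
```

In this example only the two operations on communicator "A" depend on each other, so the six interleavings collapse into two classes, one for each relative order of those two operations.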
Description

Method and system for automatically generating vulnerability static analysis evaluation data for MPI programs

Technical Field

The invention belongs to the technical field of software testing, and particularly relates to a method and system for automatically generating vulnerability static analysis evaluation data for MPI programs.

Background

The Message Passing Interface (MPI) is the most widely used parallel programming standard in high performance computing, and is applied in compute-intensive fields such as scientific computing, meteorological simulation, molecular dynamics, and artificial intelligence training. By defining a standardized set of message passing semantics, MPI enables efficient data exchange and collaborative computation among multiple processes running on a distributed memory system. However, the parallel nature of MPI programs also presents unique security challenges. Unlike conventional serial programs, MPI programs involve message passing, synchronization, and coordination among multiple processes, which exposes them to a variety of distinctive vulnerability types, including but not limited to deadlock vulnerabilities, race condition vulnerabilities, buffer overflow vulnerabilities, message type mismatch vulnerabilities, and resource leak vulnerabilities. These vulnerabilities may not only crash a program or produce erroneous calculation results, but may also be exploited by malicious attackers, with serious security consequences such as information leakage or system takeover. Effective vulnerability detection and analysis for MPI programs is therefore of both theoretical and practical importance.

Static analysis is a technique for discovering potential vulnerabilities and bugs by analyzing program source code or intermediate representations without actually executing the program.
Compared with dynamic testing, static analysis offers high coverage, requires no test cases, and can find problems early in development. Academia and industry have developed a variety of tools and methods for static analysis of MPI programs. Mainstream MPI static analysis tools include MUST, MPI-Checker, PARCOACH, and ISP. These tools use different analysis techniques, such as data flow analysis, abstract interpretation, and model checking, and can detect vulnerability types such as deadlocks, race conditions, and parameter mismatches in MPI programs. However, because of the complexity of MPI programs and the particularities of parallel semantics, the detection capability and accuracy of these tools differ widely, and unified, comprehensive evaluation criteria and benchmark data sets are lacking.

An evaluation data set (benchmark data set) is a key resource for measuring the effectiveness of a static analysis tool. A high-quality evaluation data set should contain enough program samples to cover multiple vulnerability types, and should provide accurate vulnerability labeling information, including the vulnerability location, vulnerability type, and vulnerability triggering conditions. However, the field of MPI static analysis currently faces a serious shortage of evaluation data sets, mainly in the following respects.

First, existing data sets are limited in size and narrow in coverage. The publicly available MPI vulnerability data sets come mainly from manually constructed samples in academic research, such as the test case set shipped with the MUST tool and the verification samples of MPI-Checker.
These data sets typically contain only tens to hundreds of program samples and mainly cover basic vulnerability types such as deadlocks and simple parameter errors; they provide insufficient coverage of advanced vulnerability types such as complex race conditions, implicit type conversion errors, and collective communication mismatches.

Second, manually constructing data sets is costly and inefficient. Because of the complex parallel semantics of MPI programs, manually writing test cases with specific vulnerability patterns requires developers to have deep MPI programming experience and security knowledge. By empirical estimate, a qualified MPI vulnerability sample typically takes hours to days from design to verification, which severely constrains how fast an evaluation data set can grow.

Third, existing data sets lack systematic organization and traceability. Manually constructed vulnerability samples are typically based on the personal experience of the developer and lack a systematic cataloging and formal description of MPI vulnerability patterns. The resulting data sets may suffer from repeated vulnerability patterns, unbalanced coverage, difficulty in tracing vulnerability root causes, and the like,