CN-122020671-A - Source code vulnerability detection method and device based on instruction perception, electronic equipment and storage medium


Abstract

The invention discloses a source code vulnerability detection method and device based on instruction perception, electronic equipment and a storage medium, and belongs to the technical field of vulnerability analysis. The method comprises: obtaining a source code to be detected, and splicing a preset task instruction prefix with the source code to generate a task perception input sequence; inputting the task perception input sequence into a preset vulnerability feature extraction model to generate a source code semantic feature vector; retrieving, from a preset vulnerability knowledge base and based on the source code semantic feature vector, historical vulnerability vectors whose similarity meets a preset condition, so as to obtain the corresponding vulnerability pattern labels and historical vulnerability analysis reports; constructing a detection prompt word context according to the task perception input sequence, the vulnerability pattern labels and the historical vulnerability analysis reports; and inputting the detection prompt word context into a preset generative large language model to generate a source code vulnerability detection result. The method and device address the problem that code vulnerability detection in the prior art is not accurate enough.
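The flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the separator string, the 0.8 similarity threshold, the `top_k` limit, the dictionary layout of knowledge base entries and all function names are assumptions made for illustration, and the vulnerability feature extraction model is left out (the caller supplies the query vector directly).

```python
import math
from typing import Dict, List, Sequence

# Hypothetical separation identifier; the source only says it is "preset".
SEPARATOR = "\n[SEP]\n"


def build_task_input(prefix: str, source_code: str) -> str:
    """Merge text in the order: instruction prefix, separator, source code."""
    return f"{prefix}{SEPARATOR}{source_code}"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(query_vec: Sequence[float],
                        knowledge_base: List[Dict],
                        threshold: float = 0.8,
                        top_k: int = 3) -> List[Dict]:
    """Return entries whose similarity to the query meets the (assumed)
    preset condition, most similar first."""
    scored = [(cosine(query_vec, entry["vector"]), entry)
              for entry in knowledge_base]
    hits = [pair for pair in scored if pair[0] >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in hits[:top_k]]


def build_detection_context(task_input: str, candidates: List[Dict]) -> str:
    """Combine the task perception input with each candidate's
    vulnerability pattern label and historical analysis report."""
    parts = [task_input]
    for i, entry in enumerate(candidates, start=1):
        parts.append(
            f"Reference {i}\n"
            f"Pattern label: {entry['label']}\n"
            f"Historical report: {entry['report']}"
        )
    return "\n\n".join(parts)
```

The string returned by `build_detection_context` is what would be passed to the generative large language model. Applying the threshold before truncating to `top_k` keeps low-similarity entries out of the prompt even when the knowledge base has few close matches.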

Inventors

LI YAN
XU SIYAO
TAN SHUFENG
ZHAN CONGCONG
NONG CAIYAN
ZHANG KAI
CHEN YUZE
ZHOU GANG
XIE SHANYI

Assignees

Electric Power Research Institute of Guangdong Power Grid Co., Ltd. (广东电网有限责任公司电力科学研究院)

Dates

Publication Date: 20260512
Application Date: 20260212

Claims (10)

1. A source code vulnerability detection method based on instruction perception, characterized by comprising the following steps: acquiring a source code to be detected; splicing a preset task instruction prefix and the source code to be detected to generate a task perception input sequence, wherein the task instruction prefix is a natural language text used for guiding generation of a source code semantic feature vector; inputting the task perception input sequence into a preset vulnerability feature extraction model, so that the vulnerability feature extraction model generates the source code semantic feature vector according to the task perception input sequence; retrieving, from a preset vulnerability knowledge base and according to the source code semantic feature vector, historical vulnerability vectors whose similarity to the source code semantic feature vector meets a preset condition, and taking them as candidate vulnerability vectors; extracting, based on each candidate vulnerability vector, the vulnerability pattern label and the historical vulnerability analysis report corresponding to each candidate vulnerability vector from the preset vulnerability knowledge base; constructing a detection prompt word context according to the task perception input sequence, the vulnerability pattern labels corresponding to the candidate vulnerability vectors and the historical vulnerability analysis reports corresponding to the candidate vulnerability vectors; and inputting the detection prompt word context into a preset generative large language model, so that the generative large language model generates a source code vulnerability detection result according to the detection prompt word context.
2. The method for detecting source code vulnerabilities based on instruction awareness according to claim 1, wherein splicing the preset task instruction prefix and the source code to be detected to generate the task perception input sequence comprises: taking the preset task instruction prefix as a guide text, placing the guide text in front of the source code to be detected, and inserting a preset separation identifier between the task instruction prefix and the source code to be detected; and merging the text in the order of the task instruction prefix, the separation identifier and the source code to be detected to generate the task perception input sequence.
3. The instruction awareness based source code vulnerability detection method of claim 2, wherein the vulnerability feature extraction model is trained by: obtaining a vulnerability training sample set, wherein the vulnerability training sample set comprises a plurality of vulnerability training samples, and each vulnerability training sample comprises a query code, a corresponding positive sample and a corresponding difficult negative sample; for each vulnerability training sample, splicing the preset task instruction prefix and the current query code to generate a current task perception query sequence, splicing the preset task instruction prefix and the current positive sample to generate a current task perception positive sample sequence, splicing the preset task instruction prefix and the current difficult negative sample to generate a current task perception difficult negative sample sequence, and taking the current task perception query sequence, the current task perception positive sample sequence and the current task perception difficult negative sample sequence as a training task perception sequence sample; constructing a training task perception sequence set from the training task perception sequence samples; dividing the training task perception sequence set into a plurality of batches of training samples according to a preset batch size; and repeatedly executing a loop process until a preset number of training iterations is reached, to generate the vulnerability feature extraction model; wherein the loop process comprises: sequentially inputting each training task perception sequence sample in the current batch of training samples into the current vulnerability feature extraction model, and outputting the source code semantic feature vectors corresponding to each training task perception sequence sample in the current batch, wherein the source code semantic feature vectors comprise a query vector corresponding to the task perception query sequence, a positive sample vector corresponding to the task perception positive sample sequence and a difficult negative sample vector corresponding to the task perception difficult negative sample sequence; for each training task perception sequence sample in the current batch, inputting the task perception query sequence and the task perception difficult negative sample sequence of the current training task perception sequence sample into a preset teacher model, so that the teacher model scores the vulnerability logic similarity between the two sequences and generates a penalty weight corresponding to the current training task perception sequence sample; calculating a loss function value through a preset weighted loss function according to the query vector, the positive sample vector, the difficult negative sample vector and the penalty weight corresponding to each training task perception sequence sample in the current batch; optimizing the current vulnerability feature extraction model according to the loss function value by means of an optimizer to generate an optimized vulnerability feature extraction model; and judging whether the current number of training iterations has reached the preset number of training iterations: if so, taking the optimized vulnerability feature extraction model as the final vulnerability feature extraction model; if not, updating the current vulnerability feature extraction model to the optimized vulnerability feature extraction model and selecting the next batch of training samples as the current batch of training samples.
4. The instruction awareness based source code vulnerability detection method of claim 3, wherein the difficult negative sample of any vulnerability training sample is obtained by: obtaining a candidate code library, wherein each candidate code in the candidate code library has a corresponding vulnerability pattern label, and the vulnerability pattern label represents the vulnerability type of the code; screening out, from the candidate code library and according to the vulnerability pattern labels corresponding to the candidate codes, target negative candidate codes whose vulnerability pattern labels differ from that of the query code; inputting the query code into a preset universal code embedding model, so that the universal code embedding model generates a reference semantic vector of the query code; sequentially inputting each target negative candidate code into the preset universal code embedding model, so that the universal code embedding model generates a contrast semantic vector of each target negative candidate code; calculating the similarity between the reference semantic vector and the contrast semantic vector of each target negative candidate code to generate a similarity score of each target negative candidate code; sorting the similarity scores of the target negative candidate codes in descending order to generate a sorting result; and selecting, from the sorting result, the target negative candidate codes ranked within a preset number of top positions as difficult negative samples.
5. The instruction awareness based source code vulnerability detection method of claim 4, wherein the positive sample of any vulnerability training sample is obtained by: screening out, from the candidate code library and according to the vulnerability pattern labels corresponding to the candidate codes, target positive candidate codes having the same vulnerability pattern label as the query code; sequentially inputting each target positive candidate code into the preset universal code embedding model, so that the universal code embedding model generates a contrast semantic vector of each target positive candidate code; calculating the similarity between the reference semantic vector and the contrast semantic vector of each target positive candidate code to generate a similarity score of each target positive candidate code; and selecting, from the target positive candidate codes, the one with the highest similarity score among those whose similarity is less than 1 as the positive sample.
6. The instruction awareness based source code vulnerability detection method of claim 5, wherein the preset teacher model is trained by: obtaining a teacher model training set, wherein the teacher model training set comprises a plurality of groups of vulnerability comparison chains, and each group of vulnerability comparison chains comprises a first code segment containing a vulnerability, a second code segment obtained by repairing the first code segment, and a corresponding vulnerability logic analysis report; performing, based on a preset teacher base model, differential analysis on the first code segment and the second code segment to generate vulnerability feature logic differences; and taking the vulnerability feature logic differences and the vulnerability logic analysis report as supervision signals and performing supervised fine-tuning on the preset teacher base model to obtain the teacher model, wherein the teacher model is used for outputting penalty weights representing the logic similarity between task perception query sequences and task perception difficult negative sample sequences.
7. The instruction awareness based source code vulnerability detection method of claim 6, wherein the preset generative large language model is trained by: obtaining generative training samples, wherein each generative training sample comprises a simulated detection instruction, a code instance to be detected, similar code fragments obtained through retrieval, a corresponding historical vulnerability analysis report and a standard expert detection conclusion; splicing the simulated detection instruction, the code instance to be detected, the similar code fragments and the historical vulnerability analysis report according to a template to construct an input prompt word sequence; and taking the input prompt word sequence as input and the standard expert detection conclusion as target output, and performing supervised fine-tuning on a preset large language model to generate the generative large language model, wherein the generative large language model is used for generating, according to the input task perception input sequence, a detection result containing vulnerability type positioning and vulnerability repair suggestions.
8. A source code vulnerability detection device based on instruction perception, characterized by comprising a data acquisition module, a feature extraction module, a vector retrieval module and a vulnerability detection module; the data acquisition module is used for acquiring a source code to be detected; the feature extraction module is used for splicing a preset task instruction prefix and the source code to be detected to generate a task perception input sequence, wherein the task instruction prefix is a natural language text used for guiding generation of a source code semantic feature vector; the vector retrieval module is used for retrieving, from a preset vulnerability knowledge base and according to the source code semantic feature vector, historical vulnerability vectors whose similarity meets a preset condition, and taking those historical vulnerability vectors as candidate vulnerability vectors; the vulnerability detection module is used for constructing a detection prompt word context according to the task perception input sequence, the vulnerability pattern labels corresponding to the candidate vulnerability vectors and the historical vulnerability analysis reports corresponding to the candidate vulnerability vectors, and inputting the detection prompt word context into a preset generative large language model, so that the generative large language model generates a source code vulnerability detection result according to the detection prompt word context.
9. An electronic device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the instruction awareness based source code vulnerability detection method of any one of claims 1-7 when executing the computer program.
10. A storage medium comprising a stored computer program, wherein the computer program, when run, controls a device in which the storage medium is located to perform the instruction awareness based source code vulnerability detection method of any one of claims 1-7.
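
The sample selection described in claims 4 and 5 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: cosine similarity is assumed as the similarity measure (the claims only say "similarity"), `embed` stands in for the universal code embedding model, and the names `select_hard_negatives`, `select_positive`, `top_n` and `eps` are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Candidate = Tuple[str, str]  # (code, vulnerability pattern label)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_hard_negatives(query_code: str, query_label: str,
                          candidates: List[Candidate],
                          embed: Callable[[str], Vector],
                          top_n: int = 1) -> List[str]:
    """Keep candidates with a DIFFERENT label, rank them by similarity to
    the query in descending order, and return the top_n most similar."""
    reference = embed(query_code)
    negatives = [code for code, label in candidates if label != query_label]
    ranked = sorted(negatives,
                    key=lambda code: cosine(reference, embed(code)),
                    reverse=True)
    return ranked[:top_n]


def select_positive(query_code: str, query_label: str,
                    candidates: List[Candidate],
                    embed: Callable[[str], Vector],
                    eps: float = 1e-9) -> Optional[str]:
    """Among candidates with the SAME label, return the most similar one
    whose similarity is strictly below 1 (skipping exact duplicates)."""
    reference = embed(query_code)
    best_code, best_score = None, -math.inf
    for code, label in candidates:
        if label != query_label:
            continue
        score = cosine(reference, embed(code))
        # eps is an assumed tolerance for floating-point "similarity == 1".
        if score < 1.0 - eps and score > best_score:
            best_code, best_score = code, score
    return best_code
```

Ranking negatives by similarity yields codes that look like the query but carry a different vulnerability pattern label, while the "less than 1" condition on positives appears intended to keep the query's own duplicate from being chosen as its positive sample.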

Description

Source code vulnerability detection method and device based on instruction perception, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of vulnerability analysis, in particular to a source code vulnerability detection method and device based on instruction perception, electronic equipment and a storage medium.

Background

Source code vulnerability detection is a core link in guaranteeing software supply chain security and preventing network attacks. With the explosive growth in the scale of software systems, manual auditing of code logic can no longer meet the need to discover potential security risks efficiently. Automatic vulnerability detection technology can quickly identify known vulnerability patterns in a large-scale code corpus, which reduces repair cost in the early stages of the software development life cycle and avoids the serious economic loss and private data leakage caused by malicious exploitation of vulnerabilities. It is therefore of great significance for building a network security protection system.

However, existing code vulnerability detection methods still suffer from insufficiently accurate detection results in application. When an existing detection flow acquires reference information, it lacks guided processing of the code to be detected under a specific detection intent. When facing different security audit targets, the detection system cannot dynamically adjust which code features it emphasizes according to the specific detection instruction. As a result, the historical data recalled in the retrieval stage is similar only in code appearance or general function, but is completely mismatched with the detection intent at the level of deep security logic. Retrieval results that lack task perception feed the subsequent generation stage a large amount of interfering information irrelevant to the current detection intent, so that no effective evidence is available to support vulnerability determination, and accurate analysis of vulnerability features against a complex code background is limited.

Disclosure of Invention

The embodiment of the invention provides a source code vulnerability detection method and device based on instruction perception, electronic equipment and a storage medium, which can solve the problem that code vulnerability detection in the prior art is inaccurate.

The embodiment of the invention provides a source code vulnerability detection method based on instruction awareness, which comprises the following steps:

Acquiring a source code to be detected;

Splicing a preset task instruction prefix and the source code to be detected to generate a task perception input sequence, wherein the task instruction prefix is a natural language text used for guiding generation of a source code semantic feature vector;

Inputting the task perception input sequence into a preset vulnerability feature extraction model, so that the vulnerability feature extraction model generates the source code semantic feature vector according to the task perception input sequence;

Retrieving, from a preset vulnerability knowledge base and according to the source code semantic feature vector, historical vulnerability vectors whose similarity to the source code semantic feature vector meets a preset condition, and taking them as candidate vulnerability vectors;

Extracting, based on each candidate vulnerability vector, the vulnerability pattern label and the historical vulnerability analysis report corresponding to each candidate vulnerability vector from the preset vulnerability knowledge base;

Constructing a detection prompt word context according to the task perception input sequence, the vulnerability pattern labels corresponding to the candidate vulnerability vectors and the historical vulnerability analysis reports corresponding to the candidate vulnerability vectors, and inputting the detection prompt word context into a preset generative large language model, so that the generative large language model generates a source code vulnerability detection result according to the detection prompt word context.

Further, splicing the preset task instruction prefix and the source code to be detected to generate the task perception input sequence includes:

Taking the preset task instruction prefix as a guide text, placing the guide text in front of the source code to be detected, and inserting a preset separation identifier between the task instruction prefix and the source code to be detected;

Merging the text in the order of the task instruction prefix, the separation identifier and the source code to be detected to generate the task perception input sequence.

Further, the vulnerability feature extraction model is trained by: Obtaining a vulnerability training sample set, wherein the vulnerabili