CN-121980564-A - Malicious software family classification method and device based on large model data enhancement


Abstract

The invention discloses a malware family classification method and device based on large model data enhancement. First, a real API call sequence of a malware sample is acquired, virtual API call sequences are generated from it using a large language model, and the virtual sequences are screened with dual thresholds. The real API call sequences and the API call sequences of the valid enhancement samples are then mapped into numerical feature vectors containing presence features and co-occurrence features. A classification model is trained on an enhanced data set formed by mixing the feature vectors of the real API call sequences with those of the valid enhancement samples, and the trained model is used to classify malware families. The method can generate high-quality, logically coherent virtual samples under few-sample conditions, filters out phantom (hallucinated) samples and redundant samples, addresses the shortage of samples for novel malware families, and significantly improves the generalization ability and robustness of the classifier in complex scenarios.

Inventors

WANG XINGQI
XIE ZHONGHAO

Assignees

Hangzhou Dianzi University (杭州电子科技大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-29

Claims (10)

1. A malware family classification method based on large model data enhancement, characterized by comprising the following steps: S1, acquiring a real API call sequence of a malware sample; S2, generating a virtual API call sequence based on the real API call sequence by using a large language model, calculating the maximum Jaccard similarity between the virtual sequence and the real sequences, and retaining the virtual sequence as a valid enhancement sample only when its maximum Jaccard similarity lies within a preset threshold range; S3, constructing an API co-occurrence matrix, and mapping the real API call sequences and the API call sequences of the valid enhancement samples into numerical feature vectors containing presence features and co-occurrence features, wherein the presence feature represents whether each API exists in the call sequence, and the co-occurrence feature represents the association strength among the APIs in the call sequence; S4, training a classification model on an enhanced data set formed by mixing the feature vectors of the real API call sequences with the feature vectors of the valid enhancement samples, and classifying malware families with the trained classification model.
2. The malware family classification method based on large model data enhancement according to claim 1, characterized in that disassembly analysis is performed on the malware sample, all imported dependent modules are traversed, the functions referenced in each dependent module are enumerated, and the referenced function names are undecorated (demangled) to restore them to standard API names, thereby obtaining the real API call sequence of the malware sample.
3. The malware family classification method based on large model data enhancement according to claim 1, characterized in that the real API call sequences of several randomly selected malware samples of the malware family to be enhanced are used as seed samples; a structured prompt comprising a role definition, a task objective, a reference context and output constraints is constructed and sent to the large language model; and an API list is parsed from the text stream returned by the large language model to serve as a candidate virtual sequence.
4. The malware family classification method based on large model data enhancement according to claim 3, wherein the role definition is a malware expert; the task objective is to synthesize a new variant API sequence of the specified family based on the reference context; the reference context comprises the seed samples, a list of high-frequency APIs of the specified family, and a description of the core malicious behavior of the specified family; and the output constraints are to introduce minor variations while preserving the core malicious behavior logic and to output only a space-separated list of API names.
5. The malware family classification method based on large model data enhancement according to claim 3, wherein the maximum Jaccard similarity between the candidate virtual sequence and the seed samples is calculated; when this similarity lies in the range [0.4, 0.95], the candidate virtual sequence is retained as a valid enhancement sample, and otherwise it is discarded.
6. The malware family classification method based on large model data enhancement according to claim 1, wherein all real API sequences and valid enhancement samples are scanned, the APIs that occur more than once are collected, and an API-to-index dictionary is established; a matrix of size N × N is initialized; all samples are traversed, and whenever an API pair consisting of the i-th and j-th APIs of the dictionary appears in the same sample, the element in row i, column j of the matrix is incremented, where i, j = 1, 2, ..., N and N is the size of the dictionary; finally, the elements of the matrix are normalized to obtain the API co-occurrence matrix.
7. The malware family classification method based on large model data enhancement according to claim 6, wherein the presence feature is a 0/1 vector of length N in which each element indicates whether the corresponding API appears in the sample, and the co-occurrence feature is a floating-point vector of length N in which each element is the mean co-occurrence value, in the co-occurrence matrix, between the corresponding API and the other APIs in the sample.
8. The method for classifying a family of malware based on large model data enhancement as in claim 1, wherein the classification model is a multi-layer perceptron.
9. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-8.
10. A malware family classification device based on large model data enhancement, characterized by comprising: a static feature extraction module for performing disassembly analysis on an input malware sample and extracting a real API call sequence; a generative enhancement module for calling a large language model to synthesize virtual API sequences based on the real API call sequence and screening valid enhancement samples through Jaccard similarity verification; a feature embedding module for computing an API co-occurrence matrix and converting an API sequence into a mixed feature vector containing presence features and co-occurrence features; and a classification module for training a classification model on a data set consisting of the real samples and the valid enhancement samples and performing the malware family classification task.
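
The feature construction recited in claims 6 and 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names (`build_vocab`, `build_cooccurrence`, `featurize`) are hypothetical, and several details the claims leave open are assumptions: occurrences are counted over all sequences, co-occurrence counts are symmetric and incremented by 1, normalization divides by the largest matrix element, APIs absent from a sample receive a co-occurrence feature of 0, and the two length-N features are concatenated into one vector of length 2N.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence


def build_vocab(sequences: Sequence[Sequence[str]], min_count: int = 2) -> Dict[str, int]:
    """Map every API that occurs at least `min_count` times to an index.

    Assumption: occurrences are counted over all sequences together.
    """
    counts = Counter(api for seq in sequences for api in seq)
    kept = sorted(api for api, c in counts.items() if c >= min_count)
    return {api: i for i, api in enumerate(kept)}


def build_cooccurrence(sequences: Sequence[Sequence[str]],
                       vocab: Dict[str, int]) -> List[List[float]]:
    """Count API pairs that appear in the same sample, then normalize.

    Assumptions: counts are symmetric, each pair adds 1 per sample,
    and normalization divides by the largest element.
    """
    n = len(vocab)
    matrix = [[0.0] * n for _ in range(n)]
    for seq in sequences:
        present = sorted({vocab[a] for a in seq if a in vocab})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1.0
            matrix[j][i] += 1.0
    peak = max((v for row in matrix for v in row), default=0.0)
    if peak > 0:
        matrix = [[v / peak for v in row] for row in matrix]
    return matrix


def featurize(seq: Sequence[str], vocab: Dict[str, int],
              matrix: List[List[float]]) -> List[float]:
    """Return presence features followed by co-occurrence features (length 2N)."""
    n = len(vocab)
    present = sorted({vocab[a] for a in seq if a in vocab})
    presence = [0.0] * n
    cooccurrence = [0.0] * n
    for i in present:
        presence[i] = 1.0
        others = [j for j in present if j != i]
        if others:
            # Mean co-occurrence value between API i and the other APIs in this sample.
            cooccurrence[i] = sum(matrix[i][j] for j in others) / len(others)
    return presence + cooccurrence
```

For the three toy sequences `A B C`, `A B` and `A C D`, the dictionary keeps A, B and C (D occurs only once), the normalized co-occurrence values are 1.0 for (A, B) and (A, C) and 0.5 for (B, C), and `featurize(["A", "B", "C"], vocab, matrix)` yields `[1.0, 1.0, 1.0, 1.0, 0.75, 0.75]`.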

Description

Malicious software family classification method and device based on large model data enhancement

Technical Field

The invention belongs to the technical field of computers, relates to network security and artificial intelligence, and in particular relates to a malware family classification method and device based on large model data enhancement.

Background

With the rapid development of Internet technology, the variety and quantity of malicious software (malware) have grown explosively, posing a serious threat to network security. Accurately identifying the family to which a malware sample belongs (family classification) is critical to attribution analysis and to formulating targeted defense strategies. Traditional signature-based methods have difficulty coping with malware that mutates frequently. In recent years, deep-learning-based malware classification has made significant progress, but these methods typically rely on large amounts of labeled data for training. In practice, samples of many novel or Advanced Persistent Threat (APT) malware families are extremely scarce, so training data is insufficient; conventional deep learning models then tend to overfit, generalize poorly, and fail to assign new variants to the correct family.

To address the data scarcity problem, new samples can be generated with a generative adversarial network (GAN) or by simple random noise injection. However, GAN training is unstable and prone to mode collapse, while random noise may disrupt the logical structure of API calls, so that the generated samples are not executable or lose the original malicious behavior features.

In addition, controlling the quality of the generated samples, so that the generated data contains neither large amounts of phantom data inconsistent with the real distribution nor mere duplicates of existing samples, is also a technical problem that remains to be solved. There is therefore a need for a method that achieves malware family classification under very-few-sample conditions.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a malware family classification method and device based on large model data enhancement, which use large language model technology to generate high-quality, logically coherent virtual samples and evaluate the quality of those samples, thereby solving the problem of low classification accuracy caused by scarce malware samples and by the low quality of traditional data enhancement methods.

In a first aspect, the invention provides a malware family classification method based on large model data enhancement, the method comprising:

S1, performing disassembly analysis on a malware sample, extracting static features, and obtaining a real API call sequence.

S2, taking the real API call sequence as a seed sample, generating a virtual API call sequence using a large language model, and calculating the maximum Jaccard similarity between the virtual sequence and the real sequences. A lower threshold and an upper threshold are set, and the virtual sample is retained as a valid enhancement sample only when its maximum Jaccard similarity is no lower than the lower threshold and no higher than the upper threshold.

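The dual-threshold screening of step S2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the function names (`jaccard`, `max_jaccard`, `screen`) are hypothetical, each call sequence is reduced to the set of its API names before the Jaccard similarity is computed (the text does not say how sequences are compared), and the default thresholds 0.4 and 0.95 are taken from the range given in claim 5.

```python
from typing import Iterable, List, Sequence


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the API sets of two sequences."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def max_jaccard(candidate: Sequence[str], seeds: Sequence[Sequence[str]]) -> float:
    """Largest similarity between one virtual sequence and any real seed sequence."""
    return max((jaccard(candidate, seed) for seed in seeds), default=0.0)


def screen(candidates: Sequence[Sequence[str]],
           seeds: Sequence[Sequence[str]],
           lower: float = 0.4,
           upper: float = 0.95) -> List[Sequence[str]]:
    """Keep candidates whose maximum similarity lies in [lower, upper].

    Below `lower` a candidate is treated as a phantom sample (too far from
    any real sample); above `upper` it is treated as a redundant near-copy.
    """
    return [c for c in candidates if lower <= max_jaccard(c, seeds) <= upper]
```

With a single seed `CreateFileW WriteFile CloseHandle RegSetValueExW`, an exact copy scores 1.0 and is dropped as redundant, a sequence sharing no API scores 0.0 and is dropped as a phantom sample, and a variant that replaces `RegSetValueExW` with `RegCreateKeyExW` scores 3/5 = 0.6 and is kept.
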
S3, constructing an API co-occurrence matrix, and mapping the real API call sequences and the API call sequences of the valid enhancement samples into numerical feature vectors containing presence features and co-occurrence features. The dimensions of the presence feature and of the co-occurrence feature both equal the size of the API dictionary. An element of the presence feature is 1 if the corresponding API exists and 0 if it does not; an element of the co-occurrence feature is the mean co-occurrence value, in the co-occurrence matrix, between the corresponding API and the other APIs in the sample, and reflects the association strength among the APIs.

S4, training a classification model on an enhanced data set formed by mixing the feature vectors of the real API call sequences with the feature vectors of the valid enhancement samples, and classifying malware families with the trained classification model.

In a second aspect, the invention provides a malware family classification device based on large model data enhancement, comprising: a static feature extraction module for performing disassembly analysis on an input malware sample and extracting a real API call sequence; a generative enhancement module for calling a large language model to synthesize virtual API sequences based on the real API call sequence and screening valid enhancement samples through Jaccard similarity verification; a feature embedding module for computing an API co-occurrence matrix and converting an API sequence into a mixe