CN-122027190-A - Malicious code tampering and counterfeiting identification method, system, storage medium and electronic equipment

Abstract

The invention provides a method, a system, a storage medium and electronic equipment for identifying tampered and counterfeit malicious code. Based on the identity information of a sample to be tested, a candidate normal sample set matching the identity of the sample is screened from a preset normal sample library. The similarity between the technical stack information of the sample to be tested and that of each sample in the candidate normal sample set is then calculated. If any candidate normal sample yields a similarity greater than a preset threshold, the sample to be tested is judged to be a tampered sample and its features are refused storage in the malicious feature library; otherwise, its features are allowed to be stored. Through this two-dimensional cross-verification of developer identity and technical stack environment, tampered samples are identified before malicious features are stored, which avoids the feature pollution caused by normal features being mistakenly entered into the malicious feature library and reduces the false alarm rate at the source.
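
The decision flow summarized in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the `Sample` type, the function names, the exact-match identity screening, the set-based Jaccard similarity and the default threshold of 0.8 are all hypothetical choices, since the text above does not fix a threshold value or a data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # All field names are hypothetical; the source does not define a data model.
    name: str
    identity: str           # developer identity digest
    tech_stack: frozenset   # e.g. {"msvc", "sdk-10", "zlib"}


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets score 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_tampered(sample, normal_library, threshold=0.8):
    """True if a normal sample with the same identity has a technical stack
    whose similarity to the sample's exceeds `threshold` (assumed value)."""
    # Dimension 1: screen candidates by developer identity (exact match here).
    candidates = [n for n in normal_library if n.identity == sample.identity]
    # Dimension 2: compare technical stacks against each candidate.
    return any(jaccard(sample.tech_stack, c.tech_stack) > threshold
               for c in candidates)


def admit_features(sample, normal_library, malicious_library, threshold=0.8):
    """Store the sample in the malicious library only if it is not judged
    tampered. Appending the name is a stand-in for storing real features."""
    if is_tampered(sample, normal_library, threshold):
        return False
    malicious_library.append(sample.name)
    return True


if __name__ == "__main__":
    stack = frozenset({"msvc", "sdk-10", "crt", "zlib", "openssl"})
    normal = [Sample("vendor_app", "dev-123", stack)]
    library = []
    # Same developer identity and same stack: judged tampered, not stored.
    print(admit_features(Sample("patched", "dev-123", stack), normal, library))
    # Unknown identity: no candidates, so the features are stored.
    print(admit_features(Sample("dropper", "dev-999",
                                frozenset({"gcc", "musl"})), normal, library))
```

In this sketch a sample with no identity match in the normal library is always admitted, because the candidate set is empty and no similarity can exceed the threshold.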

Inventors

KONG DERUI
LI SHILEI
ZHAO CHAO
XIAO XINGUANG

Assignees

安天科技集团股份有限公司

Dates

Publication Date: 20260512
Application Date: 20251218

Claims (9)

1. A malicious code tampering and counterfeiting identification method, characterized by comprising the following steps: performing static analysis on a sample to be tested, and extracting identity information and technical stack information of the sample to be tested, wherein the identity information represents the identity of the developer of the sample, and the technical stack information represents the build and runtime environment characteristics of the sample; screening, from a preset normal sample library, a candidate normal sample set matching the identity of the sample to be tested, based on the identity information of the sample to be tested; calculating the similarity between the technical stack information of the sample to be tested and the technical stack information of each sample in the candidate normal sample set; and if there exists a candidate normal sample for which the similarity is greater than a preset threshold, judging that the sample to be tested is a tampered sample and refusing to store its features in a malicious feature library; otherwise, allowing the features of the sample to be tested to be stored.
2. The method of claim 1, wherein the identity information is an identity digest, generated by an irreversible hash algorithm or a custom digest algorithm, that uniquely identifies the developer source of the sample.
3. The method of claim 1, wherein the technical stack information includes at least one of: compiler type, linker configuration, system SDK version, runtime library, third-party dependency library, and build script parameters.
4. The method of claim 1, wherein the similarity is calculated using the Jaccard similarity formula: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, wherein $A$ represents the technical stack information of the sample to be tested, and $B$ represents the technical stack information of a candidate normal sample.
5. The method of claim 1, wherein refusing to store the features in the malicious feature library comprises: generating, for the sample judged to be tampered, an evidence log containing the identity digest and the technical stack difference points; triggering a feature storage interception instruction to prevent the features of the sample from entering the malicious feature library; and marking the sample as a suspected tampered sample and transferring it to a manual review queue for secondary verification.
6. The method of claim 1, implemented by a three-tier modular architecture comprising: a preprocessing module for normalizing the sample to be tested and extracting the identity information and the technical stack information; a primary screening module for performing exact or approximate matching in the preset normal sample library based on the identity information to generate the candidate normal sample set; and a fine screening module for calculating the similarity between the technical stack information of the sample to be tested and the technical stack information of each sample in the candidate normal sample set, and outputting a final judgment result.
7. A malicious code tampering and counterfeiting identification system, characterized by being integrated in a malicious sample analysis platform and used for automatically performing tampering detection before feature storage and outputting a judgment result to a feature management module.
8. A non-transitory computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement the method of any one of claims 1-6.
9. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 8.
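
The identity digest of claim 2 and the rejection handling of claim 5 can be sketched as follows. This is a hedged illustration only: the choice of SHA-256, the developer fields being hashed, the evidence log keys and both function names are assumptions introduced for the example and are not specified by the claims.

```python
import hashlib
import json


def identity_digest(developer_fields):
    """Irreversible digest of developer identity fields (SHA-256 is an
    assumed choice). Keys are sorted so field order does not change it."""
    canonical = "\n".join(f"{k}={developer_fields[k]}"
                          for k in sorted(developer_fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def handle_tampered(sample_id, digest, sample_stack, normal_stack, review_queue):
    """Build an evidence log, flag feature storage as blocked, and queue the
    sample for manual review. All key names here are hypothetical."""
    evidence = {
        "sample_id": sample_id,
        "identity_digest": digest,
        # Technical stack difference points, in both directions.
        "only_in_sample": sorted(sample_stack - normal_stack),
        "only_in_normal": sorted(normal_stack - sample_stack),
        "store_features": False,         # interception of feature storage
        "label": "suspected_tampered",
    }
    review_queue.append(sample_id)       # secondary verification by a reviewer
    return json.dumps(evidence, sort_keys=True)


if __name__ == "__main__":
    digest = identity_digest({"publisher": "Example Corp", "cert_serial": "01ab"})
    queue = []
    print(handle_tampered("sample-1", digest,
                          {"msvc", "zlib", "upx"}, {"msvc", "zlib"}, queue))
```

Hashing a canonical, sorted serialization keeps the digest stable across extraction order; a production system would also need an unambiguous encoding for field values that contain separators.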

Description

Malicious code tampering and counterfeiting identification method, system, storage medium and electronic equipment

Technical Field

The present invention relates to the field of network security, and in particular to a method and a system for identifying tampered and counterfeit malicious code, a storage medium, and an electronic device.

Background

As network security threats grow more complex, the field of malware detection has long relied on automated feature extraction and whitelist mechanisms to distinguish normal samples from malicious samples. Existing detection systems build a malicious feature library by extracting static structural features such as sample code signatures, string patterns and behavior traces, and rely on a trusted whitelist database to exempt trusted software from detection.

In recent years, however, attack techniques have continued to evolve, and a large number of malicious samples produced by tampering with legitimate whitelisted samples have emerged. Such samples implant malicious functions into trusted software by means of code injection, logic modification or resource tampering, yet retain most of the normal features of the original software. Because a tampered sample is highly similar to the original legitimate sample in its static features, a conventional automated feature extraction mechanism that fails to recognize the tampering is prone to misjudge the normal features carried by the tampered sample as malicious features and store them. This pollutes the feature library: in subsequent detection, original upgrade packages, patches or homologous modules of the same software are wrongly identified as malicious files, which raises the false alarm rate and reduces the reliability of the detection system.

Currently, the industry lacks a mature technology that can effectively distinguish "independently developed malicious programs" from "malicious samples produced by tampering with legitimate software". Existing methods focus on single-dimension anomaly detection, such as behavior monitoring or structural variation, and do not systematically verify the authenticity of a sample's developer identity or the consistency of its technical environment. With the spread of software supply chain attacks, attackers increasingly implant malicious code under the shell of a trusted developer identity, which poses serious challenges to traditional whitelist maintenance and malicious sample tracing mechanisms. A technology is therefore needed that can judge the credibility and consistency of a sample before malicious features are stored, so as to block the pollution of the feature library by tampered samples at its root and improve the robustness and accuracy of the overall detection system.

Disclosure of Invention

To address the above technical problems, the invention adopts the following technical solution. According to one aspect of the present application, there is provided a malicious code tampering and counterfeiting identification method comprising the following steps:

performing static analysis on a sample to be tested, and extracting identity information and technical stack information of the sample to be tested, wherein the identity information represents the identity of the developer of the sample, and the technical stack information represents the build and runtime environment characteristics of the sample;
screening, from a preset normal sample library, a candidate normal sample set matching the identity of the sample to be tested, based on the identity information of the sample to be tested;
calculating the similarity between the technical stack information of the sample to be tested and the technical stack information of each sample in the candidate normal sample set;
if there exists a candidate normal sample for which the similarity is greater than a preset threshold, judging that the sample to be tested is a tampered sample and refusing to store its features in a malicious feature library; otherwise, allowing the features of the sample to be tested to be stored.

Further, the identity information is an identity digest, generated by an irreversible hash algorithm or a custom digest algorithm, that uniquely identifies the developer source of the sample.

Further, the technical stack information comprises at least one of: compiler type, linker configuration, system SDK version, runtime library, third-party dependency library, and build script parameters.

Further, the similarity is calculated using the Jaccard similarity formula: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, wherein $A$ is the technical stack information representing