CN-122027189-A - Static detection method for multi-type text files and script files

Abstract

The invention provides a static detection method for multi-type text files and script files, together with a storage medium and an electronic device. The method comprises: obtaining a file to be detected; mapping each character to a predefined character class to obtain a character class sequence; counting the runs of consecutive same-class characters in that sequence; grading each run by its number of consecutive occurrences to obtain a hierarchical coding sequence; combining the character class sequence with the hierarchical coding sequence to generate a structured fuzzified feature string for the file; matching that feature string against a feature library in a static detection engine; and judging the safety of the file from the matching result. Because feature extraction does not depend on the specific grammar or semantics of the file, different types of text and script files can be processed uniformly. The method is therefore highly general and has some resistance to common code deformation and simple obfuscation.
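
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the class identifiers (`A`, `D`, `S`, `O`), the function names, the choice of one unit feature per run, and the grading rule `floor(log2(n)) + 1` are all assumptions made for the example. The source specifies only that grading is logarithmic and that the classes include at least letters, digits, and whitespace.

```python
from itertools import groupby
from typing import Optional


def char_class(ch: str) -> Optional[str]:
    """Map one character to a class identifier (identifiers are hypothetical)."""
    if ch.isalpha():
        return "A"  # letters
    if ch.isdigit():
        return "D"  # digits
    if ch.isspace():
        return "S"  # whitespace
    if ch.isprintable():
        return "O"  # any other printable symbol
    return None     # unprintable control character: filtered out


def run_level(count: int) -> int:
    """Assumed logarithmic grading: floor(log2(count)) + 1 for count >= 1.

    Runs of length 1, 2-3, 4-7, 8-15 map to levels 1, 2, 3, 4.
    """
    return count.bit_length()


def fuzzy_feature(text: str) -> str:
    """Build the feature string: one '<class><level>' unit per run of same-class characters."""
    classes = [c for c in map(char_class, text) if c is not None]
    return "".join(
        f"{cls}{run_level(len(list(run)))}" for cls, run in groupby(classes)
    )


if __name__ == "__main__":
    print(fuzzy_feature("var x = 10;"))
    print(fuzzy_feature("foo y = 42;"))
```

Under these assumptions, `var x = 10;` and `foo y = 42;` both yield `A2S1A1S1O1S1D2O1`, which illustrates why renaming identifiers or changing literals of similar length leaves the feature string unchanged.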

Inventors

  • CHEN ENJUN
  • LI SHILEI
  • TONG ZHIMING
  • ZHAO CHAO
  • XIAO XINGUANG

Assignees

  • 安天科技集团股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-16

Claims (9)

  1. A static detection method for multi-type text files and script files, characterized by comprising the following steps: acquiring a file to be detected, wherein the file comprises a text file or a script file; scanning the content of the file character by character and mapping each character to a predefined character class to obtain a character class sequence, wherein the predefined character classes at least comprise letters, digits, and whitespace characters; counting consecutive characters of the same class in the character class sequence, and grading the resulting numbers of consecutive occurrences using a logarithmic hierarchical coding scheme to obtain a hierarchical coding sequence; combining the character class sequence with the hierarchical coding sequence to generate a structured fuzzified feature string of the file; and matching the structured fuzzified feature string against a feature library in a static detection engine, and judging the security of the file according to the matching result.
  2. The method of claim 1, wherein mapping each character to a predefined character class comprises: mapping printable symbols not covered by the predefined character classes to an "other symbols" class; and ignoring or filtering out unprintable control characters.
  3. The method of claim 1, wherein grading using the logarithmic hierarchical coding scheme comprises: presetting a plurality of intervals of consecutive occurrence counts, wherein each interval corresponds to one coding level; and mapping each number of consecutive occurrences to the coding level of the interval in which it falls.
  4. The method of claim 3, wherein the division of the intervals of consecutive occurrence counts is determined based on a logarithmic function, such that the coding level increases logarithmically with the number of consecutive occurrences.
  5. The method of claim 1, wherein generating the structured fuzzified feature string of the file comprises: combining the class identifier of each character with the hierarchical code identifier of its corresponding number of consecutive occurrences to form a plurality of unit features; and concatenating the plurality of unit features in character order to form the structured fuzzified feature string of the file.
  6. The method of claim 1, further comprising, before matching the structured fuzzified feature string against the feature library in the static detection engine: compressing or hashing the structured fuzzified feature string to generate a fixed-length feature digest.
  7. The method of claim 1, wherein the static detection engine is a static antivirus engine, and the feature library is a virus feature library in which structured fuzzified feature strings of known malicious files are stored.
  8. A non-transitory computer-readable storage medium having stored therein at least one instruction or at least one program, which is loaded and executed by a processor to implement the method of any one of claims 1-7.
  9. An electronic device comprising a processor and the non-transitory computer-readable storage medium of claim 8.
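
The digest and matching steps of claims 6 and 7 can be illustrated with a short Python sketch. It is a hedged example under stated assumptions: SHA-256 as the hash, a Python set of digests as the feature library, the function names, and the sample feature strings are all hypothetical and do not come from the patent, which says only that the feature string is compressed or hashed to a fixed-length digest and matched against a library of known malicious features.

```python
import hashlib
from typing import Iterable, Set


def feature_digest(feature: str) -> str:
    """Reduce a feature string to a fixed-length digest (SHA-256 is an assumption)."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


def build_library(known_malicious_features: Iterable[str]) -> Set[str]:
    """Hypothetical feature library: the set of digests of known malicious feature strings."""
    return {feature_digest(f) for f in known_malicious_features}


def is_suspicious(feature: str, library: Set[str]) -> bool:
    """Flag a file when the digest of its feature string appears in the library."""
    return feature_digest(feature) in library


if __name__ == "__main__":
    # Made-up feature strings, used only to exercise the lookup.
    library = build_library(["A2S1A1S1O1S1D2O1", "O1A3O1D4"])
    print(is_suspicious("A2S1A1S1O1S1D2O1", library))
    print(is_suspicious("A1S1A1", library))
```

In this sketch the digest comparison is exact, so any tolerance to code deformation has to come from the fuzzified feature string itself rather than from the hashing step.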

Description

Static detection method for multi-type text files and script files

Technical Field

The present invention relates to the field of network security, and in particular to a static detection method for multiple types of text files and script files, a storage medium, and an electronic device.

Background

In the field of network security, static detection is one of the important means of defending against malicious code threats. Its core is to identify potential risks by analyzing file contents rather than by actually running the file. Existing static detection methods mainly follow several technical paths.

One common approach is signature detection based on feature codes. The method relies on analyzing samples of known malicious code, extracting a unique byte sequence or string from each as a "fingerprint", and building a feature library. During scanning, threats are identified by comparing the file to be detected with the records in the feature library. However, feature codes readily fail in the face of code obfuscation, encryption, or simple deformation, resulting in missed detections.

To address the shortcomings of feature code detection, detection techniques based on semantic and program analysis have emerged. Such techniques parse source code to generate Abstract Syntax Trees (ASTs), or further build Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs), to understand the structure and semantics of the code in depth and thereby identify obfuscated or encoded malicious behavior. Some of these methods also apply normalization, converting the source code into a standardized intermediate representation (such as opcodes) to eliminate redundant intermediate steps and code deformation and expose malicious features, thereby improving detection accuracy.

Although these methods achieve higher detection accuracy, they are complex to implement, computationally expensive, and slow to run, which makes it difficult to meet the requirement of fast scanning of massive numbers of files. In addition, some studies have attempted detection based on sets of hash codes, or acceleration of feature matching using a finite state machine (e.g., a Ragel state machine). These methods can improve efficiency in certain scenarios, but the former is sensitive to subtle changes in code, while the latter still relies on a predefined feature library and is limited in generality and in coping with unknown threats.

In summary, when handling multiple types of text files and script files, existing static detection methods must trade off detection efficiency, generalization capability, and resistance to interference. Simple methods are efficient but easily bypassed; complex methods are robust but inefficient and cannot easily handle different file types in a unified way. There is therefore an urgent need for a solution that balances efficiency, generality, and robustness.

Disclosure of Invention

To address the above technical problems, the invention adopts the following technical solution.

According to one aspect of the present application, there is provided a static detection method for multi-type text files and script files, comprising the following steps:

  • Acquiring a file to be detected, wherein the file comprises a text file or a script file;
  • Scanning the content of the file character by character and mapping each character to a predefined character class to obtain a character class sequence, wherein the predefined character classes at least comprise letters, digits, and whitespace characters;
  • Counting consecutive characters of the same class in the character class sequence, and grading the resulting numbers of consecutive occurrences using a logarithmic hierarchical coding scheme to obtain a hierarchical coding sequence;
  • Combining the character class sequence with the hierarchical coding sequence to generate a structured fuzzified feature string of the file;
  • Matching the structured fuzzified feature string against a feature library in a static detection engine, and judging the security of the file according to the matching result.

Mapping each character to a predefined character class comprises: mapping printable symbols not covered by the predefined character classes to an "other symbols" class; and ignoring or filtering out unprintable control characters.

Grading using the logarithmic hierarchical coding scheme comprises: presetting a plurality of intervals of consecutive occurrence counts, wherein each interval corresponds to one coding level; and mapping each number of consecutive occurrences to the coding level of the interval in which it falls.

The division of the intervals of consecutive occurrence counts is determined based on a logarithmic function, so that the coding level increases logar