
CN-121706092-B - DirHash algorithm-based APK fingerprint association method and system


Abstract

The invention discloses an APK fingerprint association method and system based on the DirHash algorithm, belonging to the technical field of network security. The method comprises five steps: (1) receiving and decompressing an APK file to obtain its directory structure; (2) normalizing that directory structure by removing all file nodes, the assets directory and its subtree, and metadata information; (3) performing topological serialization coding on the normalized directory structure and outputting it as a JSON sequence; (4) computing a hash of the JSON sequence with a hash algorithm to obtain the DirHash fingerprint value; and (5) using the DirHash fingerprint value to complete fingerprint association and cluster analysis. This addresses the technical problem that traditional fingerprint methods fail against package name randomization, certificate rotation, resource fine-tuning, and similar attack techniques.
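Steps two to four can be illustrated with a minimal Python sketch. It assumes the directory structure is supplied as a list of ZIP entry paths and that the tree is serialized as nested JSON objects keyed by directory name; the text does not fix the JSON schema, so that layout and the names `normalize` and `dirhash` are illustrative assumptions rather than the patented implementation.

```python
import hashlib
import json


def normalize(entry_paths):
    """Build a directory-only tree (nested dicts) from ZIP entry paths.

    Assumptions: file nodes are dropped, the top-level assets/ subtree is
    dropped, and no metadata (timestamps, permissions, owner) is kept.
    """
    tree = {}
    for path in entry_paths:
        parts = [p for p in path.split("/") if p]
        # An entry ending in "/" is a directory; otherwise the last part is a file.
        dirs = parts if path.endswith("/") else parts[:-1]
        if dirs and dirs[0] == "assets":
            continue
        node = tree
        for name in dirs:
            node = node.setdefault(name, {})
    return tree


def dirhash(entry_paths):
    """Return the 40-character hex SHA-1 of the compact, key-sorted JSON tree."""
    tree = normalize(entry_paths)
    # json.dumps walks the tree depth-first; sort_keys orders child names at
    # every level, and the separators strip all spaces and line breaks.
    compact = json.dumps(tree, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return hashlib.sha1(compact.encode("utf-8")).hexdigest()
```

Under these assumptions, two APKs that differ only in file names, file contents, or anything under assets/ produce the same value, while adding or renaming a directory elsewhere changes it.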

Inventors

  • QI WEI
  • MA YONGXIAO
  • ZHANG RUIDONG
  • Tong Yongao

Assignees

  • 成都无糖信息技术有限公司

Dates

Publication Date
20260505
Application Date
20260214

Claims (5)

  1. An APK fingerprint association method based on the DirHash algorithm, characterized by comprising the following steps: step one, receiving and decompressing an APK file to obtain the directory structure of the APK file; step two, normalizing the directory structure obtained in step one, namely removing all file nodes, the assets directory and its subtree, and metadata information; step three, performing topological serialization coding on the normalized directory structure and completing JSON serialization output to obtain a JSON sequence; step four, computing a hash value of the JSON sequence obtained in step three using a hash algorithm to obtain the DirHash fingerprint value; step five, using the DirHash fingerprint value obtained in step four to complete fingerprint association and cluster analysis; wherein in step three, a depth-first traversal is performed first: A, starting from the root node; B, visiting the first child node; C, visiting the second child node; D, visiting the third child node; E, finishing the traversal; the direct child nodes of each directory node are then sorted in lexicographic order, the direct child nodes including the child nodes of the res directory and the child nodes of the lib directory; finally, JSON serialization output is performed, namely the JSON key names are arranged in alphabetical order, spaces and line feeds in the JSON sequence are removed, and the result is converted into a compact character string; in step four, the input character string is first converted into a UTF-8 byte sequence; the byte sequence is then processed with the SHA-1 algorithm, namely padding is added so that the length of the byte sequence is a multiple of 512 bits, the message is divided into 512-bit blocks, and 80 rounds of iterative operations are performed on each block to generate a 160-bit (20-byte) hash value; finally, the 20-byte binary string is converted into a 40-character hexadecimal string, wherein the message is the JSON character string obtained by serializing the directory structure; in step five, query logic is executed first: using a DirHash node as an intermediate bridge, all APK nodes connected to the same DirHash node are found, and paths associating dirhash_X with apk_hash_A and apk_hash_B are established to form a homologous group; all homologous groups are then counted and identified, namely the number of APK nodes connected to each DirHash node is counted and sorted in descending order, large-scale groups are identified, and the cluster analysis is completed.
  2. The APK fingerprint association method based on the DirHash algorithm, characterized in that in step one, an integrity check is first performed on the APK file: whether the file extension is .apk is verified, whether the file size meets the system requirement is checked, and whether the file is in ZIP format is confirmed by reading the magic number in the file header; the structure of a ZIP file that meets the requirements is then extracted: a ZIP decompression library is called, a temporary working directory is created, and the APK file is decompressed as a ZIP archive to obtain the complete list of decompressed files and directories; finally, a recursive traversal algorithm is used to build a tree data structure from the decompressed directory structure.
  3. The APK fingerprint association method based on the DirHash algorithm according to claim 1, wherein in step two, before any information is removed, node type identification is performed and the directory tree structure is read; selective filtering rules are then executed to remove all file nodes, the assets directory and its subtree, and metadata information from the directory tree structure; finally, the normalized directory tree structure is generated.
  4. The APK fingerprint association method based on the DirHash algorithm according to claim 3, wherein the file nodes include AndroidManifest.xml, classes.dex, resources.arsc, icon.png, and librousdesk.so; the assets directory comprises assets/ together with the config.json and fonts contained under it; and the metadata information includes the timestamp, permission attributes (i.e. rwx flags), and owner information of each directory.
  5. An APK fingerprint association system based on the DirHash algorithm, characterized by comprising the following modules: an APK analysis and preprocessing module, which receives the APK file to be analyzed, extracts the complete internal directory tree structure of the APK using ZIP decompression, and then constructs a tree data structure of directory nodes; a DirHash fingerprint generation module, which performs structure normalization on the directory tree, then performs topological serialization coding, and generates an interference-resistant hash fingerprint; and a module that stores and manages the DirHash fingerprint library, then performs homologous APK matching and association, and finally outputs a group clustering analysis result; wherein in the DirHash fingerprint generation module, the structure normalization of the directory tree proceeds as follows: the directory tree obtained by decompressing the APK is first traversed recursively to obtain the complete directory structure; all file nodes are then filtered out so that only directory node information is retained, the assets directory and its child nodes are deleted, and the metadata information of the directory nodes is removed; finally, a pure directory topology tree structure is constructed; the topological serialization coding proceeds as follows: the normalized directory topology tree is traversed using a depth-first strategy, the child nodes within each directory node are sorted by name in lexicographic order, the directory tree is serialized into a JSON-format character string, and finally the JSON key names are sorted alphabetically to eliminate the nondeterminism caused by unordered dictionaries; the generation of the interference-resistant hash fingerprint proceeds as follows: the JSON character string produced by the topological serialization coding is UTF-8 encoded, a message digest is then computed using the SHA-1 hash algorithm, and finally the 160-bit binary hash value is converted into a 40-character hexadecimal string, wherein the 40-character hexadecimal string is the DirHash fingerprint and the message digest is the 160-bit binary hash value; after the DirHash fingerprint generation module has generated the DirHash fingerprint of an APK under test, the fingerprint library is queried for the same DirHash value; if it exists, the APK is judged to belong to the same group; if it does not exist, a new group is created and the DirHash value is stored in the library; when the group clustering analysis result is output, all APKs with the same DirHash value are first classified into one group, the number of samples, the first occurrence time, and the active period of each group are counted, and finally a group association analysis report is output.
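The lookup-or-create logic and the per-group statistics described in the claims can be sketched as follows. This is a minimal in-memory illustration; the class `FingerprintLibrary`, its method names, and the sample identifiers and dates are hypothetical, and a real deployment might use a graph or relational store instead.

```python
from collections import defaultdict
from datetime import datetime


class FingerprintLibrary:
    """In-memory stand-in for the DirHash fingerprint library (hypothetical)."""

    def __init__(self):
        # DirHash value -> list of (apk_hash, time the sample was seen)
        self._groups = defaultdict(list)

    def add(self, dirhash_value, apk_hash, seen_at):
        """Store an APK under its DirHash; return True if the group already existed."""
        existed = dirhash_value in self._groups
        self._groups[dirhash_value].append((apk_hash, seen_at))
        return existed

    def report(self):
        """Per-group statistics, sorted so that the largest groups come first."""
        rows = []
        for value, members in self._groups.items():
            times = [seen_at for _, seen_at in members]
            rows.append({
                "dirhash": value,
                "samples": len(members),
                "first_seen": min(times),
                "active_period": max(times) - min(times),
            })
        rows.sort(key=lambda row: row["samples"], reverse=True)
        return rows


# Illustrative data only: two APKs share dirhash_X, one has dirhash_Y.
library = FingerprintLibrary()
library.add("dirhash_X", "apk_hash_A", datetime(2026, 1, 1))
library.add("dirhash_X", "apk_hash_B", datetime(2026, 1, 11))
library.add("dirhash_Y", "apk_hash_C", datetime(2026, 1, 5))
top = library.report()[0]  # the dirhash_X group: 2 samples over 10 days
```

Keying the store by DirHash value makes the DirHash node act as the bridge between APK samples, so grouping is a dictionary lookup rather than a pairwise comparison.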

Description

DirHash algorithm-based APK fingerprint association method and system

Technical Field

The invention belongs to the technical field of networks, and particularly relates to an APK fingerprint association method and system based on the DirHash algorithm.

Background

In mobile application security analysis and network anomaly detection, identifying homologous APKs (Android Packages) is a key technical requirement. The fingerprint methods currently used in the industry are mainly package name identification (extracted from AndroidManifest.xml, but easily randomized), certificate fingerprints, whole-package hashes (the SHA-256 of the APK file, which is highly sensitive to content changes), and application name and icon identification (easily disguised). In practice, however, abnormal APKs are distributed through many channels, iterate through variants frequently, and are heavily obfuscated and hardened. Attackers automate countermeasures such as package name randomization, certificate rotation, and resource fine-tuning, so that a single abnormal application family is scattered across tens of thousands of independent fingerprints in a sample library and clustering with the traditional methods fails.

Disclosure of Invention

To address the problems in the prior art that attack techniques such as package name randomization, certificate rotation, and resource fine-tuning are difficult to resist, and that code decompilation and dynamic execution are computationally expensive, the invention provides an APK fingerprint association method based on the DirHash algorithm, comprising the following steps: step one, receiving and decompressing an APK file to obtain the directory structure of the APK file; step two, normalizing the directory structure obtained in step one, namely removing all file nodes, the assets directory and its subtree, and metadata information; step three, performing topological serialization coding on the normalized directory structure and completing JSON serialization output to obtain a JSON sequence; step four, computing a hash value of the JSON sequence obtained in step three using a hash algorithm to obtain the DirHash fingerprint value; and step five, using the DirHash fingerprint value obtained in step four to complete fingerprint association and cluster analysis.

Preferably, in step one, an integrity check is first performed on the APK file: whether the file extension is .apk is verified, whether the file size meets the system requirement is checked, and whether the file is in ZIP format is confirmed by reading the magic number in the file header. The structure of the ZIP file is then extracted: a ZIP decompression library is called, a temporary working directory is created, and the APK file is decompressed as a ZIP archive to obtain the complete list of decompressed files and directories. Finally, a recursive traversal algorithm is used to build a tree data structure from the directory structure obtained after decompression.
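The checks and extraction in step one might look like the following Python sketch, using only the standard library. The size limit `MAX_APK_BYTES`, the node layout, and the function names are assumptions made for illustration; the text does not specify the size requirement or the tree representation.

```python
import os
import tempfile
import zipfile

MAX_APK_BYTES = 500 * 1024 * 1024  # hypothetical limit; the source gives none
ZIP_MAGIC = b"PK\x03\x04"          # local file header signature of a ZIP archive


def is_acceptable_apk(path):
    """Check the extension, the file size, and the ZIP magic number."""
    if not path.lower().endswith(".apk"):
        return False
    size = os.path.getsize(path)
    if size == 0 or size > MAX_APK_BYTES:
        return False
    with open(path, "rb") as handle:
        return handle.read(4) == ZIP_MAGIC


def build_tree(root):
    """Recursively build a node with name, type, and children for a directory."""
    node = {"name": os.path.basename(root), "type": "dir", "children": []}
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            node["children"].append(build_tree(entry.path))
        else:
            node["children"].append({"name": entry.name, "type": "file"})
    return node


def extract_directory_tree(apk_path):
    """Unpack the APK into a temporary working directory and return its tree."""
    if not is_acceptable_apk(apk_path):
        raise ValueError("not an acceptable APK: " + apk_path)
    with tempfile.TemporaryDirectory() as workdir:
        with zipfile.ZipFile(apk_path) as archive:
            archive.extractall(workdir)
        tree = build_tree(workdir)
    tree["name"] = ""  # the root stands for the APK itself, not the temp dir
    return tree
```

Because the samples are untrusted, a production version would probably also cap the total uncompressed size before extracting, or read the entry list with `ZipFile.namelist()` and skip writing files to disk altogether.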

Preferably, in step two, before any information is removed, node type identification is performed and the directory tree structure is read; selective filtering rules are then executed to remove all file nodes, the assets directory and its subtree, and metadata information from the directory tree; finally, the normalized directory tree structure is generated.

Preferably, the file nodes include AndroidManifest.xml, classes.dex, resources.arsc, icon.png, and librousdesk.so; the assets directory comprises assets/ together with the config.json and fonts contained under it; and the metadata information includes the timestamp, permission attributes (i.e. rwx flags), and owner information of each directory.

Preferably, in step three, a depth-first traversal is performed first: A, starting from the root node; B, visiting the first child node; C, visiting the second child node; D, visiting the third child node; E, finishing the traversal. Finally, JSON serialization output is performed: the JSON key names are arranged in alphabetical order, spaces and line feeds in the JSON sequence are removed, and the result is converted into a compact character string.

Preferably, in step four, the input character string is first converted into a UTF-8 byte sequence. The byte sequence is then processed with the SHA-1 algorithm: padding is added so that the length of the byte sequence is a multiple of 512 bits, the message is divided into 512-bit blocks, and 80 rounds of iterative operations are performed on each block to generate a 160-bit (20-byte) hash value. Finally, the 20-byte binary string is converted into a 40-character hexadecimal string, wherein the message is the JSON character string obtained by serializing the directory structure.

In the fifth step, query logic is executed first, all APK nodes connected to the same DirHash node are searched through Di