CN-121997917-A - File similarity detection method based on SimHash with fused keyword and key sentence extraction

Abstract

The invention belongs to the technical field of computer information processing and discloses a file similarity detection method based on SimHash with fused keyword and key sentence extraction. First, text preprocessing and dual-granularity feature extraction are performed to obtain weighted keywords and weighted key sentences. Second, a weight fusion strategy is applied: each keyword contributes its weight to the key sentences in which it occurs, according to its occurrence frequency, to form fusion weights. Optimized SimHash feature coding is then performed: 64-bit fingerprints are generated independently for the keywords and for the key sentences using grouped hash functions based on different seeds, and the two fingerprints are concatenated into a 128-bit final feature fingerprint. Finally, similarity is judged by calculating the Hamming distance between the fingerprints of different files. Through word-level and sentence-level dual-granularity feature fusion and grouped hash coding, the invention significantly improves detection precision and the robustness of the algorithm to synonym substitution, word order adjustment, and paragraph reorganization.

Inventors

LIU ZHI
TANG NING

Assignees

深圳市石犀科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260122

Claims (9)

1. A file similarity detection method based on SimHash with fused keyword and key sentence extraction, characterized by comprising the following steps: step S1, performing text preprocessing and dual-granularity feature extraction, namely performing unified encoding and noise filtering on a file to be detected to obtain a purified text, and extracting from the purified text keywords with their normalized weights and key sentences with their basic weights; step S2, performing weight fusion of the keywords and the key sentences, namely fusing the weights of the keywords into the weights of the corresponding key sentences according to the occurrence frequency of the keywords in the key sentences, and calculating the fusion weights of the key sentences; step S3, performing optimized SimHash feature coding: step S3.1, performing keyword feature coding, namely generating a keyword feature fingerprint from each keyword and its normalized weight using a grouped hash strategy based on a plurality of different hash seeds; step S3.2, performing key sentence feature coding, namely generating a key sentence feature fingerprint from each key sentence and its fusion weight using a grouped hash strategy based on a plurality of different hash seeds; step S3.3, performing dual-granularity feature fusion, namely concatenating the keyword feature fingerprint and the key sentence feature fingerprint to generate the final feature fingerprint of the file to be detected; and step S4, performing similarity calculation, namely judging the similarity of files by calculating the Hamming distance between the final feature fingerprints of different files.
2. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 1, characterized in that, in step S1, the keywords and their normalized weights are extracted by calculating the weight of each keyword using the TF-IDF algorithm and applying Min-Max normalization to the weights to obtain normalized weights in the interval [0,1]; and the key sentences and their basic weights are extracted by calculating the basic weight of each key sentence using the TextRank algorithm, the basic weights lying in the interval [0,1].
3. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 1, characterized in that the key sentence fusion weight in step S2 is calculated by a formula over the following quantities: the occurrence frequency of each keyword in the key sentence; the normalized weight of that keyword; a keyword contribution adjustment coefficient, which is a configurable hyperparameter used to control how strongly the keywords enhance the weight of the key sentence, with a value range of 0.1 to 0.3; and the basic weight of the key sentence.
4. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 1, characterized in that the specific steps of generating the keyword feature fingerprint in step S3 include: step a, for each keyword, using a MurmurHash function with four different first seeds to generate four 64-bit first intermediate hash values, one for each seed; step b, extracting the high 16 bits of each first intermediate hash value and concatenating the extracted fragments in sequence to form a temporary hash representation of the keyword; step c, performing weighted accumulation on the vectors formed by the temporary hash representations of all keywords, according to the normalized weight of each keyword, to obtain a first accumulated vector; step d, binarizing the first accumulated vector to obtain the 64-bit keyword feature fingerprint.
5. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 4, characterized in that the specific steps of generating the key sentence feature fingerprint in step S3 include: step a, for each key sentence, using a CityHash function with four different second seeds to generate four 64-bit second intermediate hash values, one for each seed; step b, extracting the high 16 bits of each second intermediate hash value and concatenating the extracted fragments in sequence to form a temporary hash representation of the key sentence; step c, performing weighted accumulation on the vectors formed by the temporary hash representations of all key sentences, according to the fusion weight of each key sentence, to obtain a second accumulated vector; step d, binarizing the second accumulated vector to obtain the 64-bit key sentence feature fingerprint.
6. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 5, characterized in that the final feature fingerprint is obtained by directly concatenating, in sequence, the keyword feature fingerprint and the key sentence feature fingerprint to form a 128-bit dual-granularity fused final feature fingerprint.
7. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 1, characterized by further comprising a weight proportion configuration step of assigning different fusion proportion coefficients to the keyword features and the key sentence features before the feature coding of step S3 is performed, wherein, in step S3.1 and step S3.2, the normalized weights of the keywords and the fusion weights of the key sentences are respectively adjusted by the corresponding fusion proportion coefficients during weighted accumulation.
8. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 7, characterized in that the fusion proportion coefficients are configured according to text type, wherein the fusion proportion coefficient of the keyword features is increased for technical documents, and the fusion proportion coefficient of the key sentence features is increased for short text summaries.
9. The file similarity detection method based on SimHash with fused keyword and key sentence extraction according to claim 1, characterized in that the unified encoding of the file to be detected consists of converting the file to UTF-8 encoding and removing the BOM mark.
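
The encoding steps of claims 4 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. Python's standard library has no MurmurHash or CityHash, so `hashlib.blake2b` with a seed-derived salt stands in for both seeded hash functions. The seed values, the `tag` argument, and the sign-based binarization rule (a positive accumulated value gives bit 1, as in conventional SimHash) are assumptions that the claims do not specify.

```python
import hashlib

# Hypothetical seed values; the claims only require four different seeds per granularity.
KEYWORD_SEEDS = (11, 23, 37, 53)
SENTENCE_SEEDS = (101, 211, 307, 401)


def seeded_hash64(text: str, seed: int, tag: bytes) -> int:
    """Seeded 64-bit hash. BLAKE2b is a stand-in for MurmurHash/CityHash (assumption)."""
    digest = hashlib.blake2b(
        text.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
        person=tag,
    ).digest()
    return int.from_bytes(digest, "big")


def grouped_hash64(text: str, seeds, tag: bytes) -> int:
    """Steps a-b: one hash per seed, keep the high 16 bits of each, concatenate in order."""
    value = 0
    for seed in seeds:
        high16 = seeded_hash64(text, seed, tag) >> 48
        value = (value << 16) | high16
    return value


def simhash64(weighted_items, seeds, tag: bytes) -> int:
    """Steps c-d: weighted accumulation per bit position, then binarization."""
    acc = [0.0] * 64
    for text, weight in weighted_items:
        h = grouped_hash64(text, seeds, tag)
        for bit in range(64):
            acc[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(64):
        if acc[bit] > 0:  # assumed rule: positive sum -> 1, otherwise 0
            fingerprint |= 1 << bit
    return fingerprint


def fingerprint128(keywords, key_sentences) -> int:
    """Claim 6: keyword fingerprint in the high 64 bits, key sentence fingerprint in the low 64."""
    kw_fp = simhash64(keywords, KEYWORD_SEEDS, b"kw")
    ks_fp = simhash64(key_sentences, SENTENCE_SEEDS, b"sent")
    return (kw_fp << 64) | ks_fp


def hamming_distance(a: int, b: int) -> int:
    """Step S4: number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc_a = fingerprint128(
    [("simhash", 1.0), ("fingerprint", 0.6)],
    [("SimHash maps a document to a compact fingerprint.", 0.9)],
)
doc_b = fingerprint128(
    [("simhash", 0.8), ("hamming", 0.7), ("fingerprint", 0.4)],
    [("Fingerprints are compared by Hamming distance.", 0.9)],
)
print(hamming_distance(doc_a, doc_b))
```

A smaller distance indicates more similar files. The decision threshold is not given in the claims and would need to be chosen for the application.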

Description

File similarity detection method based on SimHash with fused keyword and key sentence extraction

Technical Field

The invention relates to the technical field of computer information processing, and in particular to a file similarity detection method based on SimHash with fused keyword and key sentence extraction.

Background

File similarity detection is widely applied in fields such as plagiarism detection, news aggregation, and data deduplication. Traditional methods fall into three types: (1) methods based on exact hashing (such as MD5), which are sensitive to slight changes in the text and cannot perform fuzzy matching; (2) methods based on locality-sensitive hashing (such as SimHash), which can perform fuzzy matching but usually depend on a single granularity (such as word frequency), are easily disturbed by noise, and are not robust enough to meaning-preserving rewrites such as paragraph reorganization and word order adjustment; and (3) methods based on deep semantic models (such as BERT), which are computationally expensive and difficult to apply to real-time comparison of massive numbers of files.

Existing SimHash techniques generally apply hash mapping directly to the word frequency vector obtained after word segmentation, ignoring the syntactic structure and important semantic units (key sentences) of a document. As a result, detection accuracy drops significantly when the core semantics are preserved but sentence patterns are transformed or paragraphs are merged or split. In addition, a conventional single hash function tends to produce a biased feature distribution, which affects the stability of the similarity judgment.

Disclosure of Invention

To overcome the technical defects of the prior art, the invention provides a file similarity detection method based on SimHash with fused keyword and key sentence extraction, comprising the following steps: step S1, performing text preprocessing and dual-granularity feature extraction, namely performing unified encoding and noise filtering on a file to be detected to obtain a purified text, and extracting from the purified text keywords with their normalized weights and key sentences with their basic weights; step S2, performing weight fusion of the keywords and the key sentences, namely fusing the weights of the keywords into the weights of the corresponding key sentences according to the occurrence frequency of the keywords in the key sentences, and calculating the fusion weights of the key sentences; step S3, performing optimized SimHash feature coding: step S3.1, performing keyword feature coding, namely generating a keyword feature fingerprint from each keyword and its normalized weight using a grouped hash strategy based on a plurality of different hash seeds; step S3.2, performing key sentence feature coding, namely generating a key sentence feature fingerprint from each key sentence and its fusion weight using a grouped hash strategy based on a plurality of different hash seeds; step S3.3, performing dual-granularity feature fusion, namely concatenating the keyword feature fingerprint and the key sentence feature fingerprint to generate the final feature fingerprint of the file to be detected; and step S4, performing similarity calculation, namely judging the similarity of files by calculating the Hamming distance between the final feature fingerprints of different files.
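
The weight handling in steps S1 and S2 can be illustrated with a short Python sketch. Min-Max normalization is as named in claim 2. The fusion formula itself is not reproduced in this text, so the additive form used below (basic weight plus the adjustment coefficient times the sum of occurrence frequency times normalized keyword weight) is only an assumed form built from the quantities the claims list. The function names and the substring-based occurrence count are likewise hypothetical.

```python
def min_max_normalize(weights: dict) -> dict:
    """Scale raw keyword weights (e.g. TF-IDF scores) into the interval [0, 1]."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # Degenerate case: all weights equal. Mapping to 1.0 is an assumption.
        return {k: 1.0 for k in weights}
    return {k: (w - lo) / (hi - lo) for k, w in weights.items()}


def fused_sentence_weight(sentence: str, base_weight: float,
                          keyword_weights: dict, alpha: float = 0.2) -> float:
    """ASSUMED additive fusion: base + alpha * sum(frequency * keyword weight)."""
    if not 0.1 <= alpha <= 0.3:
        raise ValueError("alpha is expected to lie in the range 0.1 to 0.3")
    boost = sum(sentence.count(k) * w for k, w in keyword_weights.items())
    return base_weight + alpha * boost


raw_scores = {"simhash": 3.0, "fingerprint": 2.0, "file": 1.0}
normalized = min_max_normalize(raw_scores)
sentence = "simhash builds a fingerprint and simhash compares fingerprint bits"
print(fused_sentence_weight(sentence, base_weight=0.4, keyword_weights=normalized))
```

Under this assumed form, a key sentence that contains many high-weight keywords ends up with a larger fusion weight than its TextRank basic weight alone, which is the enhancement effect that the adjustment coefficient is described as controlling.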

Preferably, in step S1, the keywords and their normalized weights are extracted as follows: the weight of each keyword is calculated using the TF-IDF algorithm, and Min-Max normalization is applied to the weights to obtain normalized weights in the interval [0,1]. The key sentences and their basic weights are extracted as follows: the basic weight of each key sentence is calculated using the TextRank algorithm, the basic weights lying in the interval [0,1].

Preferably, the key sentence fusion weight in step S2 is calculated by a formula over the following quantities: the occurrence frequency of each keyword in the key sentence; the normalized weight of that keyword; a keyword contribution adjustment coefficient, which is a configurable hyperparameter used to control how strongly the keywords enhance the weight of the key sentence, with a value range of 0.1 to 0.3; and the basic weight of the key sentence.

Preferably, the specific steps of generating the keyword feature fingerprint in step S3 include: step a, for each keyword, using a MurmurHash function with four different first seeds to generate four 64-bit first intermediate hash values, one for each seed; step b, extracting the high 16 bits of each first intermediate hash value and concatenating the extracted fragments in sequence to form a temporary hash representation of the keyword; step c, performing weighted accumulation on