CN-115617940-B - Document detection method and device and storage medium

Abstract

Embodiments of the application provide a document detection method, a document detection device, and a storage medium. The method comprises: extracting at least one text feature from a target document, and determining at least one weight value corresponding to the at least one text feature according to the number of times the at least one text feature occurs in the target document; when a target text feature is found in a preset text feature library, determining a first weight value corresponding to the target text feature from the at least one weight value, and acquiring, from the preset text feature library, a second weight value and a target level corresponding to the target text feature; processing the first weight value and the second weight value to obtain a first value; and, when the first value is greater than a first preset value, performing response processing on the target document based on a response scheme corresponding to the target level. This technical solution improves the real-time performance of event detection.

Inventors

  • WANG XIAOJIN
  • JIN WENBIN
  • SHENG YONGFU
  • YAO YANAN

Assignees

  • 中移(杭州)信息技术有限公司
  • 中国移动通信集团有限公司

Dates

Publication Date
20260508
Application Date
20210713

Claims (9)

  1. A document detection method, the method comprising: extracting at least one text feature from a target document, and determining at least one weight value corresponding to the at least one text feature according to the number of times the at least one text feature appears in the target document; acquiring and analyzing a second document to obtain a group of weight values corresponding to text features in the second document; determining a similarity between the target document and the second document according to the at least one weight value and the group of weight values; when the similarity is greater than a preset similarity threshold, prohibiting document detection of the second document, wherein the second document is a document whose preset judgment standard value is smaller than that of the target document; when a target text feature is found in a preset text feature library, determining a first weight value corresponding to the target text feature from the at least one weight value, and acquiring, from the preset text feature library, a second weight value corresponding to the target text feature and a target level corresponding to the target text feature; processing the first weight value and the second weight value to obtain a first value; and when the first value is greater than a first preset value, responding to the target document based on a response scheme corresponding to the target level.
  2. The method of claim 1, wherein before extracting the at least one text feature from the target document, the method further comprises: acquiring and analyzing a history document to obtain at least one initial feature in the history document, and determining at least one initial weight value corresponding to the at least one initial feature according to the number of times the at least one initial feature occurs in the history document; and generating the preset text feature library according to the at least one initial feature and the at least one initial weight value.
  3. The method of claim 2, wherein generating the preset text feature library according to the at least one initial feature and the at least one initial weight value comprises: sequentially comparing the at least one initial weight value with a second preset value; when a first weight value among the at least one initial weight value is greater than the second preset value, determining, from the at least one initial feature, a first initial feature corresponding to the first weight value; and generating the preset text feature library according to the first initial feature and the first weight value.
  4. The method of claim 1, wherein determining the similarity between the target document and the second document according to the at least one weight value and the group of weight values comprises: inputting the at least one weight value and the group of weight values, respectively, into a preset hash function to obtain a first group of hash values and a second group of hash values; determining a first group of real vectors from the first group of hash values and a second group of real vectors from the second group of hash values; determining a first semantic fingerprint corresponding to the target document according to the first group of real vectors, and determining a second semantic fingerprint corresponding to the second document according to the second group of real vectors; determining a Hamming distance between the target document and the second document according to the first semantic fingerprint and the second semantic fingerprint; and when the Hamming distance is less than or equal to a third preset value, determining that the similarity is greater than the preset similarity threshold.
  5. The method according to claim 1, wherein after responding to the target document based on the response scheme corresponding to the target level when the first value is greater than the first preset value, the method further comprises: replacing the second weight value in the preset text feature library with the first weight value according to a preset replacement rule, so as to update the preset text feature library.
  6. The method according to claim 1, wherein after responding to the target document based on the response scheme corresponding to the target level when the first value is greater than the first preset value, the method further comprises: finding, among the at least one text feature, a first text feature that is different from the target text feature, and acquiring a third weight value corresponding to the first text feature from the at least one weight value; and adding the first text feature and the third weight value to the preset text feature library in correspondence with the target level, so as to update the preset text feature library.
  7. A document detection apparatus, the apparatus comprising: a determining unit, configured to extract at least one text feature from a target document, and determine at least one weight value corresponding to the at least one text feature according to the number of times the at least one text feature occurs in the target document; the determining unit being further configured to: acquire and analyze a second document to obtain a group of weight values corresponding to text features in the second document; determine a similarity between the target document and the second document according to the at least one weight value and the group of weight values; prohibit document detection of the second document when the similarity is greater than a preset similarity threshold, wherein the second document is a document whose preset judgment standard value is smaller than that of the target document; and, when a target text feature is found in a preset text feature library, determine a first weight value corresponding to the target text feature from the at least one weight value, and acquire, from the preset text feature library, a second weight value corresponding to the target text feature and a target level corresponding to the target text feature, wherein the preset text feature library stores preset text features together with the preset weight value and the preset level corresponding to each preset text feature; a data processing unit, configured to process the first weight value and the second weight value to obtain a first value; and a response unit, configured to respond to the target document based on the response scheme corresponding to the target level when the first value is greater than a first preset value.
  8. A document detection apparatus comprising a processor, a memory and a communication bus, wherein the processor implements the method of any of claims 1-6 when executing an operating program stored in the memory.
  9. A storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-6.
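
The fingerprint comparison recited in claim 4 resembles the conventional SimHash construction. The following is a minimal sketch under that assumption, not the patented implementation: it hashes each feature and scales the resulting bits by the feature's weight, whereas the claim speaks of inputting weight values into a preset hash function, so the mapping is approximate. The function names, the choice of SHA-256, the 64-bit fingerprint width, and the default value of `third_preset_value` are all hypothetical.

```python
import hashlib


def simhash(weights: dict[str, float], bits: int = 64) -> int:
    """Build a SimHash-style fingerprint from {feature: weight} (assumed scheme)."""
    totals = [0.0] * bits  # per-bit real-valued vector accumulated over all features
    for feature, weight in weights.items():
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")  # bits must be a multiple of 8
        for i in range(bits):
            # A set bit adds the weight, a clear bit subtracts it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(w1, w2, third_preset_value: int = 3, bits: int = 64) -> bool:
    """Treat two documents as similar when their fingerprints are close."""
    distance = hamming_distance(simhash(w1, bits), simhash(w2, bits))
    return distance <= third_preset_value
```

In this sketch, a small Hamming distance stands in for "similarity greater than the preset similarity threshold", mirroring the last step of claim 4.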

Description

Document detection method and device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a document detection method and device, and a storage medium.

Background

In the prior art, when an event occurs, a target hot event is identified by means of a deterministic frequent-item monitoring algorithm and a similarity clustering algorithm over different indicators, and countermeasures are prepared once the event has been identified. With this approach, however, whether an event is a target hot event can only be judged from indicators such as comments and forwards that accumulate after the event occurs, so the event cannot be detected immediately after it occurs. The real-time performance of event detection in the prior art is therefore low.

Disclosure of Invention

Embodiments of the application provide a document detection method and device, and a storage medium, which can improve the real-time performance of event detection.
The technical solution of the application is realized as follows.

In a first aspect, an embodiment of the present application provides a document detection method, the method including: extracting at least one text feature from a target document, and determining at least one weight value corresponding to the at least one text feature according to the number of times the at least one text feature appears in the target document; when a target text feature is found in a preset text feature library, determining a first weight value corresponding to the target text feature from the at least one weight value, and acquiring, from the preset text feature library, a second weight value corresponding to the target text feature and a target level corresponding to the target text feature; processing the first weight value and the second weight value to obtain a first value; and when the first value is greater than a first preset value, responding to the target document based on a response scheme corresponding to the target level.

In the above document detection method, before extracting the at least one text feature from the target document, the method further includes: acquiring and analyzing a history document to obtain at least one initial feature in the history document, and determining at least one initial weight value corresponding to the at least one initial feature according to the number of times the at least one initial feature occurs in the history document; and generating the preset text feature library according to the at least one initial feature and the at least one initial weight value.
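
The core steps of the first aspect (weighting features by occurrence count, looking them up in the preset library, and thresholding a combined value) can be sketched as follows. This is an illustrative sketch only. The text does not say how features are extracted, how occurrence counts become weights, or how the first and second weight values are "processed", so the whitespace tokenizer, the relative-frequency weighting, and the product used here are assumptions, as are the names `LibraryEntry`, `feature_weights`, and `detect`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LibraryEntry:
    weight: float  # the "second weight value" stored in the preset library
    level: int     # the "target level" that selects a response scheme


def feature_weights(text: str) -> dict[str, float]:
    """Weight each token by its relative frequency (an assumed weighting)."""
    tokens = text.lower().split()  # assumed feature extraction: whitespace tokens
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()} if total else {}


def detect(text: str, library: dict[str, LibraryEntry], first_preset_value: float):
    """Return (feature, level) pairs whose combined value exceeds the threshold."""
    hits = []
    for feature, first_weight in feature_weights(text).items():
        entry = library.get(feature)
        if entry is None:
            continue  # feature is not in the preset text feature library
        first_value = first_weight * entry.weight  # assumed "processing": a product
        if first_value > first_preset_value:
            hits.append((feature, entry.level))
    return hits
```

A caller would map each returned level to its response scheme. For example, with a library containing `"outage"` at weight 0.8 and level 2, the text `"network outage outage reported"` gives a first weight of 0.5 and a first value of 0.4 in this sketch.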
In the above document detection method, generating the preset text feature library according to the at least one initial feature and the at least one initial weight value includes: sequentially comparing the at least one initial weight value with a second preset value; when a first weight value among the at least one initial weight value is greater than the second preset value, determining, from the at least one initial feature, a first initial feature corresponding to the first weight value; and generating the preset text feature library according to the first initial feature and the first weight value.

In the above document detection method, after determining the at least one weight value corresponding to the at least one text feature according to the number of times the at least one text feature occurs in the target document, and before determining the first weight value corresponding to the target text feature from the at least one weight value when the target text feature is found in the preset text feature library, the method further includes: acquiring and analyzing a second document to obtain a group of weight values corresponding to text features in the second document; determining a similarity between the target document and the second document according to the at least one weight value and the group of weight values; and when the similarity is greater than a preset similarity threshold, prohibiting document detection of the second document, wherein the second document is a document whose preset judgment standard value is smaller than that of the target document.
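
The library-generation steps (weight the initial features found in history documents, then keep only those whose weight exceeds the second preset value) might look like the sketch below. The relative-frequency weighting, the whitespace tokenizer, and the `default_level` parameter are assumptions made for illustration; the text does not state how a level is assigned to each feature when the library is built.

```python
from collections import Counter


def build_feature_library(history_docs, second_preset_value, default_level=1):
    """Keep only features whose initial weight exceeds the second preset value."""
    tokens = [tok for doc in history_docs for tok in doc.lower().split()]
    counts = Counter(tokens)
    total = len(tokens)
    library = {}
    for feature, n in counts.items():
        initial_weight = n / total  # assumed weighting: relative frequency
        if initial_weight > second_preset_value:
            # default_level is a hypothetical placeholder for the stored level.
            library[feature] = {"weight": initial_weight, "level": default_level}
    return library
```

Filtering at build time keeps rare features out of the library, so later lookups only match features that were prominent in the history documents.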
In the above document detection method, determining the similarity between the target document and the second document according to the at least one weight value and the group of weight values includes: inputting the at least one weight value and the group of weight values, respectively, into a preset hash function to obtain a first group of hash values and a second group of hash values; determining a first group of real vectors from the first group of hash values and a second group of real vectors from the second group of hash values; determining a first semantic fingerprint corresponding to the target document according to the first group of real vectors, and determining a second semantic fingerprint