CN-121998079-A - Text propagation traceability analysis method and device, electronic equipment and storage medium

Abstract

The application discloses a traceability analysis method and device for text propagation, an electronic device, and a storage medium, and relates to the fields of natural language processing and big data analysis. A composite fingerprint is generated from multidimensional features of streaming text data, and similarity matching and clustering are performed based on the composite fingerprint and a dynamic propagation graph to obtain a clustering result and update the dynamic propagation graph, wherein the dynamic propagation graph comprises user nodes and content cluster nodes. Propagation features of abnormal content cluster nodes are extracted, and a propagation mode is determined from those features. Based on the propagation mode and the updated dynamic propagation graph, the propagation source node is located through reverse traversal to generate a tracing result and a response instruction. Applying this technical scheme can improve the efficiency and accuracy of streaming text clustering and enables fast recognition of the propagation mode and efficient location of the propagation source.

Inventors

  • HUANG DESHEN
  • CAO YONGCHAO
  • Tang Laixian
  • Chen Zanwang
  • LIN XIAOSHENG
  • ZENG LINGFENG
  • LI YIYONG
  • Zhan Chengzong
  • YU HAOYANG
  • YI JIALIANG
  • HAO LIBO
  • ZHANG JIACHENG
  • LIU TEWEI
  • DU QI
  • DAI JINGJING
  • He Sihang
  • LI CHENCAI
  • Hu Sidie

Assignees

  • 中移互联网有限公司 (China Mobile Internet Co., Ltd.)
  • 中国移动通信集团有限公司 (China Mobile Communications Group Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-22

Claims (17)

  1. A method for traceability analysis of text propagation, comprising: acquiring streaming text data; extracting multidimensional features of the streaming text data, wherein the multidimensional features comprise statistical features, semantic features and spatiotemporal features; generating a composite fingerprint based on the multidimensional features; performing similarity matching and clustering based on the composite fingerprint and a dynamic propagation graph to obtain a clustering result; updating the dynamic propagation graph according to the clustering result; extracting propagation features of abnormal content cluster nodes among the content cluster nodes, and determining a propagation mode according to the propagation features; and locating a propagation source node through reverse traversal based on the propagation mode and the updated dynamic propagation graph, so as to generate a tracing result and a response instruction.
  2. The method of claim 1, wherein the acquiring streaming text data comprises: acquiring text data tuples carrying propagation context information to obtain the streaming text data; wherein the propagation context information includes text content, a timestamp, a sender identifier, a recipient identifier, and attribution (home location) information.
  3. The method of claim 1, further comprising, prior to extracting the multidimensional features of the streaming text data: cleaning text content in the streaming text data; and identifying functional blocks of the cleaned streaming text data based on a lightweight conditional random field model, wherein the functional blocks at least comprise titles, body text, and emphasized text.
  4. The method of claim 3, wherein the extracting multidimensional features of the streaming text data comprises: determining a base weight based on the type of the functional block; calculating an authority degree according to historical propagation data; calculating a block weight according to the authority degree and the base weight; performing word segmentation on the cleaned streaming text data, dynamically discovering new words based on mutual information and left and right entropy within a sliding window, and updating a temporary dictionary to obtain candidate words; and extracting statistical features, semantic features and spatiotemporal features of the candidate words based on the block weights.
  5. The method of claim 1, wherein the generating a composite fingerprint based on the multidimensional features comprises: determining a semantic fingerprint according to the semantic features; determining a spatiotemporal fingerprint according to the spatiotemporal features; and generating the composite fingerprint from the semantic fingerprint and the spatiotemporal fingerprint.
  6. The method of claim 5, wherein the determining a semantic fingerprint according to the semantic features comprises: screening core semantic information from the semantic features; concatenating the core semantic information into a character string in lexicographic order; performing a hash operation on the character string to obtain an operation result; and determining the semantic fingerprint according to the operation result.
  7. The method of claim 5, wherein the determining a spatiotemporal fingerprint according to the spatiotemporal features comprises: generating a time period code according to the timestamp of the streaming text data; generating a geographic hash character string based on attribution information of the streaming text data; performing a hash operation on the geographic hash character string to obtain a region code; and determining the spatiotemporal fingerprint according to the time period code and the region code.
  8. The method of claim 5, wherein the generating the composite fingerprint from the semantic fingerprint and the spatiotemporal fingerprint comprises: performing expansion processing on the spatiotemporal fingerprint to obtain an extended spatiotemporal fingerprint; and performing a bit operation on the extended spatiotemporal fingerprint and the semantic fingerprint to obtain the composite fingerprint.
  9. The method of any of claims 5-8, wherein the performing similarity matching and clustering based on the composite fingerprint and the dynamic propagation graph comprises: constructing a hash index based on the semantic fingerprint; determining candidate content clusters through the hash index; calculating a basic similarity between the candidate content clusters and the streaming text data, wherein the basic similarity comprises a semantic similarity and a spatiotemporal similarity, the semantic similarity is calculated based on the Hamming distance of the semantic fingerprints, and the spatiotemporal similarity is calculated based on the consistency of the spatiotemporal fingerprints; calculating a comprehensive similarity according to the basic similarity; in response to the comprehensive similarity of all candidate content clusters being greater than a preset threshold, classifying the streaming text data into the corresponding target candidate content cluster to obtain the clustering result; and in response to the comprehensive similarity of all the candidate content clusters being less than or equal to the preset threshold, creating a new content cluster to obtain the clustering result.
  10. The method of claim 9, wherein the edges of the dynamic propagation graph are directed edges pointing from user nodes to content cluster nodes, and wherein the updating the dynamic propagation graph according to the clustering result comprises: in response to the clustering result being the creation of a new content cluster, creating a first content cluster node in the dynamic propagation graph, creating an edge from the user node to the first content cluster node in the dynamic propagation graph, and initializing the heat of the first content cluster node; and in response to the clustering result being classification into the corresponding target candidate content cluster, creating an edge from the user node to a second content cluster node in the dynamic propagation graph, and increasing the heat of the second content cluster node, wherein the second content cluster node is the content cluster node corresponding to the target candidate content cluster.
  11. The method of claim 1, wherein the extracting propagation features of abnormal content cluster nodes among the content cluster nodes comprises: determining a burst coefficient, a maximum out-degree, a propagation depth and a propagation entropy according to the abnormal content cluster node; and determining the burst coefficient, the maximum out-degree, the propagation depth and the propagation entropy as the propagation features; wherein the burst coefficient is determined according to the timestamps of the edges associated with the abnormal content cluster node; the maximum out-degree is determined according to the out-degrees of the user nodes associated with the abnormal content cluster node, the out-degree being the total number of content cluster nodes pointed to by a single user node; the propagation depth is determined according to the number of edges contained in a target propagation path, the target propagation path being the longest propagation path starting from the abnormal content cluster node; and the propagation entropy is determined according to the proportion of the number of propagations by a single user node in the abnormal content cluster node to the total number of propagations by all user nodes in the abnormal content cluster node.
  12. The method of claim 11, wherein the determining a propagation mode according to the propagation features comprises: determining the propagation mode as viral propagation in response to the burst coefficient, the propagation depth, and the maximum out-degree meeting a first condition; determining the propagation mode as star propagation in response to the burst coefficient, the propagation entropy, and the maximum out-degree meeting a second condition; and determining the propagation mode as chain propagation in response to the burst coefficient, the maximum out-degree, and the propagation depth meeting a third condition.
  13. The method of claim 12, wherein the locating a propagation source node through reverse traversal comprises: generating a propagation geographic heat map and a propagation timeline of the abnormal content cluster node, wherein the propagation geographic heat map represents the propagation distribution of the streaming text data across different regions, and the propagation timeline represents the propagation sequence of the streaming text data across different time periods; in the updated dynamic propagation graph, performing a reverse breadth-first search starting from the abnormal content cluster node to determine the propagation source node, wherein the propagation source node is the user node that was associated with the abnormal content cluster node earliest; generating a response instruction according to the propagation mode; and integrating the propagation source node, the propagation geographic heat map, the propagation timeline and the response instruction to obtain the tracing result.
  14. The method of claim 13, wherein the generating a response instruction according to the propagation mode comprises: generating a rate-limiting instruction and a high-level alert instruction in response to the propagation mode being the viral propagation; generating a core node influence analysis report generation instruction in response to the propagation mode being the star propagation; and generating a complete propagation path drawing instruction in response to the propagation mode being the chain propagation.
  15. A traceability analysis device for text propagation, comprising: a data processing module configured to acquire streaming text data and extract multidimensional features of the streaming text data, wherein the multidimensional features comprise statistical features, semantic features and spatiotemporal features; a fingerprint generation module configured to generate a composite fingerprint based on the multidimensional features; an analysis module configured to perform similarity matching and clustering based on the composite fingerprint and a dynamic propagation graph to obtain a clustering result, wherein the dynamic propagation graph comprises user nodes and content cluster nodes, and to extract propagation features of abnormal content cluster nodes among the content cluster nodes and determine a propagation mode according to the propagation features; and an output module configured to locate a propagation source node through reverse traversal based on the propagation mode and the updated dynamic propagation graph, so as to generate a tracing result and a response instruction.
  16. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.
  17. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-14.
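
The fingerprint construction of claims 5 to 8 and the similarity calculation of claim 9 can be illustrated with a minimal Python sketch. The claims do not fix the hash function, the bit widths, the time-period length, the expansion step, the bit operation, or the similarity weights, so the BLAKE2b hash, the 64-bit and 32-bit widths, the hourly time bucket, the left-shift-and-OR combination, and the 0.7/0.3 weighting below are all assumptions made for illustration, as are all function names.

```python
import hashlib

SEM_BITS = 64  # assumed width of the semantic fingerprint
ST_BITS = 32   # assumed width of the spatiotemporal fingerprint


def _hash_bits(text: str, bits: int) -> int:
    """Hash a string to an unsigned integer of the given bit width (assumed: BLAKE2b)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=bits // 8).digest()
    return int.from_bytes(digest, "big")


def semantic_fingerprint(core_terms: set[str]) -> int:
    """Claim 6: sort core terms lexicographically, join them, hash the string."""
    joined = "|".join(sorted(core_terms))
    return _hash_bits(joined, SEM_BITS)


def spatiotemporal_fingerprint(timestamp: int, geohash: str) -> int:
    """Claim 7: combine a time-period code with a hashed region code."""
    period_code = (timestamp // 3600) & 0xFFFF  # hourly bucket, low 16 bits (assumed)
    region_code = _hash_bits(geohash, 16)       # 16-bit region code (assumed)
    return (period_code << 16) | region_code


def composite_fingerprint(semantic: int, spatiotemporal: int) -> int:
    """Claim 8: extend the spatiotemporal fingerprint, then combine bitwise."""
    extended = spatiotemporal << SEM_BITS       # expansion by left shift (assumed)
    return extended | semantic                  # bit operation: OR (assumed)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def comprehensive_similarity(sem_a: int, st_a: int, sem_b: int, st_b: int,
                             w_sem: float = 0.7) -> float:
    """Claim 9: weighted mix of semantic and spatiotemporal similarity (weights assumed)."""
    sem_sim = 1.0 - hamming_distance(sem_a, sem_b) / SEM_BITS
    st_sim = 1.0 if st_a == st_b else 0.0       # "consistency" read as equality (assumed)
    return w_sem * sem_sim + (1.0 - w_sem) * st_sim
```

With a general-purpose hash such as the one assumed here, the Hamming distance is small only when the core terms are identical; a locality-sensitive hash would be needed for the distance to track partial overlap, and the claims leave that choice open.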

Description

Text propagation traceability analysis method and device, electronic equipment and storage medium

Technical Field

The present application relates to the fields of natural language processing and big data analysis, and in particular to a traceability analysis method and device for text propagation, an electronic device, and a storage medium.

Background

Natural language processing and big data analysis technology aims to process continuously generated text data streams, accurately identify information propagation modes, track propagation paths, and locate information sources, providing core technical support for risk management and decision-making in business scenarios. In the related art, propagation analysis of massive streaming text mainly relies on two types of core technical routes. One is the content-hash-based clustering method, which extracts only text content features through algorithms such as similarity hash (SimHash) and minimum hash (MinHash) to generate hash fingerprints, and completes text clustering using similarity measures such as Hamming distance. However, these approaches struggle to meet practical requirements in streaming text propagation analysis and tracing scenarios: they cannot deliver real-time analysis, accurate and robust recognition, flexible mode judgment, or interpretable tracing results. As a result, when facing dynamically changing propagation scenarios, it is difficult to respond quickly, identify risks accurately, and provide a clear basis for intervention, which degrades the effect of streaming text propagation analysis and tracing.

Disclosure of Invention

In view of the above, the application provides a traceability analysis method and device for text propagation, an electronic device, and a storage medium, so as to solve the problem of poor streaming text propagation analysis and tracing results.
In a first aspect, the present application provides a method for traceability analysis of text propagation, including: acquiring streaming text data; extracting multidimensional features of the streaming text data, wherein the multidimensional features comprise statistical features, semantic features and spatiotemporal features; generating a composite fingerprint based on the multidimensional features; performing similarity matching and clustering based on the composite fingerprint and a dynamic propagation graph to obtain a clustering result; updating the dynamic propagation graph according to the clustering result; extracting propagation features of abnormal content cluster nodes among the content cluster nodes, and determining a propagation mode according to the propagation features; and locating a propagation source node through reverse traversal based on the propagation mode and the updated dynamic propagation graph, so as to generate a tracing result and a response instruction.
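
The graph-related steps of the method (updating the dynamic propagation graph, extracting propagation features, and locating the source by reverse traversal) can be sketched as follows. This is a simplified illustration under stated assumptions, not the applicant's implementation: the class and method names are hypothetical, propagation entropy is assumed to be Shannon entropy over each user's share of propagations, and the reverse traversal is reduced to a single step over incoming edges because every edge points from a user node to a content cluster node.

```python
import math
from collections import Counter, defaultdict


class PropagationGraph:
    """Bipartite graph with directed, timestamped edges: user -> content cluster."""

    def __init__(self):
        self.in_edges = defaultdict(list)     # cluster id -> [(user id, timestamp)]
        self.out_clusters = defaultdict(set)  # user id -> {cluster id}
        self.heat = defaultdict(int)          # cluster id -> propagation count

    def add_propagation(self, user, cluster, timestamp):
        """Add an edge user -> cluster and increase the cluster's heat."""
        self.in_edges[cluster].append((user, timestamp))
        self.out_clusters[user].add(cluster)
        self.heat[cluster] += 1

    def max_out_degree(self, cluster):
        """Largest out-degree among the user nodes attached to this cluster."""
        users = {user for user, _ in self.in_edges.get(cluster, [])}
        return max((len(self.out_clusters[u]) for u in users), default=0)

    def propagation_entropy(self, cluster):
        """Shannon entropy (assumed formula) of per-user propagation shares."""
        counts = Counter(user for user, _ in self.in_edges.get(cluster, []))
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return sum(-(c / total) * math.log2(c / total) for c in counts.values())

    def locate_source(self, cluster):
        """Walk incoming edges in reverse and return the earliest associated user."""
        edges = self.in_edges.get(cluster, [])
        if not edges:
            return None
        return min(edges, key=lambda edge: edge[1])[0]


graph = PropagationGraph()
graph.add_propagation("u1", "c1", 100)
graph.add_propagation("u2", "c1", 50)
graph.add_propagation("u2", "c2", 60)
graph.add_propagation("u1", "c1", 120)
source = graph.locate_source("c1")  # user with the earliest edge into c1
```

The burst coefficient and propagation depth are omitted because the text does not give formulas for them.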
In this method, streaming text data are acquired and statistical, semantic, and spatiotemporal multidimensional features are extracted. This breaks through the single-dimension limitation of traditional text-content features, effectively resists recognition failures caused by slight text modifications (such as rewriting and synonym replacement), and improves the comprehensiveness and anti-interference robustness of the feature representation. A composite fingerprint is generated from the multidimensional features, and similarity matching and clustering are performed with the composite fingerprint at the core, combined with a dynamic propagation graph containing user nodes and content cluster nodes, so that streaming data at the hundred-million scale can be clustered quickly and efficiently. The dynamic propagation graph is updated in real time from the clustering result, which keeps the propagation state synchronized with data changes and meets the low-latency, high-throughput requirements of real-time processing. The propagation mode is determined by extracting the propagation features of abnormal content cluster nodes, so different propagation types such as viral and star propagation can be recognized accurately without relying on labeled data, reducing model operation and maintenance costs and the cost of adapting to emerging propagation modes. Finally, locating the source node through reverse traversal and generating response instructions based on the propagation mode and the updated dynamic propagation graph improves the accuracy, interpretability, and real-time performance of text propagation source tracing.
Optionally, acquiring the streaming text data comprises acquiring text data tuples carrying propagation context information to obtain the streaming text data, wherein the propagation context information comprises text content, a timestamp, a sender identifier, a receiver identifier, and attribution information.
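
As an illustration of the text data tuple described above, the following Python sketch models one record of the stream. The field names, the field types, and the `stream_records` helper are hypothetical; the text only lists which pieces of propagation context a tuple carries.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class TextRecord:
    """One streaming text tuple with its propagation context (field names assumed)."""
    content: str       # text content
    timestamp: int     # send time, assumed to be Unix seconds
    sender_id: str     # sender identifier
    receiver_id: str   # receiver identifier
    attribution: str   # attribution (home location) information


def stream_records(rows: Iterable[dict]) -> Iterator[TextRecord]:
    """Lazily convert raw rows into TextRecord tuples, one at a time."""
    for row in rows:
        yield TextRecord(
            content=row["content"],
            timestamp=int(row["timestamp"]),
            sender_id=row["sender_id"],
            receiver_id=row["receiver_id"],
            attribution=row["attribution"],
        )


raw_rows = [
    {"content": "hello", "timestamp": "1700000000",
     "sender_id": "u1", "receiver_id": "u2", "attribution": "Guangzhou"},
]
records = list(stream_records(raw_rows))
```

A generator is used so that records can be consumed one by one as they arrive, which matches the streaming setting described in the text.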
Optionally