CN-116955537-B - Event context link generation method integrating news occurrence time and semantic similarity

Abstract

The invention discloses an event context link generation method integrating news occurrence time and semantic similarity. The method comprises: performing embedded characterization of the news text content and a numerical representation of the news occurrence time to complete preprocessing of the news set; dividing the preprocessed news set into windows according to occurrence time to form a plurality of news subsets; for the news subset in each window, combining the occurrence time and text content information of the news, calculating event cluster results with a clustering algorithm and evaluating them; selecting clusters that belong to the same category from the event cluster results of the individual windows and fusing them; selecting event cluster representative nodes from the fused cluster results to form a new round of the news set; and repeating the process on the newly formed news set until the final event context result is obtained. The invention improves the accuracy, interpretability, and efficiency of news event context generation.
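
As an illustration of the time preprocessing and window division summarized above, the following Python sketch parses occurrence times into timestamps and cuts the time-ordered news set into windows. The date format `%Y-%m-%d`, the fixed number of items per window, and the names `to_timestamp` and `split_into_windows` are assumptions made for this example; the text does not specify how a window is delimited.

```python
from datetime import datetime, timezone


def to_timestamp(text, fmt="%Y-%m-%d"):
    """Parse a date string in one fixed format into a Unix timestamp (UTC)."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def split_into_windows(news, window_size):
    """Sort (timestamp, text) pairs by time, then cut them into consecutive windows.

    Assumption: a window holds a fixed number of news items; a window could
    equally be defined by a time span.
    """
    ordered = sorted(news, key=lambda item: item[0])
    return [ordered[i:i + window_size] for i in range(0, len(ordered), window_size)]


# Toy news set: (occurrence date, text) pairs in arbitrary order.
raw = [("2023-08-04", "c"), ("2023-08-01", "a"), ("2023-08-02", "b")]
news = [(to_timestamp(date), text) for date, text in raw]
windows = split_into_windows(news, window_size=2)
```

With this toy input, the first window holds the two earliest items and the second window holds the remaining one.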

Inventors

  • ZHENG JIE
  • GU SHUANG
  • SHEN HONG
  • JIANG TAO
  • DU XING
  • CHEN XIYUAN
  • YANG SU
  • ZHOU ZHEN

Assignees

  • Suzhou Aerospace Information Research Institute (苏州空天信息研究院)

Dates

Publication Date
20260508
Application Date
20230804

Claims (4)

  1. An event context link generation method integrating news occurrence time and semantic similarity, characterized by comprising the following steps:

     Step 1, news data preprocessing: performing embedded characterization of the news text content and a numerical representation of the news occurrence time, to complete preprocessing of the news set;
     Step 2, news set window division: dividing the preprocessed news set into windows according to occurrence time to form a plurality of news subsets;
     Step 3, news event cluster calculation: for the news subset in each window, combining the occurrence time and text content information of the news, calculating event cluster results using a clustering algorithm, and evaluating the event cluster results;
     Step 4, news window event cluster fusion: selecting clusters belonging to the same class from the event cluster results calculated under each window and fusing them, and selecting event cluster representative nodes from the fused cluster results to form a new news set;
     Step 5, iterative updating of the news event context result: repeating steps 2 to 4 on the newly formed news set until the final event context result is obtained;

     wherein:

     in step 1, the news data is preprocessed as follows: TF-IDF is used to produce the embedded representation of the news text content; the news occurrence time is parsed into a timestamp according to a fixed format;

     in step 2, the news set is divided into windows as follows: the news occurrence timestamps are sorted in ascending order to obtain the order in which the news occurred; the sorted news set is divided by window to form a plurality of news subsets;

     in step 3, the news event clusters are calculated as follows:

     Step 3.1, calculating a semantic distance matrix of the news event set: after the preprocessing of step 1, the text content of the news event set is expressed as vectors, denoted X, and the semantic distance matrix of the news event set is calculated using the cosine distance:

       $D_x = 1 - X' X'^{T}$   (2)

     wherein X is the news text semantic matrix expressed by TF-IDF, each row of which is the text vector of one piece of news, X' is the normalized X matrix, and $D_x$ is the semantic distance matrix of the news event set;

     Step 3.2, calculating a time distance feature matrix between news events: taking one day as the minimum time granularity, the set of occurrence timestamps of the news events after the preprocessing of step 1 is denoted $\{t_1, t_2, \ldots, t_n\}$; first, a time matrix T of the news event set is constructed; elements of T whose time span is too long are then eliminated using a threshold; and the time distance feature matrix $D_t$ of the news event set is calculated, wherein e is the natural constant, i and j are matrix indices, and w is an adjustment coefficient that controls the degree of dispersion of the time feature distribution and defaults to 0.1;

     Step 3.3, fusing the semantic distance matrix and the time distance matrix: the semantic distance matrix obtained in step 3.1 and the time distance feature matrix obtained in step 3.2 are fused into a fused feature matrix $D_{mix}$ using a feature fusion weight, denoted $w_t$, whose default value is 0.5:

       $D_{mix} = w_t D_x + (1 - w_t) D_t$   (6)

     Step 3.4, dividing event clusters with a clustering algorithm and evaluating the division result: event clusters are divided with a hierarchical clustering algorithm according to the fused feature matrix $D_{mix}$ obtained in step 3.3, first taking multiple values within an interval as the number of cluster centers for the hierarchical clustering algorithm, the default interval for the number of cluster centers being [2, 10];

     in step 4, the news window event clusters are fused as follows: clusters belonging to the same class are selected from the event cluster results calculated under each window and fused, and an event cluster representative node is selected from each fused cluster to stand in for that event cluster, forming a new round of the news set, wherein the selection strategy for the event cluster representative node uses a calculation method based on distance density, and, assuming the number of nodes in the event cluster is N, the selection is calculated as:

       $s_i \in S,\quad S = \{s_1, s_2, \ldots, s_N\}$   (8)
       $k = \arg\max(S)$   (9)

     wherein i and j are the row and column numbers of the fused feature matrix, $s_i$ is the score of the i-th node, and k is the index of the node selected in the current event cluster; after the calculation each event cluster has one representative node, and the representative nodes of all event clusters are combined to form a new news set.
  2. An event context link generation system integrating news occurrence time and semantic similarity, characterized in that it realizes event context link generation integrating news occurrence time and semantic similarity based on the event context link generation method of claim 1, and specifically comprises the following modules:

     a news data preprocessing module, used for performing embedded characterization of the news text content and a numerical representation of the news occurrence time;
     a news set window dividing module, used for dividing the preprocessed news set into windows according to occurrence time to form a plurality of news subsets and completing news set preprocessing;
     a news event cluster calculation module, used for calculating, for the news subset in each window, event cluster results with a clustering algorithm by combining the occurrence time and text content information of the news, and evaluating the output results;
     a news window event cluster fusion module, used for selecting clusters belonging to the same class from the event cluster results calculated under each window and fusing them, and selecting event cluster representative nodes from the fused cluster results to form a new news set; and
     a news event context result iterative updating module, used for repeating news set window division, news event cluster calculation, and news window event cluster fusion on the newly formed news set, retaining the intermediate process links, until the final event context result is obtained.
  3. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements event context link generation integrating news occurrence time and semantic similarity based on the event context link generation method of claim 1.
  4. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements event context link generation integrating news occurrence time and semantic similarity based on the event context link generation method of claim 1.
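
Equations (2) and (6) of claim 1 can be illustrated with a minimal Python sketch. It assumes the news vectors are given as plain lists of numbers (for example TF-IDF weights) and that normalization means scaling each row to unit length. The time distance matrix `D_t` is supplied here as made-up values, because its formula is not reproduced in this text; the function names are illustrative only.

```python
import math


def cosine_distance_matrix(X):
    """Equation (2): D_x = 1 - X' X'^T, where X' is X with unit-length rows."""
    normed = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        # Assumption: an all-zero row is kept as zeros to avoid dividing by zero.
        normed.append([v / norm for v in row] if norm > 0 else list(row))
    return [
        [1.0 - sum(a * b for a, b in zip(r1, r2)) for r2 in normed]
        for r1 in normed
    ]


def fuse(D_x, D_t, w_t=0.5):
    """Equation (6): D_mix = w_t * D_x + (1 - w_t) * D_t, element by element."""
    return [
        [w_t * x + (1.0 - w_t) * t for x, t in zip(row_x, row_t)]
        for row_x, row_t in zip(D_x, D_t)
    ]


# Three toy news vectors; rows 0 and 2 point in the same direction.
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
D_x = cosine_distance_matrix(X)

# Made-up time distances, standing in for the matrix D_t of step 3.2.
D_t = [[0.0, 0.2, 0.4], [0.2, 0.0, 0.2], [0.4, 0.2, 0.0]]
D_mix = fuse(D_x, D_t)  # default fusion weight w_t = 0.5
```

In this toy case news items 0 and 2 have semantic distance 0, so their fused distance comes entirely from the time term, weighted by `1 - w_t`.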

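Equation (9) of claim 1 selects the representative node of an event cluster as the index with the highest score, k = argmax(S). The score formula itself is not reproduced in this text, so the Python sketch below uses a hypothetical stand-in score: the negative sum of a node's fused distances to the members of its cluster. The function name `representative_index`, the stand-in score, and the toy matrix are assumptions made for illustration, not the patented formula.

```python
def representative_index(D, members, score=None):
    """Return the member k = argmax(S) for one event cluster.

    D is a fused distance matrix and members lists the node indices of the
    cluster. A custom score function can be passed in; the default is a
    hypothetical stand-in, not the distance-density formula of the claim.
    """
    if score is None:
        def score(i):
            # Hypothetical score: nodes close to the rest of the cluster rank higher.
            return -sum(D[i][j] for j in members)
    scores = [score(i) for i in members]
    best = max(range(len(members)), key=lambda pos: scores[pos])
    return members[best]


# Toy fused distance matrix for four news items.
D = [
    [0.0, 0.2, 0.6, 0.9],
    [0.2, 0.0, 0.3, 0.8],
    [0.6, 0.3, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
clusters = [[0, 1, 2], [3]]

# One representative per event cluster forms the next round's news set.
new_news_set = [representative_index(D, members) for members in clusters]
```

Passing a different `score` function swaps in another selection rule without changing the argmax step.
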
Description

Event context link generation method integrating news occurrence time and semantic similarity

Technical Field

The invention belongs to the field of natural language processing technology, and particularly relates to an event context link generation method integrating news occurrence time and semantic similarity.

Background

With the rapid development of the internet, news events emerge in an endless stream, and mining and analyzing news events from ever-growing data is increasingly difficult. Organizing event context from massive news data is therefore important for the analysis and study of news events.

In the prior art [1,2,3,4], event context generation methods for news essentially analyze only text semantic information, whether the relation between news items is calculated at the word level or the interaction between news items is calculated at the sentence level. These methods calculate the similarity between news events and then simply sort semantically similar news by occurrence time to obtain the context of the news events. In general, the occurrence times of news events on the same topic follow a regular pattern on the time axis, but current event context generation methods work from semantic information and use occurrence time only as a means of sorting, ignoring the influence of occurrence time on the similarity measurement of news events and on the evolution of events. Furthermore, existing context generation schemes lack interpretability: the model gives only the final generated result, without the reasoning links of the context generation process. Moreover, for large news text sets, existing methods tend to run inefficiently.

[1] Zhou Xiaomin should be glowing, asparagus, Nie Qinqin, Dan Yi, Wang Yujie, Zhang Zhen, Wu Fei, Zhuo Caibiao, Fang Sian, Li Bo. An event context generating method and system incorporating deep semantic relationship classification [P]. Guangdong province: CN114265932A, 2022-04-01.
[2] Zhao Chongshuai, Chuangxiong, Gu Chengmin, Zhou Wei, Li Baoshan, Chen Zhigang. Event clustering/context construction methods and related apparatus, devices and storage media [P]. Anhui province: CN114357159A, 2022-04-15.
[3] Jiao Mengshu, Yao Shijie, Luo Jia, Lei Yuling, Du Lei. Methods and apparatus for event context generation and Medium [P]. Hunan province: CN115878761B, 2023-05-09.
[4] Lin Zhengyu, Shen Zhigang, Tang Zhongzhu, Zhou Zi, Cui Junjiao. A method and system for combing news event venues [P]. Jiangsu province: CN115964495A, 2023-04-14.

Disclosure of Invention

The invention aims to provide an event context link generation method integrating news occurrence time and semantic similarity, so as to improve the accuracy, interpretability, and efficiency of news event context generation.

The technical scheme for realizing the purpose of the invention is an event context link generation method integrating news occurrence time and semantic similarity, comprising the following steps:

Step 1, news data preprocessing: performing embedded characterization of the news text content and a numerical representation of the news occurrence time, to complete preprocessing of the news set;
Step 2, news set window division: dividing the preprocessed news set into windows according to occurrence time to form a plurality of news subsets;
Step 3, news event cluster calculation: for the news subset in each window, combining the occurrence time and text content information of the news, calculating event cluster results using a clustering algorithm, and evaluating the event cluster results;
Step 4, news window event cluster fusion: selecting clusters belonging to the same class from the event cluster results calculated under each window and fusing them, and selecting event cluster representative nodes from the fused cluster results to form a new news set;
Step 5, iterative updating of the news event context result: repeating steps 2 to 4 on the newly formed news set until the final event context result is obtained.

Further, in step 1, the news data is preprocessed as follows: TF-IDF is used to produce the embedded representation of the news text content; the news occurrence time is parsed into a timestamp according to a fixed format.

Further, in step 2, the news set is divided into windows as follows: the news occurrence timestamps are sorted in ascending order to obtain the order in which the news occurred; the sorted news set is cut by window to form a plurality of news subsets.

Further, in step 3, the news event clusters are calculated as follows:

Step 3.1, calculating a semantic distance matrix between news event sets, the text content of the news sets being represented as vectors, here denoted as vectors, after th