CN-122019772-A - Author name disambiguation method based on meta-path random walk network embedding and semantic characterization

Abstract

The invention discloses an author name disambiguation method based on meta-path random walk network embedding and semantic characterization. The features of papers are analyzed and divided into semantic features and discrete features. The discrete features are used to construct relations among the papers, node characterization vectors corresponding to each paper ID are obtained, and a paper relation similarity matrix is derived from them. The semantic features are used to obtain semantic characterization vectors of the papers, from which a paper semantic similarity matrix is derived. The two matrices are added and averaged to obtain a final paper similarity matrix, which is input into DBSCAN to obtain a pre-clustered paper set. An xgboost classifier is trained on the papers in the pre-clustered paper set according to the existing author names and is then used to reassign the papers in the outlier paper set to already clustered authors or to new authors. The outlier paper set and the pre-clustered paper set are then merged to obtain the final result for all papers.

Inventors

  • DU LIN
  • XU WEIJIE

Assignees

  • 河南省人才数字科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-21

Claims (10)

  1. An author name disambiguation method based on meta-path random walk network embedding and semantic characterization, characterized by comprising the following steps:
     S1, analyzing the features of papers and dividing them into semantic features and discrete features;
     S2, constructing relations among the papers by using the discrete features, and constructing a paper heterogeneous network;
     S3, generating a path set consisting of paper IDs by using meta-path-based random walks;
     S4, training on the path set with a skip-gram model, treating the path set as a corpus and the paper IDs as the vocabulary, and obtaining the node characterization vector corresponding to each paper ID;
     S5, extracting K path sets, training on the K path sets to obtain K word2vec models, and generating K groups of paper vectors;
     S6, computing a cosine similarity matrix for each group of paper vectors, and averaging the K similarity matrices to obtain the final paper relation similarity matrix;
     S7, combining the semantic features into a single text, and preprocessing the text to obtain processed words;
     S8, generating word vectors for the processed words with a word2vec model and averaging them to obtain the semantic characterization vector of each paper;
     S9, for each name to be disambiguated, obtaining the semantic characterization vectors of all papers, and, when none of the words of a paper exist in the word2vec model, setting its semantic characterization vector to 0 and storing the paper in an outlier paper set for later reprocessing;
     S10, after the paper relation similarity matrix and the paper semantic similarity matrix are obtained, adding the two matrices and averaging them to obtain the final paper similarity matrix;
     S11, inputting the paper similarity matrix into DBSCAN to obtain a pre-clustered paper set;
     S12, performing classification training on the papers in the pre-clustered paper set according to the existing author names by using the xgboost algorithm, and then reassigning the papers in the outlier paper set to already clustered authors or to new authors by using the xgboost classifier, to obtain the final disambiguation result.
  2. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization according to claim 1, wherein in step S1, semantic features are text features that carry semantic information and are converted into text semantic vectors by a semantic representation learning model; discrete features are features whose text carries little information value in itself: only when, for example, the same author appears in two papers does the feature indicate that the two papers share an author and are therefore highly similar; features of this kind are called discrete features and are used only to derive relations between papers.
  3. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization of claim 2, wherein the features of a paper include title, abstract, author, field, organization and keywords; title, venue, year and keywords are defined as semantic features, and author and organization are defined as discrete features.
  4. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization according to claim 3, wherein in step S3, the meta-path random walk is performed on the paper heterogeneous network: each node is selected in turn as the initial node, a random walk is performed along the edges between nodes, and the walked paths are saved as the training corpus for word2vec; "based on meta-paths" means that the walk along the edges is not completely random but is guided by the meta-path.
  5. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization according to claim 4, wherein during the random walk the walk follows the meta-path sequence p1 -> CoAuthor -> p2 -> CoOrg -> p3; at each step, the next node is chosen according to the edge type specified by the current position in the meta-path, by randomly selecting a node connected to the current node through an edge of that type; within each long path the meta-path is sampled repeatedly, the last node of one meta-path serving as the first node of the next, until a set number of iterations is reached; another node is then selected as the initial node for a new walk; a plurality of long paths is finally generated, in which every path point is a paper ID, and the long paths are stored one per line to form the training corpus.
  6. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization according to claim 5, wherein edge weights are taken into account when a node, under meta-path guidance, randomly selects the next node along an edge of a given type: the greater the weight, the closer the relationship between the two nodes and the greater the probability that the walk jumps along that edge; in each round, every node in the graph is used as a starting point for a random walk; the number of rounds is defined as numwalks, and a path set consisting of numwalks × N paths is generated, where N is the number of nodes.
  7. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization of claim 6, wherein in step S7, the combined text is preprocessed by first converting letters to lowercase and removing all symbols other than letters, then removing redundant whitespace, splitting words on whitespace, and removing stop words and words shorter than three characters.
  8. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization according to claim 7, wherein the semantic features are combined into a single text by joining the title, venue, organization, year and keywords of the paper with spaces.
  9. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization according to claim 8, wherein in step S2, the paper heterogeneous network is constructed from the author and organization features: for each name to be disambiguated, the relations among all papers associated with that name are extracted and a paper heterogeneous network is constructed, the network comprising one type of node and two types of edges, where each node represents one paper and the two edge types are CoAuthor and CoOrg.
  10. The method for author name disambiguation based on meta-path random walk network embedding and semantic characterization according to claim 9, wherein in step S12, the specific operation of reassigning papers in the outlier paper set to already clustered authors or new authors is: performing classification training on the papers in the pre-clustered paper set according to the existing author names by using the xgboost algorithm; classifying the outlier papers with the trained model to match them to existing authors or new authors; and, if the classification confidence is lower than a set threshold, treating the outlier paper as belonging to a new author.
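
The meta-path guided, weight-aware walk of claims 4 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: the helper names (add_edge, meta_path_walks), the nested-dictionary graph layout, the toy edge weights, the repeats parameter, and the choice to stop a walk at a node with no edge of the required type are all hypothetical.

```python
import random

# Edge types visited in order: p1 -> CoAuthor -> p2 -> CoOrg -> p3
META_PATH = ("CoAuthor", "CoOrg")


def add_edge(graph, edge_type, a, b, weight):
    """Add an undirected weighted edge of the given type between two paper IDs."""
    graph.setdefault(edge_type, {}).setdefault(a, {})[b] = weight
    graph.setdefault(edge_type, {}).setdefault(b, {})[a] = weight


def meta_path_walks(graph, papers, num_walks, repeats, seed=0):
    """Generate num_walks * len(papers) walks, each guided by the meta-path.

    graph[edge_type][paper_id] maps neighbor paper IDs to positive edge weights.
    Each walk chains `repeats` meta-path instances, the last node of one
    instance being the first node of the next.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):          # one round per value of numwalks
        for start in papers:            # every node is a starting point in each round
            walk = [start]
            current = start
            for edge_type in META_PATH * repeats:
                neighbors = graph.get(edge_type, {}).get(current, {})
                if not neighbors:
                    break               # assumed behavior: stop at a dead end
                ids = list(neighbors)
                weights = [neighbors[p] for p in ids]
                # Heavier edges are proportionally more likely to be followed.
                current = rng.choices(ids, weights=weights, k=1)[0]
                walk.append(current)
            walks.append(walk)
    return walks


# Hypothetical toy network of four papers sharing one ambiguous author name.
papers = ["p1", "p2", "p3", "p4"]
graph = {}
add_edge(graph, "CoAuthor", "p1", "p2", 2.0)
add_edge(graph, "CoAuthor", "p3", "p4", 1.0)
add_edge(graph, "CoOrg", "p2", "p3", 1.0)
add_edge(graph, "CoOrg", "p1", "p4", 1.0)
add_edge(graph, "CoOrg", "p1", "p3", 3.0)

walks = meta_path_walks(graph, papers, num_walks=3, repeats=2)
corpus_lines = [" ".join(walk) for walk in walks]  # one long path per line
print(len(walks), "walks; first:", corpus_lines[0])
```

Each line of corpus_lines plays the role of one sentence in the training corpus, with paper IDs as the vocabulary. Feeding the walks to a skip-gram trainer (for example gensim's Word2Vec with sg=1) would be one way to obtain the node vectors, though the source does not name a specific library.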

Description

Author name disambiguation method based on meta-path random walk network embedding and semantic characterization

Technical Field

The invention belongs to the technical field of information, and particularly relates to an author name disambiguation method based on meta-path random walk network embedding and semantic characterization.

Background

Disambiguation of same-name entities has long been considered a valuable and challenging problem in many fields, such as document management and social network analysis. For the disambiguation of same-name authors of papers, the aim is to correctly attribute each paper to the corresponding author profile by using the various information of the paper, such as the title, the authors, the institutions of the authors, the abstract and the keywords. Many researchers have proposed solutions to this same-name author disambiguation problem. These mainly include rule-based matching methods, and methods that learn features from the paper information, for example with representation learning, and then group the resulting feature vectors with a clustering method (e.g., hierarchical clustering, DBSCAN), so that similar papers are clustered together and dissimilar papers are placed in different categories. However, these solutions usually merge the pre-clustered paper sets, which makes it inconvenient to assign the papers in the outlier paper set to their respective authors and therefore to obtain all the papers of a same-name author. An author name disambiguation method based on meta-path random walk network embedding and semantic characterization is therefore proposed to solve the above problems.

Disclosure of Invention

The invention aims to provide an author name disambiguation method based on meta-path random walk network embedding and semantic characterization, which learns feature vectors of papers by using a heterogeneous network embedding technique based on meta-path random walks together with a word2vec-based semantic feature learning method, and assigns the papers to different authors by combining DBSCAN clustering with xgboost classification, so as to solve the problems described in the background.

In order to achieve the above purpose, the invention adopts the following technical scheme: an author name disambiguation method based on meta-path random walk network embedding and semantic characterization, comprising the following steps:

S1, analyzing the features of papers and dividing them into semantic features and discrete features;
S2, constructing relations among the papers by using the discrete features, and constructing a paper heterogeneous network;
S3, generating a path set consisting of paper IDs by using meta-path-based random walks;
S4, training on the path set with a skip-gram model, treating the path set as a corpus and the paper IDs as the vocabulary, and obtaining the node characterization vector corresponding to each paper ID;
S5, extracting K path sets, training on the K path sets to obtain K word2vec models, and generating K groups of paper vectors;
S6, computing a cosine similarity matrix for each group of paper vectors, and averaging the K similarity matrices to obtain the final paper relation similarity matrix;
S7, combining the semantic features into a single text, and preprocessing the text to obtain processed words;
S8, generating word vectors for the processed words with a word2vec model and averaging them to obtain the semantic characterization vector of each paper;
S9, for each name to be disambiguated, obtaining the semantic characterization vectors of all papers, and, when none of the words of a paper exist in the word2vec model, setting its semantic characterization vector to 0 and storing the paper in an outlier paper set for later reprocessing;
S10, after the paper relation similarity matrix and the paper semantic similarity matrix are obtained, adding the two matrices and averaging them to obtain the final paper similarity matrix;
S11, inputting the paper similarity matrix into DBSCAN to obtain a pre-clustered paper set;
S12, performing classification training on the papers in the pre-clustered paper set according to the existing author names by using the xgboost algorithm, and then reassigning the papers in the outlier paper set to already clustered authors or to new authors by using the xgboost classifier, to obtain the final disambiguation result.

Preferably, in step S1, the semantic features refer to text features with semantic information, which are converted into text semantic vectors through a semantic representation learning model; the discrete features refer to text information values of the text featur