CN-121997091-A - Paper classification method based on graph matching and self-supervision graph learning


Abstract

The invention discloses a paper classification method based on graph matching and self-supervised graph learning, and relates to the technical field of deep-learning-based document classification. The method represents literature data as a literature relation graph and constructs a graph learning model, ConGM, on that graph. Through subgraph sampling and data enhancement, linear node matching, secondary edge alignment, and double-layer negative sample selection, the model mines the citation and topic associations among documents, so that the field to which each paper belongs is classified accurately.

Inventors

LV XIAOQING
LIN HONGXIANG
HU HUIYING
HE ZHICHENG

Assignees

Peking University (北京大学)

Dates

Publication Date: 2026-05-08
Application Date: 2024-11-05

Claims (8)

1. A paper classification method based on graph matching and self-supervised graph learning, characterized in that literature data is first represented as a literature relation graph, and a graph learning model ConGM based on the literature relation graph is constructed, the model comprising a subgraph sampling and data enhancement module, a linear node matching module, a secondary edge alignment module, and a double-layer negative sample selection module, so that citation and topic associations among documents are mined and the field to which each paper belongs is classified accurately; the method comprises the following steps:
1) Constructing a literature relation graph according to the citation relations among the papers to be classified, wherein the nodes of the graph represent documents and the edges between nodes represent citation relations between papers;
Constructing the graph learning model ConGM based on the literature relation graph, namely the paper classification model, which comprises the following steps 2) to 5):
2) Subgraph sampling and data enhancement: two different nodes are extracted from the literature relation graph, an initial subgraph is established by a random walk method, and a plurality of enhanced subgraph views are generated from the initial subgraph by data enhancement;
3) Linear node matching: contrastive learning is performed on the document nodes of the literature relation graph to obtain matched node pairs and the enhanced subgraphs after node matching; specifically, positive sample pairs are constructed by generating a plurality of perturbations, and a structural linear loss and a contrastive learning loss are used, wherein minimizing the contrastive learning loss maximizes the similarity of positive sample pairs and minimizes the similarity of negative sample pairs, so that the structure and the node feature representations of the literature relation graph are learned;
4) Secondary edge alignment: according to the alignment relation of the edges, the node matching relations in the enhanced subgraphs after node matching are taken as the nodes of an edge center graph and the matching relations of the edges are taken as its connecting edges, so that the edge center graph is constructed and contrastive learning is performed on its edges;
5) Designing a double-layer negative sample selection strategy to further optimize the feature representation of the document nodes and to collect positive samples and difficult negative samples: positive sample pairs are formed for each node in the edge center graph; within the edge center graph, a node and the other nodes with which it shares no edge are taken as in-graph negative sample pairs; the most challenging negative samples are computed based on probability distributions to obtain cross-graph negative samples; and the positive samples and negative samples are used as model training samples;
6) Performing model training: a total loss function is designed, the total loss comprising a node matching loss and an edge alignment loss, and the model is trained to obtain a trained paper classification model that learns a more accurate node feature representation for each paper node;
7) Performing inference with the trained paper classification model to obtain the paper classification results.
2. The method for classifying papers based on graph matching and self-supervised graph learning as recited in claim 1, wherein in step 2), the data enhancement method includes node deletion, edge perturbation, and feature perturbation.
3. The method of classifying papers based on graph matching and self-supervised graph learning as recited in claim 1, wherein in step 3), the contrastive loss function L_n of node matching is expressed as formula (1): [formula (1) missing in the source text] where [symbol missing] represents the true matching between the two enhanced views G_{u1}, G_{u2}; [symbol missing] is the node-to-node similarity matrix between the two views; and τ is a hyperparameter controlling the degree of smoothing. Minimizing the contrastive loss maximizes the inconsistency of negative sample pairs while maximizing the consistency of positive sample pairs, so that the correct matching relation between the nodes is learned.
4. The method for classifying papers based on graph matching and self-supervised graph learning as recited in claim 3, wherein in step 4), an edge center graph G_u is constructed and node features are mapped onto the edge center graph; the node set V_u = V_{u1} × V_{u2} of the edge center graph represents the correspondences between nodes in the enhanced subgraph G_{u1} and nodes in the enhanced subgraph G_{u2}, i.e. [expression missing in the source text]; and the edges [symbol missing] represent the correspondence between two pairs of correspondences, i.e. [expression missing in the source text].
5. The method for classifying papers based on graph matching and self-supervised graph learning as recited in claim 4, wherein in step 4), the edge contrastive loss function [symbol missing] is expressed as: [formula missing in the source text] where h denotes node features, the function s denotes cosine similarity, [symbol missing] denotes the node features obtained through learning, V is the total number of nodes in the literature relation graph, i.e. the number of papers, and V+ and V− denote the set of positive sample pairs and the set of negative sample pairs, respectively.
6. The method for classifying papers based on graph matching and self-supervised graph learning as set forth in claim 5, wherein in step 5), the double-layer negative sample selection strategy identifies and selects difficult negative sample pairs within the same graph according to the graph matching method, and specifically includes: when constructing positive sample pairs, taking misaligned node pairs as negative sample pairs; and, in addition to the in-graph selection of positive and negative sample pairs, selecting the most challenging negative sample pairs across different graphs, as follows: selecting another node v in the original graph G as a central node, and generating an initial subgraph G_v' and enhanced subgraphs G_{v1}, G_{v2} in the same manner as for the anchor node u; constructing a new edge center graph G_v based on G_{v1}, G_{v2}, and calculating the probability distribution functions f_{s|u}(s) and f_{s|v}(s) of the node features in the two edge center graphs, where s is any node; and comparing the two probability distributions, wherein a node pair is selected as a difficult negative sample of the current node if its probability in one graph exceeds a threshold T and it is a negative sample of the current node in the other graph.
7. The method of claim 5, wherein in step 6), the node matching loss L_n is used to maximize the consistency of positive sample pairs and the edge alignment loss L_e is used to enhance the similarity between connected nodes; during training, the total loss L is obtained by jointly optimizing the node matching loss L_n and the edge alignment loss L_e, expressed as: L = L_n + β·L_e (3), where β is the loss factor that balances the two losses.
8. The method for classifying papers based on graph matching and self-supervised graph learning as recited in claim 7, wherein the loss factor value in step 6) is 0.1.
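
The edge center graph of claims 1 and 4 can be illustrated with a minimal Python sketch. The exact edge definition is missing from the extracted claim text, so the sketch assumes the standard association-graph construction used in graph matching: each vertex is a candidate correspondence (a, b) with a from G_{u1} and b from G_{u2}, and two correspondences (a, b) and (c, d) are linked when a–c is an edge in G_{u1} and b–d is an edge in G_{u2}. The function name `edge_center_graph` and the toy graphs are hypothetical and are not taken from the patent.

```python
from itertools import product

def edge_center_graph(nodes1, edges1, nodes2, edges2):
    """Build an association-style graph over V1 x V2 (assumed construction).

    Vertices are candidate node correspondences (a, b). Two vertices
    (a, b) and (c, d) are linked when {a, c} is an edge of the first
    view and {b, d} is an edge of the second view.
    """
    und1 = {frozenset(e) for e in edges1}  # undirected edges of view 1
    und2 = {frozenset(e) for e in edges2}  # undirected edges of view 2
    vertices = list(product(sorted(nodes1), sorted(nodes2)))
    links = set()
    for (a, b), (c, d) in product(vertices, repeat=2):
        if frozenset((a, c)) in und1 and frozenset((b, d)) in und2:
            links.add(frozenset(((a, b), (c, d))))
    return vertices, links

# Toy views: a 3-node path and a 2-node path (illustrative data only).
view1_nodes, view1_edges = {0, 1, 2}, {(0, 1), (1, 2)}
view2_nodes, view2_edges = {"x", "y"}, {("x", "y")}

vertices, links = edge_center_graph(view1_nodes, view1_edges,
                                    view2_nodes, view2_edges)
```

Under this assumed construction, a link in the edge center graph says that two node correspondences are mutually consistent with the edge structure of both views, which is the kind of relation an edge-level contrastive loss could then reward.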

Description

Paper classification method based on graph matching and self-supervision graph learning

Technical Field

The invention relates to the technical field of deep-learning-based document classification, and in particular to a paper classification method based on graph matching and self-supervised graph learning.

Background

As the volume of scientific literature continues to grow, efficiently classifying and retrieving papers by field has become an important challenge. Conventional paper classification methods typically rely on text-based techniques, i.e., classification based on keywords and text features of the paper. Such methods are not effective on document data that carries complex relational and structural information. In academic literature in particular, complex citation and topic relations exist among papers; classifying on text content alone cannot make full use of this structural information, so classification accuracy is difficult to improve.

With the rise of graph neural networks (GNNs), many works have applied GNNs to paper classification, focusing on mining citation relations between papers by building a paper relation graph with papers as nodes and applying graph learning algorithms. However, existing graph learning methods for paper classification (e.g., GCN, GAT) typically require class labels for the papers in order to train the model, which incurs additional data labeling cost. In contrast, graph self-supervised methods can learn node representations directly from the graph structure without any label information. Graph contrastive learning (GCL) is one such self-supervised method suitable for GNNs. It learns effective representations of nodes or graphs by constructing positive and negative sample pairs, maximizing the similarity of the positive pairs while minimizing the similarity of the negative pairs.
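
The positive/negative pair objective described above can be sketched as follows. This is a minimal illustration that assumes an InfoNCE-style loss with cosine similarity and a temperature `tau`; the source text does not give the loss in this form, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style loss for one anchor (assumed form, not the patent's).

    The loss is small when the anchor is close to its positive and far
    from every negative, and grows otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 2-d embeddings (illustrative data only).
anchor = [1.0, 0.0]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = info_nce(anchor, [0.9, 0.1], negatives)     # positive near anchor
loss_misaligned = info_nce(anchor, [0.0, 1.0], negatives)  # positive far from anchor
```

In this sketch the loss for the well-aligned positive is lower than for the misaligned one, which is the behaviour that minimizing a contrastive loss is meant to encourage.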
In the paper classification task, the value of GCL lies in its ability to capture the complex structure and feature information of the nodes (i.e., the papers) in the paper relation graph, which supports the subsequent classification task. Despite some progress in existing research, GCL still faces several challenges when applied to paper classification.

First, conventional GCL implementations typically use a contrastive loss in which the positive sample pair for each anchor is formed by generating different enhanced views of the same node, while neighboring nodes are treated as negative samples to be distinguished from the anchor. However, when processing graph data, GNNs rely on a neighbor aggregation mechanism which assumes that connected nodes tend to have similar features or labels (the homophily assumption). The goal of the contrastive loss is to distinguish between different instances, which conflicts with the homophily assumption of GNNs.

The second challenge is how to avoid losing critical structural information during graph data enhancement. Common data enhancement methods include node deletion, edge perturbation, and the like, which create diversified views so that the model can learn graph features from different angles. However, these enhancement strategies may discard key structural information of the graph, so that the structure and feature representations learned by the model become inconsistent with the actual graph, which degrades performance and accuracy in the downstream paper classification task.
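
The enhancement operations named above (node deletion and edge perturbation, plus the feature perturbation listed in claim 2) can be sketched in plain Python. The probabilities, helper names, and toy graph are assumptions made for illustration; the source does not specify how each operation is parameterized.

```python
import random

def drop_nodes(nodes, edges, p, rng):
    """Delete each node with probability p, along with its incident edges."""
    kept = {n for n in nodes if rng.random() >= p}
    kept_edges = {(a, b) for a, b in edges if a in kept and b in kept}
    return kept, kept_edges

def perturb_edges(nodes, edges, p, rng):
    """Remove each edge with probability p, then add as many new random edges."""
    kept = {e for e in edges if rng.random() >= p}
    n_removed = len(edges) - len(kept)
    ordered = sorted(nodes)
    # Candidate new edges: node pairs that were not connected originally.
    candidates = [(a, b) for i, a in enumerate(ordered)
                  for b in ordered[i + 1:] if (a, b) not in edges]
    rng.shuffle(candidates)
    return kept | set(candidates[:n_removed])

def mask_features(features, p, rng):
    """Zero out each feature entry independently with probability p."""
    return {n: [0.0 if rng.random() < p else x for x in vec]
            for n, vec in features.items()}

# Toy citation graph: a 5-node path, edges stored as (small, large) pairs.
rng = random.Random(0)
nodes = {0, 1, 2, 3, 4}
edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
features = {n: [1.0, 2.0, 3.0] for n in nodes}

view_nodes, view_edges = drop_nodes(nodes, edges, 0.2, rng)
perturbed_edges = perturb_edges(nodes, edges, 0.5, rng)
masked = mask_features(features, 0.3, rng)
```

The sketch makes the risk concrete: deleting a single node on this path graph also removes every edge touching it, so an enhanced view can lose exactly the citation links that connect related papers.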

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a paper classification method based on graph matching and self-supervised graph learning, which trains on the structure of a literature relation graph to obtain paper classifications. Literature data is first represented as a literature relation graph, and a new graph learning model for the paper relation graph (ConGM for short) is proposed. The model uses the strong representation capability of graph neural networks to mine the citation and topic associations among documents in depth, so that the field to which each paper belongs is classified more accurately. The method performs well in the scientific paper classification task and provides a new and effective solution for the field of document classification and retrieval.

For example, using the method of the invention, a paper in the semantic perception branch of computer vision can be assigned to categories such as "computer vision-classification", "computer vision-detection", "computer vision-recognition", "computer vision-segmentation", "computer vision-object tracking", etc. The invention can be used for solving the automatic classification problem of scientific papers, patent documents, pictures and the like with association or