CN-116186622-B - Network data classification method, system, electronic equipment and storage medium


Abstract

The invention discloses a network data classification method in the field of network data classification, applied to a relational classifier. When the neighbor set of a node in the network data is acquired, an activated diffusion algorithm appropriately expands the range of the neighborhood. This removes the limit on classification information that results from acquiring only the direct neighborhood, a simplification based on the first-order Markov assumption. Changing the way the neighborhood is acquired expands the range of neighbor nodes considered when a node is classified, so that more classification information is available when the reference vector of each category is constructed, which improves the homogeneity of the nodes and the classification precision. The category probabilities of the unlabeled nodes are then iterated by a collaborative reasoning method, which further improves the accuracy of the category probabilities finally obtained for the unlabeled nodes. The invention also discloses a network data classification system, electronic equipment and a storage medium, which have the same beneficial effects as the network data classification method.

Inventors

DONG LI
WEI XIAOHUI
WU QI
YU HONGMEI
LIU JIE
XU HAIXIAO
LI XIANG
OUYANG RUOCHUAN

Assignees

Jilin University (吉林大学)

Dates

Publication Date: 20260508
Application Date: 20230327

Claims (7)

1. A network data classification method, applied to a relational classifier, the method comprising: obtaining network data, wherein the network data is constructed from papers connected by citation relations and is represented as a graph, the papers are represented by nodes in the network data, the citation relations among the papers are represented by edges in the network data, and the network data comprises labeled papers and unlabeled papers; processing the labeled paper by using an activated diffusion algorithm to obtain a first neighbor set of the labeled paper, and determining a reference vector corresponding to the labeled paper based on the first neighbor set, wherein the reference vector records the statistical category probability of each label corresponding to the labeled paper; processing the unlabeled paper by using the activated diffusion algorithm to obtain a second neighbor set of the unlabeled paper, and determining a category vector corresponding to the unlabeled paper based on the second neighbor set, wherein the category vector records the estimated category probability of each label corresponding to the unlabeled paper; comparing the similarity of the reference vector and the category vector, and determining the current category probability of the unlabeled paper based on the similarity; and iterating the current category probability by a collaborative reasoning method to determine the final category probability of the unlabeled paper; wherein determining the reference vector corresponding to the labeled paper based on the first neighbor set comprises: determining the actual category probability of each label corresponding to the labeled paper based on the first neighbor set; and determining the average value of the actual category probabilities corresponding to the same label in each labeled paper as the statistical category probability corresponding to that label, so as to obtain the reference vector; wherein iterating the current category probability by the collaborative reasoning method comprises: adjusting the current category probability according to a preset weight by using a simulated annealing algorithm; and iteratively updating the adjusted current category probability by a relaxation labeling method; and wherein the first neighbor set includes neighbor nodes adjacent to the labeled paper and, for each neighbor node, a weight that represents its degree of closeness to the labeled paper.
2. The network data classification method of claim 1, further comprising, after acquiring the network data: obtaining the labels of the labeled papers and calculating the probability distribution of the labels; and initializing the label of the unlabeled paper as the label corresponding to the maximum probability in the probability distribution.
3. The network data classification method of claim 1, wherein acquiring the network data comprises: acquiring the relations between nodes in the network data; and acquiring the labeled papers and their labels in the network data, and acquiring the unlabeled papers in the network data.
4. The network data classification method according to any one of claims 1 to 3, further comprising: dividing pre-collected historical labeled papers into a training set and a testing set; and constructing the relational classifier using the historical labeled papers in the training set and their labels and the historical labeled papers in the testing set.
5. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the steps of the network data classification method according to any one of claims 1 to 4.
6. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the network data classification method according to any one of claims 1 to 4.
7. A network data classification system, applied to a relational classifier, the system comprising: a data acquisition module, configured to acquire network data, wherein the network data is constructed from papers connected by citation relations and is represented as a graph, the papers are represented by nodes in the network data, the citation relations among the papers are represented by edges in the network data, and the network data comprises labeled papers and unlabeled papers; a reference vector determining module, configured to process the labeled paper by using an activated diffusion algorithm to obtain a first neighbor set of the labeled paper, and to determine a reference vector corresponding to the labeled paper based on the first neighbor set, wherein the reference vector records the statistical category probability of each label corresponding to the labeled paper; a category vector determining module, configured to process the unlabeled paper by using the activated diffusion algorithm to obtain a second neighbor set of the unlabeled paper, and to determine a category vector corresponding to the unlabeled paper based on the second neighbor set, wherein the category vector records the estimated category probability of each label corresponding to the unlabeled paper; a comparison module, configured to compare the similarity between the reference vector and the category vector and to determine the current category probability of the unlabeled paper based on the similarity; and an iteration module, configured to iterate the current category probability by a collaborative reasoning method so as to determine the final category probability of the unlabeled paper; wherein the reference vector determining module comprises a first neighbor set obtaining module and a corresponding reference vector module, the first neighbor set obtaining module being configured to process the labeled paper by using the activated diffusion algorithm so as to obtain the first neighbor set of the labeled paper, and the corresponding reference vector module being configured to determine the reference vector corresponding to the labeled paper based on the first neighbor set; the corresponding reference vector module comprises: an actual category probability determining module, configured to determine the actual category probability of each label corresponding to the labeled paper based on the first neighbor set; and a reference vector determining sub-module, configured to determine the average value of the actual category probabilities corresponding to the same label in each labeled paper as the statistical category probability corresponding to that label, so as to obtain the reference vector; the iteration module comprises: a weight adjustment module, configured to adjust the current category probability according to a preset weight by using a simulated annealing algorithm; and an iteration sub-module, configured to iteratively update the adjusted current category probability by a relaxation labeling method; and the first neighbor set includes neighbor nodes adjacent to the labeled paper and, for each neighbor node, a weight that represents its degree of closeness to the labeled paper.
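The neighbor-set, reference-vector and similarity steps recited in claims 1 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claims do not fix how activation decays, when it stops spreading, or which similarity measure is used, so the parameters `decay`, `threshold` and `max_hops`, the choice of cosine similarity, and all function names below are assumptions made for illustration. The sketch also assumes one reference vector per category, as the abstract describes, and treats "activated diffusion" as what is commonly called spreading activation.

```python
import math

def spreading_activation(graph, start, decay=0.5, threshold=0.2, max_hops=3):
    """Return {node: weight} for nodes activated from `start` (start excluded).

    Assumed scheme: activation 1.0 at `start` is multiplied by `decay` on
    each hop, a value below `threshold` is not passed on, and every node
    keeps the strongest activation it receives.
    """
    activation = {start: 1.0}
    frontier = {start}
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            passed = activation[node] * decay
            if passed < threshold:
                continue  # too weak to spread further
            for nb in graph.get(node, ()):
                if passed > activation.get(nb, 0.0):
                    activation[nb] = passed
                    next_frontier.add(nb)
        frontier = next_frontier
    del activation[start]
    return activation

def class_distribution(neighbors, labels, classes):
    """Weighted share of each category among the labeled neighbors."""
    totals = {c: 0.0 for c in classes}
    for node, weight in neighbors.items():
        if node in labels:
            totals[labels[node]] += weight
    norm = sum(totals.values())
    if norm == 0:
        return {c: 1.0 / len(classes) for c in classes}  # no evidence
    return {c: v / norm for c, v in totals.items()}

def reference_vectors(graph, labels, classes):
    """Average the distributions of all labeled nodes that share a label."""
    sums = {c: {k: 0.0 for k in classes} for c in classes}
    counts = {c: 0 for c in classes}
    for node, label in labels.items():
        dist = class_distribution(spreading_activation(graph, node), labels, classes)
        for k in classes:
            sums[label][k] += dist[k]
        counts[label] += 1
    return {c: {k: sums[c][k] / counts[c] for k in classes}
            for c in classes if counts[c]}

def cosine(u, v, classes):
    dot = sum(u[c] * v[c] for c in classes)
    norm_u = math.sqrt(sum(u[c] ** 2 for c in classes))
    norm_v = math.sqrt(sum(v[c] ** 2 for c in classes))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def current_category_probability(graph, node, labels, classes, refs):
    """Normalize the similarities between the node's vector and each reference."""
    vec = class_distribution(spreading_activation(graph, node), labels, classes)
    sims = {c: cosine(vec, refs[c], classes) for c in refs}
    total = sum(sims.values())
    if total == 0:
        return {c: 1.0 / len(refs) for c in refs}
    return {c: s / total for c, s in sims.items()}

# Toy citation graph (undirected adjacency lists); "u" and "x" are unlabeled.
graph = {
    "a1": ["a2", "a3"], "a2": ["a1", "a3", "u"], "a3": ["a1", "a2"],
    "b1": ["b2", "b3"], "b2": ["b1", "b3"], "b3": ["b1", "b2", "x"],
    "x": ["b3", "u"], "u": ["a2", "x"],
}
labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
classes = ["A", "B"]
refs = reference_vectors(graph, labels, classes)
probs = current_category_probability(graph, "u", labels, classes, refs)
print(probs)
```

In this toy graph the unlabeled node `u` has only one labeled direct neighbor, but its expanded neighbor set also reaches `a1`, `a3` and `b3` with lower weights, so indirect neighbors contribute to its category vector.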

Description

Network data classification method, system, electronic equipment and storage medium

Technical Field

The present invention relates to the field of network data classification, and in particular to a network data classification method, system, electronic device, and storage medium.

Background

With the development of internet information technology and of new data acquisition technologies, traditional classifiers, which assume independent and identically distributed data entities, can no longer meet the classification requirements of network data whose entities are related to one another. Traditional independent data instances include sales records and the data in a single table of a database, whereas network data includes websites and hypertext. Classification is one of the main tasks in network data mining, and most classification methods for network data are currently based on the homogeneity assumption (homophily assumption), i.e. interconnected entities tend to belong to the same category. This pattern is common in the observation and theory of social networks; for example, people tend to group according to their race or ethnicity. A large number of experiments show that the classification quality of a relational classifier for network data depends on the degree of homogeneity of the network data to be classified, so the relational classifier can reduce its classification error rate only if the degree of homogeneity of the network data is increased.
However, most relational methods based on the homogeneity assumption are simplified according to the first-order Markov assumption. In addition, because it is not realistic to consider the labels of all neighbor nodes of a node to be classified, such methods usually consider only the neighborhood directly connected to a data instance in the network data. This limits the information available to the classification method: an unlabeled node can be classified only through the neighbor nodes directly connected to it. The first-order Markov assumption can be understood as follows: when a node in the network data is classified, the evaluation is based only on the labels, i.e. the categories, of the neighbor nodes directly connected to that node, and indirectly connected nodes are disregarded. In the prior art, unlabeled nodes are classified by combining a homogeneity-based relational classifier under the first-order Markov assumption with a collaborative reasoning method. The probability that an unlabeled node belongs to each category is determined from the labels of the neighbor nodes directly connected to it; the category probabilities are then iterated by the collaborative reasoning method, so that multiple classification results are estimated simultaneously; and the final category of the unlabeled node is determined as the label corresponding to the maximum value in the category probability distribution.
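The prior-art scheme described above can be sketched in Python as a first-order neighbor vote combined with an iterative update. This is a minimal illustration, not a reference implementation. It assumes an unweighted vote over direct neighbors and a relaxation-labeling update whose mixing weight `beta` shrinks by a factor `alpha` on each iteration. The parameters `beta`, `alpha` and `iterations`, the use of the label prior as the starting distribution, and the function names are all assumptions. Every node is assumed to appear as a key of `graph`.

```python
def neighbor_vote(graph, node, probs, classes):
    """First-order estimate: pool the category probabilities of direct neighbors."""
    totals = {c: 0.0 for c in classes}
    for nb in graph.get(node, ()):
        for c in classes:
            totals[c] += probs[nb][c]
    norm = sum(totals.values())
    if norm == 0:
        return dict(probs[node])  # isolated node: keep current estimate
    return {c: v / norm for c, v in totals.items()}

def relaxation_labeling(graph, labels, classes, iterations=20, beta=1.0, alpha=0.9):
    """Iterate category probabilities of unlabeled nodes; labeled nodes stay fixed."""
    # Prior: label frequencies among the labeled nodes (assumed starting point).
    counts = {c: 0 for c in classes}
    for label in labels.values():
        counts[label] += 1
    prior = {c: counts[c] / len(labels) for c in classes}

    probs = {}
    for node in graph:
        if node in labels:
            probs[node] = {c: 1.0 if c == labels[node] else 0.0 for c in classes}
        else:
            probs[node] = dict(prior)

    unlabeled = [node for node in graph if node not in labels]
    for _ in range(iterations):
        beta *= alpha  # annealing-style decay of the update weight
        # Estimate all nodes from the previous round before updating any of them.
        estimates = {u: neighbor_vote(graph, u, probs, classes) for u in unlabeled}
        for u in unlabeled:
            probs[u] = {c: beta * estimates[u][c] + (1.0 - beta) * probs[u][c]
                        for c in classes}
    return probs

# Toy example: "u" cites two papers labeled "A" and one labeled "B".
star = {"p1": ["u"], "p2": ["u"], "p3": ["u"], "u": ["p1", "p2", "p3"]}
star_labels = {"p1": "A", "p2": "A", "p3": "B"}
result = relaxation_labeling(star, star_labels, ["A", "B"])
print(max(result["u"], key=result["u"].get))
```

Because each estimate uses only `graph[node]`, a node whose direct neighbors are all unlabeled receives label information only indirectly, through later iterations.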
However, this method is still based on the first-order Markov assumption: an unlabeled node can be classified only through the neighbor nodes directly connected to it. This is a significant limitation, because the information useful for classification that is contained in indirectly connected neighbor nodes is ignored. For network data with low homogeneity, not enough classification information can be obtained, and the accuracy of the final classification result is relatively low.

Disclosure of Invention

The invention aims to provide a network data classification method, a system, electronic equipment and a storage medium. When the neighbor set of a node in the network data is acquired, an activated diffusion algorithm appropriately expands the range of the neighborhood, which removes the limit on classification information that results from acquiring only the direct neighborhood, a simplification based on the first-order Markov assumption. Changing the way the neighborhood is acquired expands the range of neighbor nodes considered when a node is classified: the useful classification information contained in indirect neighbor nodes is taken into account, and nodes that have a certain correlation with the node to be classified are selected. As a result, more classification information is acquired when the reference vector for each category is constructed, which improves the homogeneity of the nodes and the classification precision. The category probabilities of the unlabeled nodes are then iterated by a collaborative reasoning method, which further improves the accuracy of the category probabilities finally obtained for the unlabeled nodes. In order to solve the above technical problems, the invention provides a network data classification method, which is applied to a relational classifier, and comprises the foll