
CN-121980403-A - Label propagation algorithm based on triple integrated similarity

CN 121980403 A

Abstract

The invention discloses a label propagation algorithm based on triple integrated similarity. The method comprises the steps of: 1, representing the original data set as similarity graphs under three different views, based on the three measures of distance, correlation coefficient, and density, and representing all known label information in the data set as a label matrix using one-hot encoding; 2, solving the corresponding three local similarity graphs and three local kernel matrices; 3, solving the final triple integrated similarity graph iteratively; 4, transforming the off-diagonal and diagonal elements of the triple integrated similarity graph to obtain a standard propagation matrix; and 5, propagating the label information from the labeled samples to the unlabeled samples iteratively, wherein the index of the maximum value for each sample in the label matrix is the category to which that sample is assigned. The method adapts better to the distribution of the data set and offers greater robustness.
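The one-hot label matrix of step 1 can be sketched as below. This is a minimal illustration, not the patent's own code; the function name and the convention of using -1 for unlabeled samples are our assumptions.

```python
import numpy as np

def one_hot_label_matrix(labels, n_classes):
    """Build the (n_samples, n_classes) label matrix of step 1.

    `labels` holds a class index for each labeled sample and -1 for
    unlabeled ones; unlabeled rows stay all-zero, as the patent describes.
    """
    Y = np.zeros((len(labels), n_classes))
    for i, y in enumerate(labels):
        if y >= 0:          # labeled sample: one-hot row
            Y[i, y] = 1.0
    return Y

# Five samples, three classes; samples 2 and 4 are unlabeled.
Y = one_hot_label_matrix([0, 2, -1, 1, -1], n_classes=3)
```

Keeping unlabeled rows at zero lets the same matrix serve as both the initial state and the clamping term during propagation.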

Inventors

  • SUN YICHEN
  • PING ZHENYU
  • ZHAO TIANCHI
  • SHAO PINGPING

Assignees

  • 江苏信息职业技术学院 (Jiangsu Vocational College of Information Technology)

Dates

Publication Date
2026-05-05
Application Date
2026-01-09

Claims (6)

  1. A label propagation algorithm based on triple integrated similarity, comprising the steps of: Step 1, representing the original data set as similarity graphs under three different views based on the three measures of distance, correlation coefficient, and density, and representing all known label information in the data set as a label matrix using one-hot encoding; Step 2, according to the similarity graphs under the three different views obtained in Step 1, solving the corresponding three natural kernel matrices, and, according to the similarity graphs obtained in Step 1 and a K-nearest-neighbor method, solving the corresponding three local similarity graphs and three local kernel matrices; Step 3, setting an initialization state matrix, a maximum iteration number, and an iteration termination condition according to the natural kernel matrices and local kernel matrices obtained in Step 2, and solving the final triple integrated similarity graph iteratively, wherein any off-diagonal element of the triple integrated similarity graph simultaneously encodes sample-pair information based on distance, correlation coefficient, and density; Step 4, transforming the off-diagonal and diagonal elements of the triple integrated similarity graph obtained in Step 3 based on the topological properties of the Hessian matrix, to obtain a standard propagation matrix, so that the distance, correlation coefficient, and density information of any sample pair is fully preserved while the requirements of label propagation are met; and Step 5, according to the one-hot label matrix obtained in Step 1 and the propagation matrix obtained in Step 4, propagating the label information from the labeled samples to the unlabeled samples iteratively based on a Fick's-law-assisted propagation rule, and terminating the iteration when the iteration process reaches the maximum iteration number or meets the convergence condition, wherein the index of the maximum value for each sample in the label matrix is the category to which that sample is assigned.
  2. The label propagation algorithm based on triple integrated similarity according to claim 1, wherein Step 1 specifically comprises: Step 1.1, dividing all samples in the data set into a labeled sample set and an unlabeled sample set according to the presence or absence of labels, each sample consisting of the same number of dimensional features, with the total number of categories given [symbol definitions omitted in source]; Step 1.2, representing all samples under the distance view as an undirected weighted graph whose adjacency matrix elements are computed by a Gaussian kernel function [formula omitted in source], the bandwidth of the Gaussian kernel being set to the median Euclidean distance over all samples, and taking the resulting adjacency matrix as the first initial similarity graph; Step 1.3, representing all samples under the correlation view as an undirected weighted graph whose adjacency matrix elements are given by the Pearson correlation coefficient formula, involving the variances of the samples [formula omitted in source], and taking the resulting adjacency matrix as the second initial similarity graph; Step 1.4, representing all samples under the density view as an undirected weighted graph whose adjacency matrix elements are computed by a Gaussian kernel function based on the local distance [formula omitted in source], the bandwidth being set to the median local distance over all samples, and taking the resulting adjacency matrix as the third initial similarity graph, wherein the local distance of a sample is given by [formula omitted in source], the positive integer k being a user-given hyperparameter representing the size of the local neighborhood, and the local distance involving the Euclidean distance between the sample and its k-th nearest neighbor; Step 1.5, converting the label values of all labeled samples into one-hot encodings whose dimension equals the number of categories, setting the label values of all unlabeled samples to a zero matrix of the same dimension, and concatenating the two to obtain the one-hot encoded label matrix.
  3. The label propagation algorithm based on triple integrated similarity according to claim 2, wherein Step 2 specifically comprises: Step 2.1, for the three initial similarity graphs obtained in Step 1, computing the corresponding natural kernel matrices, whose elements are given by [formula omitted in source], subject to the stated normalization condition; Step 2.2, for the three adjacency matrices obtained in Step 1, computing the corresponding local similarity graphs by the K-nearest-neighbor method, whose elements are given by [formula omitted in source]; Step 2.3, based on the local similarity graphs obtained in Step 2.2, computing the corresponding local kernel matrices, whose elements are given by [formula omitted in source].
  4. The label propagation algorithm based on triple integrated similarity according to claim 3, wherein Step 3 specifically comprises: Step 3.1, for the three natural kernel matrices obtained in Step 2.1, setting an initialization state matrix, a maximum iteration number, and an iteration termination condition, wherein the positive termination threshold is a user-given hyperparameter; Step 3.2, iterating over the three natural kernel matrices obtained in Step 2.1 and the three local kernel matrices obtained in Step 2.3 to solve the state matrix at each step, the iteration being given by [formula omitted in source]; Step 3.3, terminating the iteration, according to the iterative scheme of Step 3.2, when the maximum iteration number is reached or the termination condition is satisfied, at which point the state matrices are fixed or have converged, and then computing the final triple integrated similarity graph by [formula omitted in source].
  5. The label propagation algorithm based on triple integrated similarity according to claim 4, wherein Step 4 specifically comprises: Step 4.1, for the triple integrated similarity graph obtained in Step 3.3, setting a hyperparameter and converting every off-diagonal element of the triple integrated similarity graph according to [formula omitted in source]; Step 4.2, requiring, in order to satisfy the topological properties of the Hessian matrix, that every diagonal element of the triple integrated similarity graph and the hyperparameter satisfy [condition omitted in source]; Step 4.3, obtaining, from the transformation rules of Steps 4.1 and 4.2, a propagation matrix satisfying the topological properties of the Hessian matrix, as given by [formula omitted in source].
  6. The label propagation algorithm based on triple integrated similarity according to claim 5, wherein Step 5 specifically comprises: Step 5.1, from the one-hot encoded label matrix obtained in Step 1 and the propagation matrix obtained in Step 4.3, setting a learning rate and computing the soft label vector at each step by the following Fick's-law-assisted iterative propagation rule: [formula omitted in source]; Step 5.2, setting a maximum iteration number, a termination threshold, and an iteration termination condition, terminating the iteration when the iteration number reaches the maximum or the termination condition is satisfied, and recording the soft label vector at that point, wherein the predicted class of any sample is given by [formula omitted in source], i.e., the row index of the maximum entry of that sample's soft label vector.
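The pipeline described by the claims can be sketched end to end as follows. The patent's exact formulas are omitted from the source text, so this is a reconstruction under stated assumptions: the correlation shift to [0, 1], the definition of "local distance" via the k-th nearest neighbor, and an SNF-style cross-diffusion standing in for the patent's Step 3 iteration are all our choices, and every function name is ours.

```python
import numpy as np

def pairwise_dist(X):
    # Euclidean distance matrix between sample rows
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def distance_graph(X):
    # Step 1.2: Gaussian kernel on distances, bandwidth = median distance
    D = pairwise_dist(X)
    sigma = np.median(D[D > 0])
    return np.exp(-D ** 2 / (2 * sigma ** 2))

def correlation_graph(X):
    # Step 1.3: Pearson correlation between samples, shifted to [0, 1]
    # (the shift is our assumption; the patent's formula is omitted)
    return (np.corrcoef(X) + 1) / 2

def density_graph(X, k=2):
    # Step 1.4: Gaussian kernel on a "local distance" where each sample's
    # scale is its distance to the k-th nearest neighbor (our reading)
    D = pairwise_dist(X)
    local = np.sort(D, axis=1)[:, k]
    L = D / np.sqrt(np.outer(local, local))
    sigma = np.median(L[L > 0])
    return np.exp(-L ** 2 / (2 * sigma ** 2))

def row_normalize(W):
    return W / W.sum(axis=1, keepdims=True)

def knn_kernel(W, k=3):
    # Steps 2.2-2.3: keep each row's k strongest similarities, normalize
    S = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k]
    for i, nb in enumerate(idx):
        S[i, nb] = W[i, nb]
    return row_normalize(S)

def fuse(graphs, k=3, n_iter=20):
    # Step 3 stand-in: SNF-style cross-diffusion, each view smoothed by
    # the average of the other two, then averaged into one fused graph
    P = [row_normalize(W) for W in graphs]
    S = [knn_kernel(W, k) for W in graphs]
    for _ in range(n_iter):
        P = [row_normalize(S[v] @ (sum(P[u] for u in range(3) if u != v) / 2) @ S[v].T)
             for v in range(3)]
    return sum(P) / 3

def propagate(W, Y, alpha=0.9, n_iter=200):
    # Steps 4-5 stand-in: zero the diagonal, row-normalize into a
    # propagation matrix, iterate F <- alpha*P*F + (1-alpha)*Y, argmax
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = row_normalize(W)
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * P @ F + (1 - alpha) * Y
    return F.argmax(axis=1)

# Tiny demo: two clusters, one labeled sample per cluster
X = np.array([[0.0, 1.0, 0.0], [0.1, 1.1, 0.0], [0.0, 1.0, 0.1],
              [5.0, 6.0, 5.0], [5.1, 6.0, 5.0], [5.0, 6.0, 5.1]])
Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # sample 0 labeled class 0
Y[3, 1] = 1.0   # sample 3 labeled class 1
W = fuse([distance_graph(X), correlation_graph(X), density_graph(X)])
pred = propagate(W, Y)
```

Note how the K-nearest-neighbor sparsification does real work here: the dense views assign non-trivial weight across clusters, and it is the top-k truncation that concentrates diffusion within each cluster before fusion.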

Description

Label propagation algorithm based on triple integrated similarity

Technical Field

The invention relates to the technical field of data processing, and in particular to a label propagation algorithm based on triple integrated similarity.

Background

In many machine learning tasks, such as object recognition and biomedical classification, obtaining labeled samples is extremely costly and time-consuming. In recent years, semi-supervised classification learning, designed specifically for task scenarios where labeled samples are scarce, has received widespread attention. In semi-supervised classification tasks, the proportion of labeled samples in the data set is far lower than that of unlabeled samples, so a reasonable classifier cannot be constructed with traditional supervised methods. Semi-supervised classification learning must therefore rest on reasonable assumptions in order to assign the sparse label information to a large number of unlabeled samples. In general, mainstream semi-supervised classification methods are based on either the cluster assumption or the manifold assumption. Under the cluster assumption, related methods assume that samples within the same cluster share the same label and that decision boundaries between categories should lie in low-density regions of the sample space. Typical methods include transductive support vector machines, support tensor machines based on low-rank approximation, and local spline regression, among others. These methods typically impose specific constraints and criteria, resulting in non-convex or NP-hard learning objectives. The manifold assumption, on the other hand, holds that samples on the same manifold share the same label. Methods based on the manifold assumption typically use graphs to characterize the samples in a data set, where each sample is represented as a node and the similarities between nodes are given by edges and their corresponding weights.
These methods then exploit the similarities in the graph to construct a propagation matrix with iterative properties, supported by physical or statistical laws. After suitably enhancing the separability of the propagation matrix, labels propagate gradually and smoothly from the very few labeled samples to the large number of unlabeled samples, either via matrix inversion or via approximate iterative updates. This semi-supervised classification approach, based on similarity graph construction and propagation rules, is generally called label propagation. Existing label propagation methods include harmonic function algorithms, Fick's-law-assisted propagation algorithms, k-gram algorithms, graph transduction algorithms, and least-squares algorithms, among others. While existing manifold-assumption-based label propagation methods achieve high semi-supervised classification performance on certain test tasks, they in fact impose strict assumptions and limitations on the original distribution of the data, because the similarity graphs are generated from a single-perspective metric. For example, the Fick's-law-assisted propagation algorithm designates the Euclidean distance as the only metric for generating the initial similarity graph, making it sensitive to distance alone. The harmonic function algorithm generates the similarity graph from a Gaussian function; although the hyperparameters of the Gaussian function are tunable, it remains applicable only to data sets that follow a Gaussian distribution. In the real world, not all available data sets follow a Gaussian distribution or depend only on Euclidean distance. Local density, topological properties, and many other attributes also significantly affect the likelihood of label propagation between sample pairs.
Since the propagation matrix in manifold-assumption-based label propagation is obtained from the initial similarity graph through operations such as inversion, multiplication, and division, the nature of the similarity graph essentially determines the likelihood of subsequent propagation between sample pairs. Generating the initial similarity graph from only a single metric is one of the main reasons for the poor classification accuracy of existing label propagation methods on real data sets. A good label propagation method should therefore be concerned not only with the formulation of sophisticated propagation rules and their corresponding physical or statistical interpretation, but also with the learning of the similarity graph itself. In recent years, some researchers have made preliminary efforts to fill this gap in similarity graph learning. For example, robust adaptive embedded label propagation and auto-weighted label propagation based on robust triplet matrix recovery attempt to combine adaptive weight learning with label propagation. However, the adaptive weights do not change the initial properties of a given graph. In addition, the efficient regularized label propagation learning framework incorporates reg