CN-121542881-B - Data classification method, device and equipment based on anchor point guide clustering

Abstract

The application discloses a data classification method, device and equipment based on anchor point guided clustering, and relates to the technical field of electric digital data processing. The method comprises: obtaining an original data set to be classified, converting each sample into a numerical vector, and constructing a data matrix; initializing a clustering center matrix and an anchor point matrix, and setting a fuzzy coefficient; obtaining the optimized anchor point matrix and clustering center matrix through joint iterative optimization; calculating the fuzzy membership degree of each sample vector; taking the category with the maximum fuzzy membership degree as the final category label of the sample vector; and outputting the classification results of all sample vectors. The application addresses two problems of existing fuzzy clustering methods: their classification results are unstable and inaccurate because the methods are sensitive to initial conditions and prone to becoming trapped in suboptimal solutions, and they cannot be solved directly with a gradient descent algorithm, which makes them difficult to apply to large-scale data sets. The application thereby improves classification accuracy on complex data and applicability across data scales.

Inventors

CAI YINGJIE
YANG HUI
ZHU JIANYONG
NIE FEIPING

Assignees

East China Jiaotong University

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-19

Claims (6)

1. A data classification method based on anchor point guided clustering, characterized by comprising the following steps:
acquiring an original data set to be classified, and converting each sample data in the original data set into a numerical vector;
arranging the numerical vectors of all samples to construct a data matrix;
initializing a clustering center matrix according to a preset clustering category number, initializing an anchor matrix according to a preset anchor number, and setting a fuzzy coefficient;
performing preset joint iterative optimization based on the data matrix, the anchor matrix and the clustering center matrix to obtain an optimized anchor matrix and an optimized clustering center matrix;
calculating the fuzzy membership degree of each sample vector in the data matrix to each clustering center according to the optimized clustering center matrix;
determining the category with the maximum fuzzy membership value corresponding to each sample vector in the data matrix, taking that category as the final category label of the sample vector, and outputting the final category labels of all the sample vectors as the classification result;
wherein the preset joint iterative optimization comprises a plurality of iterations, each iteration executing the following steps until convergence:
calculating a first distance measure between the data matrix and the anchor matrix, a second distance measure between the anchor matrix and the clustering center matrix, and a third distance measure between the data matrix and the clustering center matrix;
determining a gradient of the anchor matrix according to the first distance measure and the second distance measure;
determining a gradient of the clustering center matrix according to the second distance measure and the third distance measure;
updating the numerical values stored in the clustering center matrix according to the gradient of the clustering center matrix to obtain the optimized clustering center matrix;
the first distance measure, the second distance measure and the third distance measure jointly form a joint optimization target, wherein the third distance measure forms the leading term of the joint optimization target, and the first distance measure and the second distance measure form auxiliary terms for guiding the optimization process;
the fuzzy membership degree is calculated by the following formula: [formula not reproduced in source], wherein c denotes the clustering category number, r denotes the fuzzy coefficient, x_i denotes the i-th sample vector in the data matrix X, m_j and m_k denote clustering centers in the clustering center matrix M, and j and k are index variables;
the first distance measure, the second distance measure and the third distance measure are optimized through a joint optimization objective function L, the joint optimization objective function L being: [formula not reproduced in source], wherein Z denotes the anchor matrix, z_j denotes the j-th anchor vector in the anchor matrix Z, and n is the number of samples;
determining the gradient of the anchor matrix and updating the numerical values stored in the anchor matrix, and determining the gradient of the clustering center matrix and updating the numerical values stored in the clustering center matrix, are all implemented by a gradient descent method, wherein the gradient of the anchor matrix and the gradient of the clustering center matrix are both calculated from the joint optimization objective function L;
the update formula for the anchors in the anchor matrix is: [formula not reproduced in source]; the update formula for the clustering centers in the clustering center matrix is: [formula not reproduced in source]; wherein the update formulas use a learning rate, and t denotes the t-th iteration.
2. The data classification method based on anchor point guided clustering according to claim 1, wherein the value of the fuzzy coefficient is greater than 1.
3. An apparatus for performing the data classification method based on anchor point guided clustering of claim 1 or 2, comprising:
a data acquisition and construction module configured to acquire an original data set to be classified, convert each sample data in the original data set into a numerical vector, and arrange the numerical vectors of all samples to construct a data matrix;
a parameter initialization module configured to initialize a clustering center matrix according to a preset clustering category number, initialize an anchor matrix according to a preset anchor number, and set a fuzzy coefficient;
a joint iterative optimization module configured to perform preset joint iterative optimization based on the data matrix, the anchor matrix and the clustering center matrix to obtain an optimized anchor matrix and an optimized clustering center matrix;
a fuzzy membership calculation module configured to calculate the fuzzy membership degree of each sample vector in the data matrix to each clustering center according to the optimized clustering center matrix;
a classification decision and output module configured to determine the category with the maximum fuzzy membership value corresponding to each sample vector in the data matrix, take that category as the final category label of the sample vector, and output the final category labels of all the sample vectors as the classification result;
wherein the joint iterative optimization module is further configured to perform a plurality of iterations until convergence, and comprises:
a distance calculation unit configured to calculate a first distance measure between the data matrix and the anchor matrix, a second distance measure between the anchor matrix and the clustering center matrix, and a third distance measure between the data matrix and the clustering center matrix;
a gradient determination unit configured to determine a gradient of the anchor matrix according to the first distance measure and the second distance measure;
a value updating unit configured to update the numerical values stored in the anchor matrix according to the gradient of the anchor matrix, and to update the numerical values stored in the clustering center matrix according to the gradient of the clustering center matrix to obtain the optimized clustering center matrix.
4. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the data classification method based on anchor point guided clustering of any one of claims 1-2.
5. A computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data classification method based on anchor point guided clustering of any one of claims 1-2.
6. A computer program product comprising a computer program which, when executed by a processor, implements the data classification method based on anchor point guided clustering of any one of claims 1-2.
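
The membership and labeling steps recited in claim 1 can be illustrated with a short NumPy sketch. The claim text above does not reproduce the membership formula, so the sketch assumes the standard fuzzy C-means form u_ij = 1 / sum_k (||x_i - m_j|| / ||x_i - m_k||)^(2/(r-1)), which uses the same symbols the claim defines (c, r, x_i, m_j, m_k). The function names `fuzzy_membership` and `classify`, the small constant `eps`, and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def fuzzy_membership(X, M, r=2.0, eps=1e-12):
    """Return an (n, c) membership matrix whose rows sum to 1.

    X: (n, d) data matrix, M: (c, d) clustering center matrix,
    r: fuzzy coefficient, required to be > 1 (claim 2).
    ASSUMPTION: standard fuzzy C-means membership formula.
    """
    if r <= 1:
        raise ValueError("fuzzy coefficient r must be greater than 1")
    # Squared Euclidean distance from every sample to every center;
    # eps avoids division by zero when a sample coincides with a center.
    d2 = np.sum((X[:, None, :] - M[None, :, :]) ** 2, axis=2) + eps
    # u_ij is proportional to d2_ij ** (-1 / (r - 1)); normalize each row.
    w = d2 ** (-1.0 / (r - 1.0))
    return w / w.sum(axis=1, keepdims=True)

def classify(X, M, r=2.0):
    """Label each sample with the index of its largest membership."""
    return np.argmax(fuzzy_membership(X, M, r), axis=1)

# Toy example (made-up data): four samples, two centers on a line.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [0.9, 0.0]])
M = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = classify(X, M)  # -> array([0, 0, 1, 1])
```

Writing the membership as a row-normalized power of the squared distance is algebraically the same as the ratio form quoted above, and it avoids an explicit double loop over j and k.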

Description

Data classification method, device and equipment based on anchor point guide clustering

Technical Field

The application relates to the technical field of electric digital data processing, in particular to a data classification method, device and equipment based on anchor point guided clustering.

Background

In many technical scenarios involving automated analysis of unlabeled data, such as image content recognition, industrial sensor data classification or biometric classification, cluster analysis is a key data processing technique. Fuzzy C-means clustering, a technical scheme that can output probability values of samples belonging to the various categories, is widely applied in these scenarios because it handles data with unclear category boundaries well. Fuzzy C-means clustering is a distance-based soft clustering algorithm: it uses fuzzy memberships to assign data points to a plurality of clusters, with each membership representing the probability that a point belongs to a cluster, and finally produces the classified output of the data.

However, in practical application, the classification results output by existing fuzzy C-means clustering techniques are found to be of unstable quality. The data processing is extremely sensitive to the initial conditions of calculation parameters such as the clustering centers, so different initial settings can lead to markedly different classification results; it is also extremely sensitive to noise and outliers, which further affects the stability of data classification.
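
For reference, the classical fuzzy C-means formulation that this background refers to is usually written as below. This is the textbook form, not a formula taken from this application. The symbols follow claim 1 (n samples x_i, clustering centers m_k, c categories, fuzzy coefficient r > 1); the membership symbol u_ik is notation assumed here.

```latex
\begin{aligned}
\min_{U,\,M}\; J(U, M) &= \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{\,r}\, \lVert x_i - m_k \rVert^{2}
\quad \text{s.t.} \quad \sum_{k=1}^{c} u_{ik} = 1,\;\; u_{ik} \ge 0, \\[4pt]
u_{ik} &= \left[ \sum_{l=1}^{c} \left( \frac{\lVert x_i - m_k \rVert}{\lVert x_i - m_l \rVert} \right)^{\frac{2}{r-1}} \right]^{-1},
\qquad
m_k = \frac{\sum_{i=1}^{n} u_{ik}^{\,r}\, x_i}{\sum_{i=1}^{n} u_{ik}^{\,r}} .
\end{aligned}
```

The two update rules are applied alternately starting from an initial choice of centers. The alternation is only guaranteed to reach a stationary point of J, which is why different initial centers can end in different classification results.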
In addition, the data processing flow in the prior art can cause the model to converge to a suboptimal local minimum during optimization, and the model's optimization trajectory tends to become fixed prematurely, which limits its ability to obtain more accurate classification results on complex data. In short, the prior-art data processing flow is sensitive to initial conditions and prone to falling into suboptimal solutions, so the final data classification result is inaccurate and unstable.

Disclosure of Invention

The application aims to provide a data classification method, device and equipment based on anchor point guided clustering. By jointly optimizing over a data matrix, an anchor point matrix and a clustering center matrix, the optimization variables are changed from a huge membership matrix to an anchor point matrix and a clustering center matrix of much smaller scale. This solves the technical problems that existing fuzzy clustering methods give unstable and inaccurate classification results because they are sensitive to initial conditions and prone to falling into suboptimal solutions, and that they cannot be solved directly with a gradient descent algorithm and are therefore difficult to apply to large-scale data sets. The application thereby improves classification accuracy on complex data and applicability across data scales.
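
The idea of optimizing only the anchor point matrix and the clustering center matrix by gradient descent can be sketched as follows. The text here does not reproduce the joint objective function L, so the sketch assumes one hypothetical form: each of the three distance measures named in claim 1 (data to centers, data to anchors, anchors to centers) is turned into a fuzzy C-means cost with the memberships eliminated in closed form, and the three costs are summed with weights. This is an illustrative stand-in, not the application's formula. The weights `alpha` and `beta`, the learning rate `lr`, the fixed iteration count used in place of a convergence test, and all function names are assumptions.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)

def soft_weights(A, B, r, eps=1e-12):
    """FCM-style memberships of rows of A to rows of B, raised to the power r."""
    w = (sq_dists(A, B) + eps) ** (-1.0 / (r - 1.0))
    return (w / w.sum(axis=1, keepdims=True)) ** r

def term(A, B, r, eps=1e-12):
    """Membership-free FCM cost of points A against prototypes B (mean over points)."""
    d = sq_dists(A, B) + eps
    return np.mean(np.sum(d ** (-1.0 / (r - 1.0)), axis=1) ** (-(r - 1.0)))

def grad_prototypes(A, B, r):
    """Gradient of term(A, B, r) with respect to the prototypes B."""
    W = soft_weights(A, B, r)
    return 2.0 * (W.sum(axis=0)[:, None] * B - W.T @ A) / A.shape[0]

def grad_points(A, B, r):
    """Gradient of term(A, B, r) with respect to the points A."""
    W = soft_weights(A, B, r)
    return 2.0 * (W.sum(axis=1)[:, None] * A - W @ B) / A.shape[0]

def objective(X, Z, M, r, alpha, beta):
    """ASSUMED L: leading data-center term plus two auxiliary anchor terms."""
    return term(X, M, r) + alpha * term(X, Z, r) + beta * term(Z, M, r)

def joint_descent(X, Z, M, r=2.0, alpha=0.5, beta=0.5, lr=0.05, iters=200):
    """Plain gradient descent on the anchors Z and the centers M together."""
    Z, M = Z.copy(), M.copy()
    for _ in range(iters):
        # Anchor gradient: data-anchor term and anchor-center term.
        g_Z = alpha * grad_prototypes(X, Z, r) + beta * grad_points(Z, M, r)
        # Center gradient: anchor-center term and data-center term.
        g_M = beta * grad_prototypes(Z, M, r) + grad_prototypes(X, M, r)
        Z, M = Z - lr * g_Z, M - lr * g_M
    return Z, M
```

With n samples of dimension d, a anchors and c centers, the variables updated per iteration are the a x d and c x d entries of Z and M rather than the n x c entries of a membership matrix, which is the reduction in scale the paragraph above describes. In this assumed form the anchor gradient depends on the first and second distance measures and the center gradient on the second and third, mirroring the structure of claim 1.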
In order to achieve the above object, the present application provides the following solutions:

According to a first aspect, the application provides a data classification method based on anchor point guided clustering, which comprises: obtaining an original data set to be classified, and converting each sample data in the original data set into a numerical vector; arranging the numerical vectors of all samples into a data matrix; initializing a clustering center matrix according to a preset clustering category number, initializing an anchor point matrix according to a preset anchor point number, and setting a fuzzy coefficient; performing preset joint iterative optimization based on the data matrix, the anchor point matrix and the clustering center matrix to obtain an optimized anchor point matrix and an optimized clustering center matrix; calculating the fuzzy membership degree of each sample vector in the data matrix to each clustering center according to the optimized clustering center matrix; determining the category with the largest fuzzy membership value corresponding to each sample vector in the data matrix as the final category label of the sample vector, and outputting the final category labels of all sample vectors as the classification result; wherein the joint iterative optimization comprises a plurality of iterations executed until convergence, each iteration comprising: calculating a first distance measure between the data matrix and the anchor point matrix, a second distance measure between the anchor point matrix and the clustering center matrix, and a third distance measure between the data matrix and the clustering center matrix; determining a gradient of the anchor point matrix according to the first distance measure and the second distance measure; determining a gradient of the clustering center matrix according to the second distance measure and the third distance measure; updati